Crawl Rules Of Famous Websites

Most websites have their own rules for how web crawlers may crawl them, though some have none at all! Let us peek into the robots.txt files of some of the internet's rulers and see if there are any hidden treasures.

First, let us look at the robots.txt file of the search king, Google. It is a very neat file. There is nothing unusual in it, and it clearly defines all allowed and disallowed pages.
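
To make "allowed and disallowed pages" concrete, here is a minimal sketch of how a crawler could check a URL against such rules with Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS`, the bot name `ExampleBot`, and the `example.com` URLs are made up for illustration and are not taken from Google's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /search/about
Disallow: /search
Disallow: /private/
"""


def is_allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/search/about", "/search/results", "/index.html"):
        url = "https://example.com" + path
        print(path, is_allowed(SAMPLE_ROBOTS, "ExampleBot", url))
```

Python's parser applies the first rule that matches a path, which is why the `Allow` line is placed before the broader `Disallow` line in this sample. Other crawlers may resolve overlapping rules differently.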

Facebook has a very strict crawling policy. Crawling the site requires written permission from Facebook. One software engineer has told the story of how he got sued by Facebook for crawling their website.

Twitter has defined crawl rules for several web bots. Interestingly, it lists a sitemap that shows “Access Denied”!
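
As a rough sketch of what per-bot rules plus a sitemap entry look like to a crawler, the snippet below parses a small robots.txt and reports which bots may fetch a URL and which sitemaps are listed. The bot names `FriendlyBot`, `BadBot` and `OtherBot`, the rules, and the URLs are assumptions for illustration and are not Twitter's real rules. `site_maps()` is available from Python 3.8.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical per-bot rules and sitemap entry, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: FriendlyBot
Allow: /

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""


def summarize(robots_text, bots, url):
    """Return ({bot: may_fetch}, [sitemap URLs]) for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    access = {bot: parser.can_fetch(bot, url) for bot in bots}
    # site_maps() returns None when no Sitemap line is present.
    sitemaps = parser.site_maps() or []
    return access, sitemaps


if __name__ == "__main__":
    access, sitemaps = summarize(
        SAMPLE_ROBOTS,
        ["FriendlyBot", "BadBot", "OtherBot"],
        "https://example.com/private/report",
    )
    print(access)
    print(sitemaps)
```

Note that the parser only reports the sitemap URL. Whether that URL actually serves a sitemap or an error page has to be checked with a separate request.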

Wikipedia's robots.txt is, quite literally, just like all its other pages: serious, and one of the most detailed robots.txt files we have ever seen. From it you can identify several web crawlers you may never have heard of. The language is so polite that it even apologizes to some of the crawlers it blocks. Screenshots of some interesting parts are shown below:

[Screenshots: excerpts from Wikipedia's robots.txt]
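
If you want to pull the list of named crawlers out of a long robots.txt file yourself, a small script is enough. The sketch below collects every `User-agent` value in order of first appearance. The sample text and the bot names `SlurpyBot` and `GreedyFetcher` are invented for illustration and do not come from Wikipedia's file.

```python
# Hypothetical excerpt in the style of a detailed robots.txt file.
SAMPLE_ROBOTS = """\
# Sorry, this crawler has been misbehaving.
User-agent: SlurpyBot
Disallow: /

user-agent: GreedyFetcher   # field names are case-insensitive
Disallow: /

User-agent: *
Disallow: /private/
"""


def list_user_agents(robots_text):
    """Return the distinct User-agent values in order of first appearance."""
    agents = []
    for raw in robots_text.splitlines():
        # Drop anything after '#', which starts a comment in robots.txt.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        name = value.strip()
        if field.strip().lower() == "user-agent" and name and name not in agents:
            agents.append(name)
    return agents


if __name__ == "__main__":
    print(list_user_agents(SAMPLE_ROBOTS))
```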

Pinterest and TripAdvisor list their job openings for SEOs in their robots.txt files. So if anyone is interested, go to their robots.txt files, get the details and apply now! Companies sometimes post job openings in their website's source code too.
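
Notes like these live in comment lines, which start with `#` and are ignored by crawlers. A minimal sketch for pulling every comment out of a robots.txt file is shown below. The sample text, including the hiring note, is invented for illustration and is not quoted from Pinterest or TripAdvisor.

```python
# Hypothetical robots.txt content with a hiring note hidden in a comment.
SAMPLE_ROBOTS = """\
# Hypothetical hiring note: we are looking for SEO engineers!
User-agent: *
Disallow: /admin/  # keep crawlers out of admin pages
"""


def extract_comments(robots_text):
    """Return the text of every non-empty '#' comment, in file order."""
    comments = []
    for raw in robots_text.splitlines():
        # partition() splits at the first '#'; sep is empty if there is none.
        _, sep, comment = raw.partition("#")
        text = comment.strip()
        if sep and text:
            comments.append(text)
    return comments


if __name__ == "__main__":
    for comment in extract_comments(SAMPLE_ROBOTS):
        print(comment)
```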

Towards the bottom of StumbleUpon's robots.txt file, you can see something crazy:

[Screenshot: the bottom of StumbleUpon's robots.txt]

The guy who wrote the robots.txt file is a fan of Dr. Zoidberg 😀

Yelp wants all robots to obey Asimov's Three Laws, which are displayed in its robots.txt file:

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

The White House website and Examiner.com simply use the template robots.txt file that ships with Drupal websites. Check out the two files below:

[Screenshots: robots.txt files of The White House and Examiner.com]

Do you have anything more interesting than these? Share it in the comments and we will include it in the blog!

N.B.: There might be some other purpose behind this blog. Hope at least some of you will find it out!!!

Renz Joe David plunged into the world of Digital Marketing in 2013 and has been passionate about exploring new areas in SEO ever since. His interests lie mainly in Link Building, Lead Generation, Social Media, Reputation Management, Analytics, Penalty Abatement and Marketing Tools. Outside of work, he likes technology and driving.