Every website has its own rules for how web crawlers may crawl it. Well, some of them might not! Let us peek into the robots.txt files of some of the internet's rulers and see if there are any hidden treasures.
First, let us look at the robots.txt file of the search king, Google. It is a very neat file. There isn't anything special in it, and it clearly defines all allowed and disallowed pages.
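To see how a crawler actually reads such allowed and disallowed pages, here is a minimal sketch using Python's standard urllib.robotparser module. The sample rules, the example.com URLs, and the crawler name "MyCrawler" are made up for illustration; they are not copied from Google's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written in the style of a typical robots.txt file.
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /search/about
Disallow: /search
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# Ask whether our (hypothetical) crawler may fetch a few URLs.
for url in (
    "https://example.com/",
    "https://example.com/search",
    "https://example.com/search/about",
    "https://example.com/private/data",
):
    verdict = "allowed" if parser.can_fetch("MyCrawler", url) else "disallowed"
    print(url, "->", verdict)
```

One thing to keep in mind: Python's parser applies the first rule that matches, in file order, which is why the sample lists the Allow line before the broader Disallow line. Other crawlers may resolve such conflicts differently.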
Facebook has a very strict crawling policy. One needs written permission from Facebook to crawl their site. Here is the story of a software engineer who got sued by Facebook for crawling their website.
Twitter has defined crawl rules for several web bots. Interestingly, they have provided a sitemap that shows “Access Denied”!
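Per-bot rules and a sitemap entry like the ones described above can be explored with the same standard-library parser. This is only a sketch: the bot names, paths, and sitemap URL below are invented for the example and are not Twitter's actual rules, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a separate rule group for each bot.
SAMPLE_ROBOTS = """\
User-agent: Googlebot
Allow: /hashtag/
Disallow: /search

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)

# The same URL can get a different answer depending on who is asking.
url = "https://example.com/search"
for bot in ("Googlebot", "BadBot", "SomeOtherBot"):
    print(bot, "may fetch", url, "->", rules.can_fetch(bot, url))

# Sitemap lines apply to the whole file, not to a single bot.
print("Sitemaps:", rules.site_maps())
```

Note that the parser only reports where the sitemap is supposed to live. Whether that URL actually serves a sitemap or an error page is a separate request.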
The robots.txt of Wikipedia is, quite literally, just like all its other pages. It is serious, and it is one of the most detailed robots.txt files we have ever seen. From it we can identify several web crawlers that we had never heard of. The language is so polite that it apologizes to some crawlers for blocking them. Screenshots of some interesting parts are shown below:
Pinterest & TripAdvisor have listed their job openings for SEOs in their robots.txt files. So if anyone is interested, go to their robots.txt, get all the details, and apply now! It is fairly common for companies to put job openings in their website's source code too.
Towards the bottom of StumbleUpon's robots.txt file, you can see something crazy:
The guy who made the robots.txt is a fan of Dr. Zoidberg 😀
Yelp needs all robots to obey Asimov's Three Laws, which are displayed in their robots.txt file:
1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
The White House:
Examiner.com:
Do you have anything more interesting than these? Share it in the comments and we will include it in the blog!
N.B.: There might be some other purpose behind this blog post. Hope at least some of you will find it out!