Google crawlers have become so smart that nowadays we don’t even need to submit a URL for Googlebot to crawl it. A page can get crawled by Google just by being loaded in Google Chrome (Chrome is sometimes considered one of Google’s crawlers). So before coming to the point, let’s take a deeper look into what crawlers are. You can skip the basics by clicking here.
What Are Crawlers:
Crawlers, also known as web bots, spiders or automatic indexers, are computer programs that read everything they find on the world wide web. Crawlers are used by search engines and other websites to gather information about other websites’ content. The process of visiting and reading a web page is known as crawling, during which a copy of the web page is stored in the crawler’s database.
When it comes to online searches, the results displayed come from the crawled list of web pages within the search engine’s database. This allows immediate results for search queries rather than searching the entire web in real time. Crawlers follow hyperlinks in the HTML code, which lets them find other websites and web pages to add to their database.
When a search is done, the pages whose content best matches the search query are displayed at the top. Of course, the ranking position of a web page in the search results depends on several other factors as well.
How Googlebot Behaves:
Googlebot is algorithmically programmed to determine whether to crawl a site, how many pages to crawl and how often. Googlebot frequently crawls the web to update and build Google’s index, and several factors determine the crawl frequency for an individual website. Googlebot is also getting smarter and quicker day after day; it is currently capable of indexing an orphan website simply because it was loaded in Chrome. But that should not be taken for granted.
As the crawler passes through pages and reads the HTML code and anchor texts, it determines the next pages to crawl by following certain criteria. Adding a “nofollow” attribute to a hyperlink tells the crawler not to follow that link and crawl the linked page, but it is not a strict command. On good authoritative sites, almost all the links pointing to other sites are nofollow, and that doesn’t mean web crawlers never reach them. In practice, adding nofollow to a link mainly reduces the weight that link passes, while most crawlers, especially Googlebot, may still discover and crawl the page that link points to.
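For reference, the nofollow hint goes in the link’s rel attribute (the URL and anchor text below are placeholders):

```
<a href="https://example.com/some-page" rel="nofollow">Anchor text</a>
```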
When it comes to complex pages with dynamic content, the crawler may not see the page the way a user sees it. Flash content, animations, on-site search results, forms and other dynamic resources may not be identified by crawlers the way they are by a user. But recent progress has made Googlebot much smarter, so that it can even read Flash content and other dynamic resources. Let us look deeper into that in a later section.
1. Sitemaps:
Sitemaps help crawlers easily identify all the pages to be crawled. So build a sitemap for crawlers with the help of any sitemap generator. Along with a sitemap for crawlers, it is also good to build one for users; Google recommends building one for search engines and one for users. After generating the sitemap, don’t forget to reference it in the robots.txt file and submit it in Search Console.
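Referencing the sitemap in robots.txt is a single line (the URL below is a placeholder for your own sitemap location):

```
Sitemap: https://www.example.com/sitemap.xml
```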
For websites with a large number of URLs, say more than 50,000, Google says to break the sitemap into multiple sitemaps. Google doesn’t support sitemaps with more than 50,000 URLs or larger than 10 MB. You can check out Nick Eubanks’s case study here, where he made some tweaks to the sitemap of a website that improved its index rate.
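As a minimal sketch of how the splitting could be automated, the function below chunks a URL list into sitemaps of at most 50,000 URLs each and builds a sitemap index that references them (file names and the base domain are illustrative assumptions):

```python
# Sketch: split a large URL list into sitemaps of at most
# 50,000 URLs each, plus a sitemap index referencing them.

MAX_URLS_PER_SITEMAP = 50_000

def build_sitemaps(urls, base="https://www.example.com"):
    """Return (sitemap_index_xml, [sitemap_xml, ...]) for the given URLs."""
    # Chunk the URL list to stay under Google's per-sitemap limit.
    chunks = [urls[i:i + MAX_URLS_PER_SITEMAP]
              for i in range(0, len(urls), MAX_URLS_PER_SITEMAP)]
    sitemaps = []
    for chunk in chunks:
        entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in chunk)
        sitemaps.append(
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n</urlset>"
        )
    # The index file lists every child sitemap; this is what you
    # would reference in robots.txt and submit in Search Console.
    index_entries = "\n".join(
        f"  <sitemap><loc>{base}/sitemap-{n}.xml</loc></sitemap>"
        for n in range(1, len(sitemaps) + 1)
    )
    index = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{index_entries}\n</sitemapindex>"
    )
    return index, sitemaps
```

In a real setup you would also escape special characters in URLs and write each file to disk, but the chunking and index structure are the key points.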
2. Good Quality Backlinks:
The number and quality of links to your site can directly influence the crawl frequency of your website. Get as many good-quality links pointing to your website as possible to improve the crawl rate, and of course good links can also lead to good rankings in the SERPs.
3. URL Parameters:
When it comes to traffic from sites like Feedly, FeedBurner, Facebook, etc., the URL can include parameters as shown below:
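For example, a link shared through such services often arrives with tracking parameters appended (the URL below is purely illustrative):

```
https://www.example.com/blog/post?utm_source=feedburner&utm_medium=feed&utm_campaign=blog-feed
```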
For e-commerce sites too, there can be different parameters in URLs when sorting a list of items, like sort=relevancy. You can find all such URL parameters in Search Console:
Here we have the option to add more parameters and define how each parameter affects the page content. We can also tell the crawler to avoid certain parameters that cause content duplication:
This video from Google clearly explains what URL parameters are, how they work and how they affect a website’s crawling:
4. Good & Engaging Content:
Developing good content regularly and frequently helps search engines identify that the website is fresh and being updated regularly. Such an engaging site will get more attention from crawlers. Developing good and useful content also brings numerous other benefits for a website, such as more traffic from relevant users and improved user engagement.
5. Site Loading Speed:
Website loading speed can have an impact on how search engine crawlers behave on your website. Determine your site’s loading speed using this tool and find out whether there are any areas to improve.
6. Social Engagement:
Social pages are a good way to drive engagement from social media, which is one of the biggest referral sources. Create social pages for your website on popular social media sites like Facebook, Twitter, Google+ and LinkedIn, and share your page as soon as you launch it. This will help your page get indexed immediately. We know how quickly tweets get indexed by Google. Sharing pages in other communities like inbound.org (for the internet marketing field) will also help your page get noticed by relevant users.
Get Your Website’s Crawling Statistics:
Google Search Console shows you the crawl statistics of your website. From the graph we can identify whether the crawl rate is in good condition or not. If there is an upward trend, don’t worry; otherwise, read this blog again and find out the areas you should give attention to.
When you see a spike, recall whether you made any changes to your website, or check whether Google updated its algorithm. From this graph you will get a clear idea of how your website is being crawled by Googlebot.
Some Useful Facts Regarding Crawling:
Here are some important facts worth keeping in mind for future use:
- You can adjust the speed of an incoming crawler with the help of the robots.txt file as shown below:
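The directive in question is Crawl-delay; a minimal robots.txt sketch looks like this (note that support varies by crawler):

```
User-agent: *
Crawl-delay: 1
```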
The above directive asks all crawlers to wait 1 second between requests. Keep in mind that not every crawler honors it: Bing and Yandex do, but Googlebot ignores Crawl-delay, so for Google use the crawl rate setting in Search Console instead.
- Go to Index Status in Search Console to see the number of indexed pages, or just type site:yourdomain.com into Google’s search field and see the count. Keeping track of it on a monthly basis can give you an idea of your website’s index status.
- In Search Console, there is an option to limit the crawl rate of your website. If your site is hosted on a server with limited bandwidth, try limiting the crawl rate; otherwise just leave it so that Google will automatically optimize the crawl rate for the site.
- When an existing web page is updated with new content, get it re-crawled by Google by following this method.
Can Crawlers Read Dynamic Content in Ajax sites?
When a crawler requests a page’s content, the website’s server returns an HTML snapshot of the URL, and the crawler processes that snapshot to extract the content and the URLs. If a web page has dynamic content loading through AJAX, meaning there is hidden content that displays only when a user clicks a particular link, the crawler may not be able to identify or index that content, as it will not be included within the HTML snapshot.
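As a sketch of the problem, consider a page whose initial HTML contains no article text at all; the content only appears after a script runs in the browser (the endpoint and element ID below are illustrative):

```html
<!-- Initial HTML the crawler receives: the container is empty. -->
<div id="article"></div>
<script>
  // Content is fetched and injected only after the page loads,
  // so it is absent from the HTML snapshot the crawler processed.
  fetch('/api/article/42')
    .then(function (res) { return res.text(); })
    .then(function (html) {
      document.getElementById('article').innerHTML = html;
    });
</script>
```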
“We are no longer recommending the AJAX crawling proposal we made back in 2009.”
Mark Munroe has written a detailed blog on Search Engine Land regarding Google’s new AJAX recommendations and what we should do.
Still Not Getting Crawled:
Unless you have blocked crawlers manually, there is hardly any other reason for your website not to be crawled by Google. Here are some of the tools/methods that can end up blocking a crawler:
- Using robots.txt file
- Using .htaccess
- Meta tags within a page
- Wrong URL parameter settings in Search Console
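For reference, these are the kinds of directives to look for in each place (paths, domains and rules below are placeholders, not recommendations):

```
# robots.txt — blocks all crawlers from the whole site
User-agent: *
Disallow: /

# Meta tag within a page's <head> — blocks indexing of that page
<meta name="robots" content="noindex, nofollow">

# .htaccess (Apache) — refuses requests whose User-Agent contains "Googlebot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F,L]
```

If any of these appear unintentionally, removing them is the first step before investigating anything else.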
Check all these and make sure you haven’t blocked the crawlers in any way. If you are still getting de-indexed by Google, then the likely possibility is that your website is under a penalty. Check your backlink profile, clean it up and submit a reconsideration request.
Am I wrong? Do you have anything more to add? Please share your thoughts and suggestions.