In a Webmaster Central Blog post on January 16, 2017, Google acknowledged that it has no official “crawl budget definition”, indicating that a single term would not suffice. The post went on to clarify what Google actually has and what that means for Googlebot.
In this article, we will explore what Google has told us about site crawl budget and how it operates.
What is Crawl Budget?
When SEOs talk about crawl budget, we are referring to the resources Google will allocate to crawling (or discovering) the pages on your website. The budget could be determined by the number of pages and/or the time Google will spend crawling. That’s right: Google limits how long it spends crawling your website and how many of its pages it crawls!
Should I Be Concerned About My Site Crawl Budget?
If your new pages tend to be crawled the same day they are published, crawl budget is not a high-priority concern. The same is generally true for websites with fewer than a few thousand URLs. As long as nothing is blocking Google from the site, there shouldn’t be issues crawling these pages.
Google stresses that website crawl budgets are more important for larger sites.
What are Google’s Limits to Site Crawl Rates?
Crawling is Googlebot’s main priority, but Google says it is designed not to degrade the experience of users visiting the site. To that end, it applies a crawl rate limit to each site, defined as the number of simultaneous connections Googlebot may use to crawl the site and the wait time between fetches.
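To get a feel for how those two settings bound crawling, here is a rough back-of-the-envelope sketch in Python. The function name and the example numbers are our own assumptions for illustration; Google does not publish the actual values it uses for any site.

```python
def max_fetches_per_day(connections: int, fetch_seconds: float, wait_seconds: float) -> int:
    """Rough upper bound on fetches per day under a given crawl rate limit.

    connections   -- simultaneous connections the crawler may use
    fetch_seconds -- average time the server takes to return one page (assumed)
    wait_seconds  -- wait time the crawler leaves between fetches on a connection
    """
    if connections < 1 or fetch_seconds <= 0 or wait_seconds < 0:
        raise ValueError("connections must be >= 1, fetch_seconds > 0, wait_seconds >= 0")
    seconds_per_fetch = fetch_seconds + wait_seconds
    fetches_per_connection = 86_400 / seconds_per_fetch  # 86,400 seconds in a day
    return int(connections * fetches_per_connection)


if __name__ == "__main__":
    # Hypothetical numbers: 2 connections, 0.5 s per page, 1.5 s wait between fetches.
    print(max_fetches_per_day(2, 0.5, 1.5))
```

The point of the arithmetic is simply that fewer connections, slower pages, or longer waits all shrink the number of pages that can be fetched in a day.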
What is Crawl Health?
When a website responds quickly to Googlebot’s requests, the crawl limit goes up: Googlebot can use more connections, which lets Google crawl and discover more pages. The more pages Google crawls, the more it can index and rank.
For slower sites or those with server errors, the crawl limit is throttled back and fewer pages are crawled, as Googlebot has to wait longer to crawl each page.
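The behaviour described above can be pictured with a toy model. This is only an illustrative sketch, not Google’s algorithm: the class name, the one-second “fast” threshold, and the step sizes are all assumptions we made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CrawlLimit:
    """Toy crawl limit that reacts to how healthy the server looks."""

    connections: int = 2        # simultaneous connections (hypothetical start value)
    wait_seconds: float = 2.0   # wait between fetches (hypothetical start value)
    max_connections: int = 10   # hypothetical ceiling

    def record(self, status: int, response_seconds: float) -> None:
        # Assumption: "healthy" means no server error and a reply within one second.
        healthy = status < 500 and response_seconds < 1.0
        if healthy:
            # Fast, error-free responses: allow more connections and shorter waits.
            self.connections = min(self.connections + 1, self.max_connections)
            self.wait_seconds = max(self.wait_seconds / 2, 0.25)
        else:
            # Slow responses or server errors: throttle back.
            self.connections = max(self.connections // 2, 1)
            self.wait_seconds = min(self.wait_seconds * 2, 60.0)


if __name__ == "__main__":
    limit = CrawlLimit()
    for status, seconds in [(200, 0.2), (200, 0.3), (503, 0.1)]:
        limit.record(status, seconds)
        print(status, seconds, limit.connections, limit.wait_seconds)
```

However the real thresholds are set, the practical takeaway is the same: a fast, error-free server earns more crawling.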
Google also reminds website owners that they can manually set crawl limits in Google Search Console, but that setting a higher limit does not automatically increase crawling of the site.
How Frequently are Crawl Demands Made?
Even when the crawl rate limit hasn’t been reached, there will be little Googlebot crawl activity if there is little demand from indexing.
More popular URLs are usually crawled more consistently to keep Google’s index up-to-date, as another objective for the crawler is to prevent URLs from becoming stale.
Crawls can also be triggered if a website is moved and new URLs need to be re-indexed.
Ultimately, crawl rate and crawl demand together are what define a crawl budget in Google’s eyes: the number of URLs Googlebot can and wants to crawl.
6 Ways to Improve Your Website Crawl Budget According to Google
Google says that a site’s crawling and indexing are negatively affected when it has numerous low-value-add URLs, and it groups these URLs into 6 main categories, listed in order of importance.
If your site is large enough for crawl budget to matter, or if you want to keep crawl budget from becoming a concern, address the following prioritized issues, as applicable.
1. Address URLs with naturally duplicated content.
Faceted navigation (the ability to filter pages by price, colour, size, etc.) affects crawl budget because it generates many URL combinations with duplicated content. These combinations keep Google from crawling new and unique content as quickly, and can keep pages from being indexed correctly because signals are diluted across the duplicated versions.
Session identifiers also fall under this list. User info and tracking information stored in these URLs cause duplicated content through the numerous URLs used to access a single page.
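One way to see how many of your URLs are really the same page is to normalize them before comparing. The Python sketch below strips session and tracking parameters and sorts whatever is left. The parameter names in IGNORED_PARAMS are assumptions for illustration, so swap in the ones your own site uses. Note that it only collapses session, tracking, and ordering variants; deciding which facet combinations are true duplicates is still up to you.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; replace with the ones your site actually uses.
IGNORED_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url: str) -> str:
    """Drop session/tracking parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()  # "?size=9&colour=red" and "?colour=red&size=9" become identical
    # Rebuild the URL without the fragment and with a lower-cased host name.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    urls = [
        "https://example.com/shoes?size=9&colour=red&sessionid=abc123",
        "https://example.com/shoes?colour=red&utm_source=news&size=9",
    ]
    # Both URLs reduce to a single canonical form.
    print({canonical_url(u) for u in urls})
```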
2. Minimize duplicate text content from page to page.
On websites with content duplicated across several pages, Google uses algorithms designed to prevent this duplicated content from adversely affecting the experience of users or webmasters. Here’s how Google deals with this duplicate content:
- When duplicate content is found, the URLs are grouped into a cluster.
- The best URL that represents the cluster is chosen and presented in search results.
- Properties such as link popularity within the cluster are consolidated and applied to the chosen URL that represents the cluster.
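The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not Google’s implementation: we assume duplicates are exact matches after whitespace and case are normalized, and that the “best” URL is the one with the most inbound links. The function name and input shapes are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def cluster_duplicates(pages: Dict[str, str], inbound_links: Dict[str, int]) -> List[dict]:
    """Group URLs with identical text, pick a representative, consolidate link counts."""
    groups = defaultdict(list)
    for url, body in pages.items():
        # Step 1: URLs whose normalized text hashes the same go into one cluster.
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)

    clusters = []
    for urls in groups.values():
        # Step 2 (assumption): the representative is the URL with the most inbound
        # links; ties go to the shortest URL, then alphabetical order.
        representative = min(urls, key=lambda u: (-inbound_links.get(u, 0), len(u), u))
        # Step 3: link popularity of every member is credited to the representative.
        total_links = sum(inbound_links.get(u, 0) for u in urls)
        clusters.append(
            {"representative": representative, "members": sorted(urls), "link_popularity": total_links}
        )
    return sorted(clusters, key=lambda c: c["representative"])


if __name__ == "__main__":
    pages = {
        "https://example.com/a": "Red  Shoes",
        "https://example.com/a?ref=1": "red shoes",
        "https://example.com/b": "Blue hats",
    }
    links = {"https://example.com/a": 5, "https://example.com/a?ref=1": 2, "https://example.com/b": 1}
    for cluster in cluster_duplicates(pages, links):
        print(cluster)
```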
3. Mark old deleted pages with a 404 Not Found response.
Crawl rate can be affected by soft error pages. These occur when a server responds with 200 OK for a page that does not exist, instead of the more appropriate 404 Not Found. This limits site crawling because these old deleted pages might be crawled instead of other live pages on the site.
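If you want to spot-check your own pages for this problem, a simple heuristic is to look for “not found” wording on pages that return 200 OK. The Python sketch below shows the idea. The phrase list and the function name are assumptions for illustration, and a real audit would need phrases that match your own error template.

```python
# Hypothetical phrases; use the wording from your own site's error template.
NOT_FOUND_PHRASES = ("page not found", "no longer available", "does not exist")


def classify_response(status: int, body: str) -> str:
    """Label a response as 'ok', 'not_found', 'soft_404', or 'other'."""
    text = body.lower()
    looks_missing = any(phrase in text for phrase in NOT_FOUND_PHRASES)
    if status in (404, 410):
        return "not_found"   # the server correctly reports the page as gone
    if status == 200 and looks_missing:
        return "soft_404"    # the page says it is missing, but the status says OK
    if status == 200:
        return "ok"
    return "other"           # redirects, server errors, etc.


if __name__ == "__main__":
    print(classify_response(200, "<h1>Sorry, page not found</h1>"))
    print(classify_response(404, "<h1>Sorry, page not found</h1>"))
```

Any URL this labels as a soft 404 is a candidate for returning a real 404 Not Found instead.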
4. Deal with any hacked website pages and content ASAP.
If your site has any hacked pages, well, just don’t expect Google to crawl it anytime soon.
5. Make sure that each URL on your website has its own unique purpose.
So-called infinite spaces (sections of a site that generate an excessive number of URLs with little or no new content, such as a calendar with an endless “next month” link) are not high on Google’s list of crawlable content. Make sure that each page you want Google to crawl has its own unique purpose and message. What need does it meet for your visitor?
6. Purge any low quality or spam pages.
Low-quality and spam pages are right up there with hacked pages and will negatively affect the crawl rate of the site.
Google considers all six of these areas a waste of its resources, and says they delay it from discovering the great content of a website.
If you have any questions about Crawl Budget, or about whether your site contains any of the items that might negatively affect your crawl budget, give 1st on the List a call at 1-888-262-6687. You can also reach us through email at firstname.lastname@example.org.
Learn more about How Search Engines Work to crawl, index, and rank your website!