This article explains issues related to bot crawls that can impact site performance.
Overview
Using a crawler like Botify is a great way to get a better understanding of how your website and servers respond to different situations. Although the Botify crawler does not replicate search engine behavior entirely, problems that the Botify crawler encounters are often problems that search engines encounter.
Botify can crawl up to 250 pages per second. But you need to determine the speed your website can tolerate without degrading performance for users. Crawlers exploring your website cannot be easily compared to typical user traffic. Some reasons why crawlers may impact web server performance are identified below:
Difference Between Crawls and User Visits
In most cases, a crawler will request fewer pages per second than the sum of the website's concurrent users. In that case, we might expect the additional load generated by the crawler to remain negligible and unnoticeable from a performance perspective. However, a website undergoing a fast crawl will often see a performance drop and start returning pages much more slowly. This is because crawlers and users behave differently when accessing web pages and do not apply the same stress points to the server, and optimization mechanisms meant for users do not usually work well for crawlers.
Crawlers are designed to explore all pages and never request the same page twice. Users often request the same pages as others and never visit some pages (e.g., search pages, filter combinations, deep pages).
Unlike user visits, a crawler does not use or accept cookies.
Compressed Files
Botify crawls your site with GZIP (i.e., "Accept-Encoding: GZIP" HTTP header) enabled by default, though you can disable it in advanced crawl settings. Check with your hosting/infrastructure team if you are unsure whether it would be beneficial to disable this setting. The impacts of crawling with or without a GZIP option are:
A crawl without a GZIP option on a website that delivers compressed, pre-computed (cached) pages very fast requires the server or proxy to decompress all pages, which induces a heavy computing load.
A crawl that accepts GZIP on a website whose pages are not pre-compressed forces the server to compress each page at request time, which is CPU-intensive if the server systematically delivers GZIP pages.
What can you do?
If you suspect your web server does not deliver pre-compressed GZIP pages but compresses them at request time, disable the GZIP option, especially if you plan to crawl faster than ten pages per second. Weigh the available bandwidth against the available CPU to find the right compromise. For more information on GZIP, here is an excellent article.
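If you are unsure which case applies to your site, one quick test is to request the same page with and without compression and compare the Content-Encoding header and the response time. Below is a minimal sketch in Python using the requests library; the URL is a placeholder, and a consistently slower compressed response suggests the server compresses on the fly rather than serving pre-compressed pages.

    import time
    import requests

    URL = "https://www.example.com/some-page"  # placeholder URL

    def timed_get(accept_gzip: bool) -> None:
        headers = {"Accept-Encoding": "gzip" if accept_gzip else "identity"}
        start = time.monotonic()
        response = requests.get(URL, headers=headers)
        elapsed = time.monotonic() - start
        print(
            f"Accept-Encoding={headers['Accept-Encoding']:<8} "
            f"Content-Encoding={response.headers.get('Content-Encoding', 'none'):<6} "
            f"time={elapsed:.3f}s bytes={len(response.content)}"
        )

    # If the gzip request is consistently slower, the server is probably
    # compressing pages at request time rather than serving them pre-compressed.
    timed_get(accept_gzip=True)
    timed_get(accept_gzip=False)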
Session Management
Any website that records user sessions (e.g., using a cookie with database records or a memory cache) will slow down when pages are requested by a crawler: a user who visits 25 pages within ten minutes will only trigger one write and 24 reads. A crawler will trigger 25 writes, as it does not send any cookies and creates a new session with each page. Since write operations are among the slowest for servers, this impacts response time. Also, these sessions will "fill" the system - crawling one million pages will create one million sessions.
What can you do?
Do not create a session when the request comes from a known robot, especially if sessions are recorded in a database. You can do this for Botify when you crawl your website and, more importantly, for Googlebot, which crawls constantly. Alternatively, do this for search engine bots only and use the custom user agent option to crawl your website with Botify using a Googlebot user agent.
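As an illustration of this idea (not Botify's or any specific framework's implementation), here is a minimal sketch of a user-agent check an application could use to decide whether to create a session; the bot tokens listed are assumptions, not an exhaustive list.

    # Hypothetical sketch: skip server-side session creation for known bots.
    # The tokens below are illustrative; real bot detection may also need to
    # verify the requester (e.g., via reverse DNS).
    BOT_TOKENS = ("googlebot", "bingbot", "botify")

    def should_create_session(user_agent: str) -> bool:
        """Return True for regular visitors, False for known crawlers."""
        ua = user_agent.lower()
        return not any(token in ua for token in BOT_TOKENS)

    if __name__ == "__main__":
        print(should_create_session("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # True
        print(should_create_session("Mozilla/5.0 (compatible; Googlebot/2.1)"))    # False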
Cache System Inefficiency
Many websites are behind a cache system. This type of optimization is very efficient for users because they tend to visit the same pages: one million page views can easily correspond to no more than 1,000 unique pages. The cache system enables the site to be very fast for these user requests. But a crawler that requests one million pages will request one million unique pages. This can lead to a situation where requested pages are never in the cache, straining web servers and making the website slower for bots and users alike. If crawled pages are stored in the cache, the cache system fills up with pages that users rarely, if ever, request, at the expense of pages that users request frequently. This makes the cache even less efficient for everybody.
What can you do?
Set up the cache system to apply specific rules to robots: deliver cached pages if they are available (cache hit), but never store a page that was not in the cache (cache miss) and had to be fetched from the origin server. The cache must only store pages requested by users.
You may also want to consider applying this type of optimization to lower-level caches. Do this for HTML proxy caches that store web pages and for database and hard drive caches.
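A minimal sketch of this caching rule, assuming a simple in-process dictionary stands in for the cache and a hypothetical is_bot() check based on the user agent; real cache layers (e.g., a reverse proxy) expose this behavior through their own configuration.

    # Hypothetical sketch: serve cached pages to everyone, but only let
    # user traffic populate the cache.
    BOT_TOKENS = ("googlebot", "bingbot", "botify")
    page_cache: dict[str, str] = {}

    def is_bot(user_agent: str) -> bool:
        ua = user_agent.lower()
        return any(token in ua for token in BOT_TOKENS)

    def fetch_page(url: str) -> str:
        # Placeholder for the expensive origin request / page rendering.
        return f"<html>rendered content of {url}</html>"

    def get_page(url: str, user_agent: str) -> str:
        if url in page_cache:               # cache hit: serve it to anyone
            return page_cache[url]
        page = fetch_page(url)              # cache miss: go to the origin
        if not is_bot(user_agent):          # only user requests populate the cache
            page_cache[url] = page
        return page

    print(get_page("/deep-page", "Mozilla/5.0 (compatible; Googlebot/2.1)"))  # miss, not stored
    print(get_page("/deep-page", "Mozilla/5.0 (Windows NT 10.0)"))            # miss, stored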
Crawler Impact on Load Balancing
Users usually come from many different IP addresses. A load-balancing system distributes the load to several front-end servers based on IP addresses. A crawler usually sends all its requests from the same IP address or a very small number of IP addresses. If you treat robots the same way as users, the front-end server that receives the crawler's requests will be heavily loaded compared to others. Unlucky users who happen to be directed to the same server will experience slower performance.
What can you do?
Check how the load balancer behaves with robots (e.g., make sure "sticky sessions" do not pin all of a crawler's requests to a single server). Alternatively, dedicate one front-end server to robots so that users are not impacted if performance declines.
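A hedged sketch of the dedicated-backend idea: users are spread across the pool by hashing their IP address, while requests from known bots all go to one front-end reserved for them. The backend names and bot tokens are placeholders.

    # Hypothetical routing sketch: bots get a dedicated backend, users are
    # distributed across the pool by IP hash.
    import hashlib

    USER_BACKENDS = ["front-1", "front-2", "front-3"]  # placeholder server names
    BOT_BACKEND = "front-bots"
    BOT_TOKENS = ("googlebot", "bingbot", "botify")

    def choose_backend(client_ip: str, user_agent: str) -> str:
        if any(token in user_agent.lower() for token in BOT_TOKENS):
            return BOT_BACKEND
        digest = hashlib.sha1(client_ip.encode()).hexdigest()
        return USER_BACKENDS[int(digest, 16) % len(USER_BACKENDS)]

    print(choose_backend("203.0.113.7", "Mozilla/5.0"))                       # a user backend
    print(choose_backend("203.0.113.7", "Mozilla/5.0 (compatible; botify)"))  # front-bots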
Computing-Intensive Pages
One of the objectives of a crawler is to discover everything in the website structure. Users rarely visit deep pages, and that is also where issues tend to appear: search pages with heavy pagination, navigation pages with many filter combinations, and low-quality automatically generated pages. By nature, deep pages are numerous. They also tend to be computing-intensive, as they require more complex queries than pages found higher in the website structure, and they are not cached since users do not visit them.
What can you do?
Monitor the crawl speed or the website load, and slow the crawl if performance goes down. Botify enables you to monitor the crawl progress and change the crawl speed.
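As a hypothetical illustration of such monitoring, the sketch below keeps a rolling average of recent response times and flags when the site appears to be slowing down, at which point the crawl speed can be lowered; the window size and threshold are assumptions to tune for your site.

    # Hypothetical monitoring sketch: track a rolling average of response
    # times and flag when the site appears to be slowing down.
    from collections import deque

    WINDOW = 100                 # number of recent responses to average
    SLOW_THRESHOLD_S = 2.0       # illustrative threshold, tune for your site

    recent_times: deque[float] = deque(maxlen=WINDOW)

    def record_response_time(seconds: float) -> None:
        recent_times.append(seconds)

    def site_is_slowing_down() -> bool:
        if len(recent_times) < WINDOW:
            return False
        return sum(recent_times) / len(recent_times) > SLOW_THRESHOLD_S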
TCP Connections Persistence
During a fast crawl, the web server can enter a degraded mode for TCP connection management. Most bots make little or no use of the HTTP "keep-alive" option, which keeps the connection open after a page is received so that a new connection does not have to be opened for the next request. Without "keep-alive", they create a new connection for each page. Most web servers, on the other hand, are optimized for reused connections (keep-alive) and switch to a degraded mode otherwise.
What can you do?
Check that the crawler you are using can make ample use of the "keep-alive" option, as Botify's crawler does.
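For illustration, here is how connection reuse looks from a Python client using the requests library: a Session pools and reuses TCP connections (keep-alive), whereas closing the connection after each request forces a new one per page. The URLs are placeholders.

    import requests

    URLS = ["https://www.example.com/page-1",   # placeholder URLs
            "https://www.example.com/page-2"]

    # Without keep-alive: each call sets up (and tears down) its own connection.
    for url in URLS:
        requests.get(url, headers={"Connection": "close"})

    # With keep-alive: the session pools connections and reuses them.
    with requests.Session() as session:
        for url in URLS:
            session.get(url)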
Potential Bandwidth Issues
Some web servers are optimized to serve pre-generated static pages. When crawling very fast, there can be a significant performance drop due to a bottleneck in an unexpected place: insufficient bandwidth at the front of the website. Bandwidth usage is usually not among the most closely monitored indicators, which tend to be pages per second, errors per second, visits per page, etc.
What can you do?
Evaluate what crawl rate, added to user traffic, could bring you close to the bandwidth limit. Add bandwidth monitoring both for the crawl and globally, at the web server level.
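A back-of-the-envelope sketch of that evaluation, with purely illustrative numbers: multiply the crawl rate by the average transferred page size and add it to current user traffic to see how close you get to the link capacity.

    # Illustrative estimate only; replace the numbers with your own measurements.
    crawl_rate_pages_per_s = 100
    avg_page_size_kb = 150          # transferred size, e.g., compressed HTML
    user_traffic_mbps = 300
    link_capacity_mbps = 1000

    crawl_mbps = crawl_rate_pages_per_s * avg_page_size_kb * 8 / 1000
    total_mbps = crawl_mbps + user_traffic_mbps
    print(f"Crawl adds ~{crawl_mbps:.0f} Mbps; total ~{total_mbps:.0f} of {link_capacity_mbps} Mbps")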