Performance Impact of Robot Crawls

πŸ“˜ This article explains issues related to bot crawls that can impact site performance.

Overview

Using a crawler like Botify is a great way to get a better understanding of how your website and servers respond to different situations. While there are some differences between Botify's crawler and search engine behavior, the problems Botify's crawler encounters are often the same problems that search engines encounter.

Botify can crawl up to 250 pages per second. But you need to determine the speed your website can tolerate without degrading performance for users. Crawlers exploring your website cannot be easily compared to typical user traffic. Some ways crawlers may impact web server performance are identified below:

How crawls and user visits differ

A crawler typically requests fewer pages per second than the sum of the website's concurrent users. Given this, you might expect the additional load generated by a crawler to have little or no impact on performance. But a website undergoing a fast crawl will often suffer a performance drop and start returning pages much more slowly. This is because crawlers and users behave differently when accessing web pages: they apply different stress points to the server, and optimizations designed for users often do not work well for crawlers.

πŸ‘‰ Crawlers aim to explore all pages and never request the same page twice.

πŸ‘‰ Users often request the same pages as other users, and they may never visit some pages, such as search pages, filter combinations, and deep pages.

πŸ‘‰ Unlike users' browsers, crawlers do not use or accept cookies.

Compressed files

Botify crawls your site with GZIP compression (i.e., the "Accept-Encoding: gzip" HTTP header) enabled by default, though you can disable this in the advanced crawl settings. Check with your hosting/infrastructure team if you are unsure whether disabling this setting would be beneficial. The impacts of crawling with or without the GZIP option are:

  • A crawl without the GZIP option on a website that delivers compressed, pre-computed (i.e., cached) pages very fast forces the server or proxy to decompress every page, which creates a heavy computing load.

  • A crawl that accepts GZIP on a website whose pages are not pre-compressed forces the server to compress each page at request time, which is CPU-intensive.

What can you do?

If you suspect your web server does not deliver pre-compressed GZIP pages but compresses them at request time, disable the GZIP option, especially if you plan to crawl faster than ten pages per second. Weigh the available bandwidth against the available CPU to find the right compromise.
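
If you are unsure how your server behaves, a rough client-side check is to compare response times and the Content-Encoding header with and without an "Accept-Encoding: gzip" request header. The sketch below uses only the Python standard library; the URL is a placeholder, and a single timing comparison is only a hint, so average over several representative pages before drawing conclusions.

```python
# Rough sketch: compare response times with and without "Accept-Encoding: gzip"
# to get a hint of whether compression happens at request time.
# The URL is a placeholder; point it at a representative page on your own site.
import time
import urllib.request

URL = "https://www.example.com/"  # placeholder

def timed_fetch(accept_gzip: bool):
    """Fetch URL once, return (elapsed seconds, Content-Encoding header)."""
    req = urllib.request.Request(URL)
    if accept_gzip:
        req.add_header("Accept-Encoding", "gzip")
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # drain the body so the timing covers the full transfer
        encoding = resp.headers.get("Content-Encoding", "none")
    return time.perf_counter() - start, encoding

for accept_gzip in (True, False):
    elapsed, encoding = timed_fetch(accept_gzip)
    print(f"Accept gzip={accept_gzip}: {elapsed:.3f}s, Content-Encoding: {encoding}")
```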

Session management

Any website that records user sessions (e.g., a session cookie backed by database records or a memory cache) will slow down when pages are requested by a crawler: a user who visits 25 pages within ten minutes triggers only one write and 24 reads, while a crawler triggers 25 writes, because it does not send any cookies and therefore creates a new session with each page. This impacts response time, since write operations are among the slowest for servers. These sessions also "fill" the system β€” crawling one million pages creates one million sessions.

What can you do?

Do not create a session when the request comes from a known robot, especially if sessions are recorded in a database. You can do this for Botify when you crawl your website and, more importantly, for Googlebot, which crawls constantly. Alternatively, do this for search engine bots only and use the custom user agent option to crawl your website with Botify using a Googlebot user agent.
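
A minimal, framework-agnostic sketch of the idea follows: check the User-Agent header before writing a session record. The bot substrings and the create_session hook are assumptions to be adapted to your own stack (Django or Flask middleware, a servlet filter, etc.), and the exact user agent strings should be verified against your server logs.

```python
# Sketch: skip server-side session creation for requests that look like crawlers.
# Substrings below are illustrative; verify the exact user agents in your logs.
KNOWN_BOT_SUBSTRINGS = (
    "googlebot",   # Google's crawler
    "bingbot",     # Bing's crawler
    "botify",      # Botify's crawler (when using its default user agent)
)

def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string matches a known crawler."""
    ua = user_agent.lower()
    return any(bot in ua for bot in KNOWN_BOT_SUBSTRINGS)

def handle_request(headers: dict, create_session) -> None:
    """Only create a session (a write to the session store) for real users."""
    user_agent = headers.get("User-Agent", "")
    if not is_known_bot(user_agent):
        create_session()  # write to the database / memory cache as usual
    # Known bots are served without writing any session record.
```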

Cache system inefficiency

Many websites are behind a cache system. This type of optimization is very efficient for users because they tend to visit the same pages: one million page views can easily correspond to no more than 1,000 unique pages. The cache system makes the site very fast for these user requests. But a crawler that requests one million pages requests one million unique pages. This can lead to a situation where requested pages are never in the cache, straining web servers and making the website slower for bots and users. If crawled pages are stored in the cache, the cache fills up with pages that users rarely, if ever, request, at the expense of pages that users request often. This makes the cache even less efficient for everybody.

What can you do?

Set up the cache system to apply specific rules to robots: deliver cached pages if they are available (i.e., cache hit), but never store a page that was not in the cache (cache miss) and had to be requested from the origin server. The cache must only store pages requested by users.

You may also want to consider applying this type of optimization to lower-level caches. Do this for HTML proxy caches that store web pages and for database and hard drive caches.
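
The rule itself is simple, and the sketch below expresses it as a tiny in-process cache wrapper purely for illustration. In practice this logic lives in the HTTP cache or proxy layer's own configuration, not in application code; the fetch_from_origin callback and the is_bot flag are assumptions standing in for your real stack.

```python
# Illustrative sketch of the rule: "serve bots from cache on a hit, but never
# populate the cache on a miss they cause."
cache = {}  # url -> response body

def fetch_with_cache(url: str, is_bot: bool, fetch_from_origin) -> bytes:
    if url in cache:
        return cache[url]          # cache hit: cheap for users and bots alike
    body = fetch_from_origin(url)  # cache miss: the origin does the work
    if not is_bot:
        cache[url] = body          # only user-driven misses populate the cache
    return body
```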

Load balancing

Users usually come from many different IP addresses. A load-balancing system distributes the load to several front-end servers based on IP addresses. A crawler usually sends all its requests from the same IP address or a very small number of IP addresses. If you treat robots the same way as users, the front-end server that receives the crawler's requests will be heavily loaded compared to others. The users who happen to be directed to the same server will experience slower performance.

What can you do?

Check how the load balancer behaves towards robots (e.g., no "sticky sessions"). Alternatively, dedicate one front-end server to robots to avoid impacting users in case performance declines.
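
The "dedicated server for robots" idea can be summarized as below. The backend names and the hash-based distribution are assumptions for illustration; real deployments express this in the load balancer's own configuration language.

```python
# Sketch: route crawler traffic to one dedicated backend, spread users across
# the rest. Backend names are hypothetical.
import hashlib

USER_BACKENDS = ["web-01", "web-02", "web-03"]  # hypothetical front-end pool
BOT_BACKEND = "web-bots"                        # hypothetical server reserved for crawlers

def pick_backend(client_ip: str, is_bot: bool) -> str:
    if is_bot:
        # All crawler traffic lands on one box, so a fast crawl cannot
        # degrade the servers that real users are hitting.
        return BOT_BACKEND
    # A stable hash of the IP spreads users across the pool without
    # cookie-based sticky sessions.
    digest = hashlib.sha1(client_ip.encode()).digest()
    return USER_BACKENDS[digest[0] % len(USER_BACKENDS)]
```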

Computing-intensive pages

One of a crawler's objectives is to discover what is in the website structure. Users rarely visit deep pages, yet that is where issues such as search pages with heavy pagination, navigation pages with many filter combinations, and low-quality automatically generated pages tend to exist. By their nature, deep pages are numerous. They also tend to be computing-intensive, as they require more complex queries than pages higher in the website structure, and they are not cached because users do not visit them.

What can you do?

Monitor the crawl speed or the website load, and slow the crawl if performance goes down. Botify enables you to monitor the crawl progress and change the crawl speed.
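
One simple way to monitor this on your side is to track a moving average of response times during the crawl and flag when the site starts slowing down. The sketch below is illustrative only: the window size and threshold are placeholder values, and the actual crawl speed is changed in Botify's crawl settings, not by this code.

```python
# Sketch: flag when recent response times suggest the site is degrading.
# WINDOW and SLOW_THRESHOLD_S are illustrative placeholders, not recommendations.
from collections import deque

WINDOW = 200             # number of recent requests to average over
SLOW_THRESHOLD_S = 1.0   # flag if the average response time exceeds this

recent = deque(maxlen=WINDOW)

def record_response_time(seconds: float) -> None:
    recent.append(seconds)

def should_slow_down() -> bool:
    if len(recent) < WINDOW:
        return False     # not enough data yet to judge
    return sum(recent) / len(recent) > SLOW_THRESHOLD_S
```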

TCP connection persistence

During a fast crawl, the web server can enter a degraded mode for TCP connection management. Most bots do not use the HTTP request "keep-alive" option, which keeps the connection open once the bot receives a page. Without this option, bots open a new connection for each page. Most web servers are optimized for reused connections (i.e., keep-alive) and switch to a degraded mode otherwise.

What can you do?

Check that the crawler you are using makes use of the "keep-alive" option, as Botify's crawler does.
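
You can observe the effect of connection reuse from the client side with a quick comparison. The sketch below assumes the Python requests library is installed and uses a placeholder URL: a Session keeps the TCP connection open between requests (keep-alive), while separate requests.get() calls generally open a new connection each time.

```python
# Quick comparison of fetching the same page with and without connection reuse.
# The URL is a placeholder; point it at one of your own pages.
import time
import requests

URL = "https://www.example.com/"  # placeholder

def timed(fetch, n: int = 10) -> float:
    """Run fetch() n times and return the total elapsed time in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fetch()
    return time.perf_counter() - start

with requests.Session() as session:
    reused = timed(lambda: session.get(URL))   # keep-alive: connection reused
no_reuse = timed(lambda: requests.get(URL))    # new connection per request

print(f"with keep-alive: {reused:.2f}s, without: {no_reuse:.2f}s")
```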

Potential bandwidth issues

Some web servers are optimized to serve pre-generated static pages. A very fast crawl can then hit unexpected bottlenecks and cause significant performance drops if insufficient bandwidth is available at the front of the website. Bandwidth usage is usually not as closely monitored as indicators like pages per second, errors per second, and visits per page.

What can you do?

Evaluate what crawl rate, added to user traffic, can bring you close to the bandwidth limit. Add bandwidth monitoring for the crawl and globally, at the web server level.
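
A back-of-the-envelope estimate is often enough to start with: multiply the planned crawl rate by the average page size actually sent on the wire and compare the result with the headroom left by user traffic. All figures in the sketch below are placeholders; replace them with averages from your own server logs and bandwidth graphs.

```python
# Back-of-the-envelope estimate of the bandwidth a crawl adds on top of user
# traffic. All numbers are placeholders taken from hypothetical logs.
CRAWL_RATE_PAGES_PER_S = 100      # planned crawl speed
AVG_PAGE_SIZE_BYTES = 80 * 1024   # average response size actually sent on the wire
LINK_CAPACITY_MBIT_S = 1000       # bandwidth available at the front of the site
USER_TRAFFIC_MBIT_S = 400         # current peak usage from real users

crawl_mbit_s = CRAWL_RATE_PAGES_PER_S * AVG_PAGE_SIZE_BYTES * 8 / 1_000_000
headroom = LINK_CAPACITY_MBIT_S - USER_TRAFFIC_MBIT_S - crawl_mbit_s

print(f"crawl adds ~{crawl_mbit_s:.0f} Mbit/s, remaining headroom ~{headroom:.0f} Mbit/s")
```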
