
Overriding Robots.txt in Botify Crawls


📘 This article explains the virtual robots.txt option, which lets you override the rules in your robots.txt file during Botify crawls.

Overview

To specify allow/disallow directives for the Botify crawler that differ from those in your website's robots.txt file, you can use the virtual robots.txt option in Advanced Crawl Settings. The following are some examples of when you may want to use this option:

  • To ignore certain pages based on their URL patterns, such as user reviews on an e-commerce website (see the sample rules after this list).

  • To restrict the analysis to a specific folder (e.g., your site's blog section). Read more on how to crawl only a folder.

  • To analyze pages that are normally disallowed to robots:

    • Include pages in the analysis that are currently disallowed to see the impact they would have on your website structure if you removed the disallow.

    • Analyze a new version or section of your website that is currently disallowed to robots.
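
For instance, two of the cases above could be covered with rules along the following lines, shown as two separate alternatives (the /reviews/ and /blog/ paths are placeholders for your site's actual URL patterns):

# Alternative 1: ignore user review pages
User-agent: *
Allow: /
Disallow: /reviews/

# Alternative 2: restrict the analysis to the blog folder only
User-agent: *
Allow: /blog/
Disallow: /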

Specifying Virtual Robots.txt Rules

Specify virtual robots.txt rules in the Behavior section of the crawl's Advanced Settings:

(Screenshot: Virtual Robots.txt field in Advanced Settings)
  1. Copy and paste your existing robots.txt file into the Virtual Robots.txt text box. You can type new rules here; however, copying and pasting your existing file ensures you keep all the existing rules you want to preserve.

  2. Scroll to the bottom of the page, then click Save.

Example

To instruct the Botify crawler to ignore the robots.txt file found online and allow everything instead, enter the following in the Virtual Robots.txt text field:

User-agent: *
Allow: /

This Virtual Robots.txt example applies to all domains allowed for the crawl, as specified in the main project settings. You can also define rules that apply only to a specific domain, or define a separate set of rules for each domain (see Specifying Rules by Domain below).
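
For instance, to apply this allow-everything rule only to www.mywebsite.com (a placeholder domain) and let the crawler fetch the real robots.txt files online for the other allowed domains, you could place a domain header above the rules:

[www.mywebsite.com]
User-agent: *
Allow: /

The header syntax is detailed in the Specifying Rules by Domain section below.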

How Botify Crawls with a Virtual Robots.txt

When directives are present in the Virtual Robots.txt field, Botify crawls your site following these rules instead of the ones in the robots.txt file published on your website. Using the virtual robots.txt option does not modify the robots.txt file on your website; the virtual rules are only used during Botify crawls, for simulation purposes.

If you did not specify a domain header in the virtual robots.txt, the directives apply to all domains. Refer to the Specifying Rules by Domain section below.
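
For example, suppose the robots.txt published on your website contains a 'Disallow: /new-version/' line and you want to measure the impact of removing it (the /new-version/ and /private/ paths are placeholders). You could paste the online file into the Virtual Robots.txt field and simply drop that line; the file on your website stays untouched:

User-agent: *
Disallow: /private/
# 'Disallow: /new-version/' from the online file was removed, so Botify
# will crawl /new-version/ during this analysis, while /private/ remains excluded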

Supported Features

  • User agent: Whether the crawl fetches the robots.txt file online or uses a virtual robots.txt, Botify applies the directives for the Botify, Googlebot, or * user agent, in that order of precedence; all other user agents are ignored. The crawler follows the directives for "Botify" if there is a 'user-agent: botify' section in the robots.txt or the virtual robots.txt; if not found, it follows the directives for 'user-agent: googlebot'; if neither is found, it follows the general rules in the 'user-agent: *' section (see the illustration after this list).

  • Comments are supported in virtual robots.txt when placed after a #.

  • As in a regular robots.txt file, the user-agent string values are case insensitive and the path string values are case sensitive.

  • Allow and Disallow are the only supported directives. Directives such as crawl-delay or sitemap are not supported.

  • Vertical robots such as Googlebot Image or Googlebot Mobile are not supported.
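
As an illustration of the user-agent precedence described above, with a virtual robots.txt like the one below (the /private/ and /drafts/ paths are placeholders), the Botify crawler follows the 'botify' section and ignores the 'googlebot' and '*' sections:

User-agent: botify
Disallow: /private/

User-agent: googlebot
Disallow: /drafts/

User-agent: *
Disallow: /

# Botify follows the 'botify' section only: /private/ is excluded,
# while /drafts/ and the rest of the site remain crawlable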

Specifying Rules by Domain

To include rules for different domains or subdomains in your virtual robots.txt, use headers that indicate the domain or domains to which each set of rules applies. The header is placed between [ ] and is followed by regular robots.txt content. Repeat as needed:

[header]
robots.txt content
[header]
robots.txt content
…

Examples of headers:

[*] # all domains, for HTTP or HTTPS
[https://*] # all domains, for HTTPS only
[www.mywebsite.com] # only that domain, for HTTP or HTTPS
[*.mywebsite.com] # all subdomains of that domain, for HTTP or HTTPS
[http://*.mywebsite.com] # all subdomains of that domain, HTTP only

Note that the wild card is only allowed under the following conditions:

  • At the beginning of the hostname AND before a ‘.’

  • For the full hostname

  • For the full header (meaning any protocol, any domain)

Examples of headers that are not allowed:

[http://www.mywebsite.*] # wild card is not at the beginning of hostname
[http:blog.*.mywebsite.com] # wild card is not at the beginning of hostname
[http://*mywebsite.com] # wild card is not followed by a ‘.’

Examples of Virtual Robots.txt Content

1st example:

# no header: the content below was copied from the
# website’s robots.txt and pasted in the Virtual Robots.txt
User-agent: googlebot
Allow: /
Disallow: /private/ 
# the Botify crawler will apply these rules to ALL the domains it will crawl.
# it’s the same as if there was a [*] header

2nd example:

[www.mywebsite.com]
User-agent: botify # could also be 'googlebot'
Allow: /
Disallow: /private/
# the Botify crawler will fetch 'real' /robots.txt files online for other domains

3rd example:

[www.mywebsite.com]
User-agent: botify # could also be 'googlebot'
Allow: /
Disallow: /private/

[beta.mywebsite.com] # only botify is allowed
User-agent: botify
Allow: /
User-agent: *
Disallow: /