🛠 Botify’s LogAnalyzer requires information from your CDN or web server logs. This document describes the integration process and the information required to set up the service.
Overview
Integrating your web traffic data with your Botify project enables analysis of how search engine bots interact with your website. Your web traffic is stored in log files, typically hosted by a CDN, which record every interaction with your website by users and search engines. Integrating this data with Botify’s crawl data helps you realize the full benefit of Botify, providing key metrics that offer insight into how search engines crawl your site and how crawl activity correlates with visitor traffic.
👍 We have a secure process for your log file delivery and protect any Personally Identifiable Information (PII). As an additional security measure, we ask that you remove the IP addresses of all user visits from your log files.
How to Integrate Web Traffic Data
The integration process includes the following steps:
- Botify delivers the first dashboard to your project, and you validate the data.
- After the first delivery and confirmation, data is automatically updated in your Botify project daily.
💡 We strongly recommend defining the segmentation in your project after Botify delivers the first log data to enable detailed analysis based on your website page categories.
Preparing Log Content
We need all your CDN log files. If you do not use a CDN, or if part of your traffic goes directly to your website, we need the logs from all front-end web servers that receive direct requests from crawlers and users.
Required Fields
For each log line, include the following fields:
| Field | Description | Example |
| --- | --- | --- |
| Date | The exact date of the request, preferably with the timezone. | Apache format: `[Wed Oct 11 14:32:52 2000 +0100]`; IIS format: `2014-06-24 03:59:59` |
| URL | The full URL, including query parameters. | Apache format: `"GET /cdn/H264-512x384/video/x8qujc.mp4?auth=1393905581-2-18kwwfcc-4a8a74d75a6e4e8575592bece46a8910 HTTP/1.1"`; IIS format: `/world/communaute/moteur/googlepos/todo.asp Instance=jumbo-4&Version=20120717` |
| Referer | The page from which the connection was made. | |
| User agent | The browser or bot that issued the request. | |
| HTTP Status Code | The HTTP status code returned for the request. | 200, 301, 404 |
| Domain | If the files contain logs for different subdomains, the virtual host (domain) associated with the URL (e.g., news.example.com), either in the URL or in a separate field. If your logs contain lines that do not belong to the domains you want to integrate, tell us which domains to keep or remove from the logs. | |
| Protocol | The protocol of the request (HTTP or HTTPS), especially for HTTPS websites, either in the URL or in a separate field. If you cannot add the protocol in the logs, let us know your website's default protocol. | |

These fields are included in the default log formats on Apache, Nginx, Varnish, and IIS servers.
Optional Fields
| Field | Description |
| --- | --- |
| Client IP Address | The client IP for crawl lines (i.e., the address of the machine sending the HTTP request). This IP address allows us to verify the user agent's authenticity, helping detect user-agent spoofing, where search engine user-agent strings are used to access a site illegitimately. |
| Comments | Log files may contain comments starting with a `#`. |
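For illustration, here is a minimal sketch of how the client IP can be used to verify that a request claiming to be Googlebot is genuine, using the reverse-then-forward DNS check that Google documents for its crawlers. This is not Botify's implementation, and the sample IP address is only a commonly cited Googlebot address used as an example.

```python
import socket

def is_genuine_googlebot(client_ip: str) -> bool:
    """Reverse-DNS the client IP, check the hostname belongs to Google,
    then forward-resolve that hostname to confirm it maps back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)
    except OSError:
        return False  # no reverse DNS record: not a genuine Googlebot
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False  # hostname is outside Google's crawler domains
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    return client_ip in forward_ips  # forward-confirmed reverse DNS

# Hypothetical example IP, often seen in Googlebot documentation.
print(is_genuine_googlebot("66.249.66.1"))
```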
For non-IIS servers, we recommend using the Apache Combined Log Format without any change if you can configure it on your servers:
```
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
```
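As a sketch of how this format maps to the fields listed above (not part of Botify's pipeline), the following parses one Combined Log Format line with a regular expression that mirrors the directives in the format string; the sample line is constructed for illustration.

```python
import re

# One capture group per directive in the Combined Log Format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('123.123.123.123 - - [17/Nov/2014:09:47:49 +0100] '
        '"GET /example/page.html HTTP/1.1" 200 24230 '
        '"http://www.google.fr" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0"')

match = COMBINED_LOG.match(line)
if match:
    print(match.group("date"), match.group("request"), match.group("status"))
```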
Data Types to Include
Include the following types of data in your log files, filtering content where needed:
Crawls
In the logs, we detect that a bot explored a page through its user agent. We activate detection of Googlebot, Google sub-bots, and AI bots by default. If you are on a Botify Pro plan, we also activate Bing. Contact us if you need to activate Yandex, Naver, or Baidu bot analysis.
The following is a log line example for a crawl from Googlebot:
```
forum.example.com 123.123.123.123 [17/Nov/2014:02:25:53 +0100] "GET /viewtopic.php?f=1&t=2&start=10 HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 200 23066
```
Visits
Botify detects a visit from a human user through the Referer found in the logs. While not required, using the visits feature in Botify is strongly recommended so you can directly compare log-based visits with search engine crawls. To measure organic visits using your log files, you must include the Referer field in your server log data.
👀 The Referer may be absent from CDN log files by default. We encourage you to verify this and ask the CDN team to activate this field before you send us your first log files. If you use Google Ads, please provide us with the URL parameters to identify SEA visits and not treat them as SEO visits.
Here is a log line example for an organic visit (where Google.fr referred the user):
```
www.example.com 123.123.123.123 [17/Nov/2014:09:47:49 +0100] "GET /example/page.html HTTP/1.1" "http://www.google.fr" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0" 200 24230
```
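To make the two detection rules concrete, here is a minimal sketch that classifies a request as a bot crawl (via the user agent) or an organic visit (via the Referer). The bot tokens and search engine hosts are illustrative assumptions, not Botify's actual detection lists.

```python
from urllib.parse import urlparse

# Illustrative lists only; Botify's real detection rules are richer.
BOT_TOKENS = ("Googlebot", "bingbot", "GPTBot")
SEARCH_ENGINE_HOSTS = ("google.", "bing.", "yahoo.", "duckduckgo.")

def classify(user_agent: str, referer: str) -> str:
    if any(token in user_agent for token in BOT_TOKENS):
        return "crawl"  # a search engine or AI bot explored the page
    host = urlparse(referer).hostname or ""
    if any(engine in host for engine in SEARCH_ENGINE_HOSTS):
        return "organic visit"  # a human arrived from a search result page
    return "other"

print(classify("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "-"))
print(classify("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
               "http://www.google.fr"))
```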
👉 Be sure to include AI bot user agents if you want to include their traffic.
Delivering Log Files
Processing your web server log files allows us to analyze the main search engine bots' exploration of your website and count SEO visits sent from search engine result pages daily.
Individual Files
Within the server log files, Botify uses only a subset of log lines related to the search engines you want to monitor. You may provide us with this data subset or the full log file content.
We need the log files from all web servers, including cache servers, if applicable.
The full path for each log file must be unique over time. During daily uploads, new files must not replace previous files.
Each log file should be compressed individually using one of the following formats: gzip, bzip2, zip, or xz. We do not accept zip archives containing multiple files, tar.gz, 7zip, or rar.
File names cannot contain any spaces.
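As an example of preparing files that meet these constraints, here is a minimal sketch that gzips each raw log individually and rejects names containing spaces. The source directory and `.log` naming are assumptions for illustration; adapt them to your environment.

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical source directory; adjust to wherever your raw logs live.
raw_dir = Path("/var/log/exports")

for raw_log in raw_dir.glob("*.log"):
    if " " in raw_log.name:
        raise ValueError(f"File names cannot contain spaces: {raw_log.name}")
    # Compress each file individually (one log per gzip archive).
    compressed = raw_log.with_name(raw_log.name + ".gz")
    with raw_log.open("rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
```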
Volume
There is no theoretical limit to the volume of logs you may upload. If you expect to upload more than 200 GB of logs per day, let the Botify Support team know in advance. Our system handles files up to 8 GB each. If you deliver larger files, we will need to add a split step before processing, which adds delays to the initial setup and to every daily update.
Filtering
At the start of log file processing, Botify runs a filtering process that discards all lines that are not crawls or visits and anonymizes all IP addresses on visit lines. The result is an anonymized file, reduced to its minimal size, which is processed again to obtain the SEO data displayed in the application. You can choose whether this filtering and anonymization occurs in the European Union or the United States.
To reduce the volume of logs you send and anonymize them on your side, you may filter and anonymize the logs yourself before sending them to us:
- Hide the IP address for the log lines that are not crawl lines.
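Here is a minimal sketch of such a pre-filter, assuming gzip-compressed input with the client IP as the first field and an illustrative bot list: it keeps crawl lines as-is and masks the IP on all other lines.

```python
import gzip

BOT_TOKENS = ("Googlebot", "bingbot")  # illustrative, not Botify's full list

def anonymize(in_path: str, out_path: str) -> None:
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if not any(token in line for token in BOT_TOKENS):
                # Not a crawl line: hide the client IP (first field).
                _, _, rest = line.partition(" ")
                line = "0.0.0.0 " + rest
            dst.write(line)

anonymize("20150130.webserver1.log.gz", "20150130.webserver1.anon.log.gz")
```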
Delivery Methods
Botify offers two methods for delivering log files:
- You deliver the logs via FTP/FTPS/SFTP storage.
FTPS/SFTP
| Location | Security | Naming Convention |
| --- | --- | --- |
| Our private storage space at .upload.botify.com, in the /logs/ subdirectory, without additional subdirectories. | FTP/FTPS: use the Botify-provided password. SFTP: send us the public key you will use to deliver the logs. | YYYYMMDD.log |
Example:
```
logs/20150130.webserver1.log.gz
logs/20150130.webserver2.log.gz
logs/20150130.webserver3.log.gz
logs/20150130.webserver4.log.gz
logs/20150131.webserver1.log.gz
logs/20150131.webserver2.log.gz
logs/20150131.webserver3.log.gz
logs/20150131.webserver4.log.gz
...
```
The FTP protocol uses ports 20 and 21.
The FTPS configuration is more precisely an "FTP over TLS" configuration on a pure-ftpd server. This configuration uses port 21 for the "command" channel and random ports for the "data" channels. For more information on TLS and clients that support it, refer to http://download.pureftpd.org/pure-ftpd/doc/README.TLS.
To restrict network flows, we advise you to configure an IP restriction on the IP address of .upload.botify.com.
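As an illustration of the SFTP delivery method, here is a minimal sketch using the third-party paramiko library. The hostname, username, and key path are placeholders: your account-specific endpoint at .upload.botify.com and its credentials are provisioned by Botify.

```python
import paramiko

# Placeholders: use the endpoint and credentials Botify provides.
HOST = "example.upload.botify.com"  # hypothetical account subdomain
USER = "botify"
KEY_PATH = "/home/me/.ssh/id_rsa"   # key whose public half you sent to Botify

key = paramiko.RSAKey.from_private_key_file(KEY_PATH)
transport = paramiko.Transport((HOST, 22))
transport.connect(username=USER, pkey=key)
sftp = paramiko.SFTPClient.from_transport(transport)

# Upload into /logs/ with a unique, dated name and no subdirectories.
sftp.put("20150130.webserver1.log.gz", "/logs/20150130.webserver1.log.gz")
sftp.close()
transport.close()
```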
Delivering Recurring Log Files
If you deliver logs daily, provide the previous day's logs each night. We take new log files into account every six hours, but you can upload new log files at any interval. The delay before your data appears in the logs report depends on your data volume, but it is usually less than one hour after your logs are taken into account, provided segmentation has not changed since the previous processing.
❗️Change Management
To avoid disruption in your recurring log delivery, please alert your Botify account team immediately if any of the following changes occur:
- A change in the file name format.
- A change in the file content format.
- A new file type (new name patterns in the upload folder).
- A change in the time of upload.
Validating Log Files
As soon as the first files arrive, Botify validates the log files to ensure the following:
- We support the format.
- All necessary information is included in the data you provide (all expected fields, at least one line of bot exploration, and at least one line of visits).
After the first log report is delivered to your Botify project, we ask you to validate the following:
Volume Validation
Confirm the volume of analyzed data corresponds to the expected volume of data. Please work with your Botify account team to inspect the following and confirm the volumes correspond to your expectations:
- Number of SEO visits within a day
- Number of active pages within a day
- Total crawls from Google within a day
- Unique crawls from Google within a day
URL and Domain Validation
Verify the URLs and domains included in your LogAnalyzer dashboard match your expectations. In the URL Explorer, filter on “URL does not contain” your expected domains to ensure all URLs match your scope. The Botify Professional Services team can assist in this validation and the “Logs QA” process.
Log File Hosts
We need the logs from your CDN for all web traffic that goes through your CDN. Please refer to the appropriate guide for additional requirements specific to each provider.
Contact Support
If you need any assistance, please contact Support.