Integrating Web Server Log Data

🛠 Botify’s LogAnalyzer setup requires information from web server logs. This document describes the integration process and explains the information required to set up the service.

Overview

The integration of your web server log files, which record all transactions on your web server, is required for Botify to analyze how search engine bots interact with your website. Log data is contained in date-stamped text files with one line per request; the specific format differs based on your web server host. The basic information in all log files includes the request date, the type of request, the requestor, and the HTTP response. In Botify, the log lines for user visits and search engine traffic provide insight into your website's health.

When integrated, your log data is available throughout Botify, including LogAnalyzer, SiteCrawler, Analytics Overview, and RealKeywords. Your log data combined with Botify’s crawl data helps you realize the full benefit of Botify with key metrics that give insight into how search engines are crawling your site and how crawl activity correlates to visitor traffic.

👍 We have a secure process for your log file delivery and protect any Personally Identifiable Information (PII). As an additional security measure, we ask that you remove the IP addresses of all user visits from your log files.


Logs Integration Process Overview

The integration process includes the following steps. Please contact Support if you have any questions about the logs integration process.

  1. The client prepares the log files based on the requirements defined in this document.

  2. The client delivers logs to Botify.

  3. Botify delivers the first dashboard to the LogAnalyzer project, and the client validates the data.

After the first delivery and confirmation, data is updated in LogAnalyzer daily.

💡 We strongly recommend defining the segmentation in your project after Botify delivers the first log data to enable detailed analysis based on your website page categories.


Preparing Log Content

We need all files from all your front-end web servers that receive direct requests from crawlers and users.

If you are using a CDN:

  • If all of your traffic goes through your CDN, we only need the CDN logs.

  • If part of your traffic goes directly to your website and is not recorded by the CDN, we need your web server logs and the CDN logs.

Required Fields

Each log line should include the following fields:

  • Date: The exact date of the request, preferably with the timezone (e.g., +0100).
    Apache format example: [Wed Oct 11 14:32:52 2000 +0100]
    IIS format example: 2014-06-24 03:59:59

  • URL: The full URL, including query parameters.
    Apache format example: "GET /cdn/H264-512x384/video/x8qujc.mp4?auth=1393905581-2-18kwwfcc-4a8a74d75a6e4e8575592bece46a8910 HTTP/1.1"
    IIS format example: /world/communaute/moteur/googlepos/todo.asp Instance=jumbo-4&Version=20120717

  • Referer: The page from which the connection was made.

  • User Agent: The browser or bot that issued the request.

  • HTTP Status Code: The HTTP Status Code of the response (e.g., 200, 301, 404).

  • Domain associated with the URL: If the log files contain logs for different subdomains, the virtual host (domain) associated with the URL (e.g., news.example.com), either in the URL or in a separate field.

  • Client IP Address: Optional but recommended. The client IP for crawl lines (the address of the machine sending the HTTP request) allows us to verify the user agent's authenticity. Botify can detect attempts at user-agent spoofing, where search engine user-agent strings are used to access a site illegitimately; sending the client IP address in your server log files ensures that we can identify and exclude spoofed requests.

  • Protocol: The protocol of the request (HTTP or HTTPS), especially for HTTPS websites, either in the URL or in a separate field.

These fields are included in the default log formats on Apache, Nginx, Varnish, and IIS servers.

For non-IIS servers, we recommend using the Apache Combined Log Format unchanged, if you can configure it on your servers:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
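
For reference, a minimal sketch of the Apache directives that produce this format (the file path and directive placement depend on your configuration):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
CustomLog "logs/access.log" combined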

Log File Data Types to Include

Include the following types of data in your log files, filtering content where needed.

Crawls

We detect that a bot explored a page from the User Agent recorded in the logs. By default, we activate detection of Googlebot, Google subbots, and Bing. Please tell us if you also need to activate analysis of Yandex, Naver, or Baidu bots.

The following is a log line example for a crawl from Googlebot:

forum.example.com 123.123.123.123 [17/Nov/2014:02:25:53 +0100] "GET /viewtopic.php?f=1&t=2&start=10 HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 200 23066

Visits

Botify detects a visit from a human user through the referer found in the logs. While not required, using the visits feature in Botify LogAnalyzer is strongly recommended so you can directly compare log-based visits with search engine crawls. If you intend to measure organic visits using your log files, the referer field in your server logs is mandatory.

If you are using a CDN, the Referer may be absent from your log files by default. We encourage you to verify this and ask your CDN team to activate this field before you send us your first log files. If you use Google Ads, please provide us with the URL parameters that identify SEA visits so we do not treat them as SEO visits.

Here is a log line example for an organic visit (where Google.fr referred the user):

www.example.com 123.123.123.123 [17/Nov/2014:09:47:49 +0100] "GET /example/page.html HTTP/1.1" "http://www.google.fr" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0" 200 24230

Domains

If your servers handle several countries or sites, log files may contain lines that do not belong to the domains you want to integrate. In this case, tell us which domains we need to keep or remove from the logs.

Protocol

If you cannot add the protocol in the logs, let us know if we should treat all URLs as HTTP or HTTPS (whichever is the default protocol on your website).

Comments

Log files may contain comments starting with a #.


Delivering Log Files

Processing your web server log files allows us to analyze the main search engine bots' exploration of your website and count SEO visits sent from search engine result pages daily.

Individual Files

Within the server log files, Botify uses only a subset of log lines related to the search engines you want to monitor. You may provide us with this data subset or the full log file content.

  • We need the log files from all web servers, including cache servers, if applicable.

  • The full path of each log file must be unique over time: during daily deliveries, new files must not replace previous files.

  • Each log file should be compressed individually using one of the following formats: gzip, bzip2, zip, xz. We do not accept the following formats: zip files containing multiple files, tar.gz, 7zip, or rar.

  • File names cannot contain any spaces.

Volume

There is no theoretical limit to the number of logs you may upload. If you expect to upload more than 200GB of logs per day, let the Botify Support team know in advance. Our system handles files up to 5GB each. If you deliver larger files, we will need to add a split step before the processing step, which will add delays for the initial setup and before every daily update.

Filtering

At the start of log file processing, Botify runs a filtering process that discards any line that is neither a crawl line nor a visit line, then an anonymization process that discards all IP addresses on visit lines. The result is an anonymized file reduced to its minimal size, which is then processed again to obtain the SEO data displayed in the application. You can choose whether this filtering and anonymization process occurs in the European Union or the United States.

You may filter and anonymize logs yourself before sending them to us: keep only the crawl and visit lines, and remove the client IP address from visit lines. This reduces the volume of sent logs and anonymizes them on your side, as in the sketch below.
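
As an illustration only, here is a minimal shell sketch of this pre-filtering. It assumes the field layout of the example log lines shown earlier (domain, client IP, date, request, referer, user agent, status, size) and limits detection to Googlebot and Bing; adapt the patterns to your format and monitored bots.

# keep crawl lines unchanged; keep organic visit lines with the client IP blanked out
awk '
  /Googlebot|bingbot/ { print; next }              # crawl line: search engine user agent
  /"https?:\/\/[^"]*google\./ { $2 = "-"; print }  # visit line: Google referer; drop IP (2nd field)
' access.log > access.filtered.log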

Delivery Method

Botify offers two methods for delivering log files:

  • You deliver the logs via FTP/FTPS/SFTP storage.

  • Botify fetches the logs from an AWS S3 storage bucket.

FTP/FTPS/SFTP

You can deliver the files through FTP, FTPS, or SFTP to our private storage space at .upload.botify.com.

  • For FTP/FTPS, you can use a password that Botify will provide.

  • For SFTP, we will ask you to send us the public key that you will use to deliver the logs.

Please use the YYYYMMDD.log naming convention and deliver the logs to the /logs/ subdirectory of this location, without additional subdirectories.

Example:

logs/20150130.webserver1.log.gz
logs/20150130.webserver2.log.gz
logs/20150130.webserver3.log.gz
logs/20150130.webserver4.log.gz
logs/20150131.webserver1.log.gz
logs/20150131.webserver2.log.gz
logs/20150131.webserver3.log.gz
logs/20150131.webserver4.log.gz
...

  • The FTP protocol uses ports 20 and 21.

  • The FTPS configuration is more precisely an "FTP over TLS" configuration on a pure-ftpd server. This configuration uses port 21 for the "command" channel and random ports for the "data" channels. For more information on TLS and clients that support it, refer to http://download.pureftpd.org/pure-ftpd/doc/README.TLS.

  • To restrict the network flows, we advise you to configure an IP restriction on the IP address of .upload.botify.com.
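
As an illustration of the compression and delivery steps, here is a sketch; the ACCOUNT prefix and the botify user name are placeholders for the storage space and credentials Botify provides.

# compress each file individually (gzip shown; bzip2, zip, and xz are also accepted)
gzip 20150130.webserver1.log
# deliver to the /logs/ subdirectory over SFTP
sftp botify@ACCOUNT.upload.botify.com <<'EOF'
cd logs
put 20150130.webserver1.log.gz
EOF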

AWS S3

Botify can obtain the server logs from an AWS S3 bucket. To use this method:

  1. Create a dedicated AWS user and a corresponding key pair (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).

  2. Grant authorization for the "s3:List*" and "s3:Get*" actions on your bucket and all relevant subfolders.

Example of a corresponding IAM policy:

{
  "Id": "Policy1488876114416",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1488875932621",
      "Effect": "Allow",
      "Action": [
        "s3:List*",
        "s3:Get*"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket/*",
        "arn:aws:s3:::my-bucket"
      ]
    }
  ]
}
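
If you manage IAM from the command line, the setup might look like the following sketch with the AWS CLI; the user and policy names are illustrative.

# create the dedicated user and attach the read-only policy shown above
aws iam create-user --user-name botify-log-reader
aws iam put-user-policy --user-name botify-log-reader \
    --policy-name botify-s3-read --policy-document file://policy.json
# generate the key pair (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) to share with Botify
aws iam create-access-key --user-name botify-log-reader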

Once the authorization is in place, provide Botify with the following information, which we will use to fetch the logs regularly:

  • The key pair AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY of the newly created account.

  • The name of the AWS region where the bucket is located.

  • The name of the bucket.

  • The time of day to retrieve the logs.

  • Any other useful information to determine what files to fetch (e.g., subfolder, filter to apply on the first characters of the filename).

👉 If the bucket or subfolder contains many files and it is impossible to target the files to fetch based on their prefix, Botify will be unable to fetch them.

Delivering Recurring Log Files

If you deliver logs daily, provide the previous day's logs each night. We take new log files into account every six hours, but you can upload them at any interval. The delay before your data is available in the logs report depends on your data volume; it is usually less than one hour after your logs are taken into account, provided segmentation has not changed since the previous processing.
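
For example, a nightly crontab entry could automate the delivery, assuming a hypothetical ship_botify_logs.sh script that wraps the compression and SFTP steps shown earlier (GNU date; note that % must be escaped in crontab):

# upload yesterday's compressed logs to /logs/ every night at 02:00
0 2 * * * /usr/local/bin/ship_botify_logs.sh "$(date -d yesterday +\%Y\%m\%d)"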

❗️Change Management

To avoid disruption in your recurring log delivery, please alert your Botify account team immediately if any of the following changes occur:

  • A change in the file name format.

  • A change in the file content format.

  • A new file type (new name patterns in the upload folder).

  • A change in the time of upload.


Validating Log Files

As soon as the first files arrive, Botify validates the log files to ensure the following:

  • We support the format.

  • All necessary information is included in the data you provide (all expected fields, at least one line of bot exploration, and at least one line of visits).

After the first log report is delivered to your Botify project, we ask you to validate the following:

Volume Validation

Confirm that the volume of analyzed data matches the expected volume. Please work with your Botify account team to inspect the following and confirm the volumes match your expectations:

  • Number of SEO visits within a day

  • Number of active pages within a day

  • Total crawls from Google within a day

  • Unique crawls from Google within a day

URL and Domain Validation

Verify the URLs and domains included in your LogAnalyzer dashboard match your expectations. In the URL Explorer, filter on “URL does not contain” your expected domains to ensure all URLs match your scope. The Botify Professional Services team can assist in this validation and the “Logs QA” process.


Log File Hosts

We need the logs from your CDN for all web traffic that goes through it. Please refer to the appropriate guide for additional requirements specific to your provider.


Log File Checklist

Use the following checklist to confirm your log files meet all requirements:

  • Each log line includes the required fields: date, URL, referer, user agent, HTTP status code, domain, and protocol.

  • Files come from all front-end web servers and, if applicable, your CDN.

  • The IP addresses of user visits are removed.

  • Each file is compressed individually in gzip, bzip2, zip, or xz format.

  • The full path of each file is unique over time, and file names contain no spaces.

Contact Support

If you need any assistance, please contact Support using the email address for your region.

