📘 This article explains creating custom fields through HTML extracts in Botify.
Overview
You can use the Botify crawler to extract custom data from your pages and have that data appear as custom fields in your Botify reports. Construct HTML extracts in your project to define the type and location of data to extract from your pages with every Botify crawl.
Defining HTML Extraction Rules
You can define up to five HTML extraction rules per project. Access HTML extraction rules by navigating to Project Settings > HTML Extract tab from the main project navigation bar.
Creating an HTML extraction rule includes the following steps:
Choosing the Extraction Method
Select one of the following methods for extracting the HTML:
CSS Selector: Use a CSS Selector to define the HTML fragments, attributes, or tags that contain the custom data you want to extract in Botify. CSS selectors identify specific HTML elements and extract either the inner HTML (e.g., text) or an HTML attribute (e.g., “href” or “alt”). You can identify a CSS selector even if you are not an experienced user of HTML/CSS. Use this reference for detailed CSS selector syntax.
Regular Expression: Regular expressions (regex) can be more flexible than CSS selectors when extracting data but are typically more complex to write. This article includes several examples of regular expressions for extracting data that can be a good starting point.
Naming the Extraction Rule
The Rule Name identifies your custom field in filters and report columns. Choose an explicit name that identifies the field content (e.g., Comment Count, Missing or Empty Alt Tag).
Identifying the HTML Code
On a page on your website that includes the information you want to extract, find the piece of HTML code that contains one of the following types of information:
Information visible to users: For example, the number of comments on a blog post.
Information only in the page code: For example, a Google Analytics tag.
Locating Code for Information Visible to Users
Right-click on the information you want to extract. The following is an example in Google Chrome to extract the number of comments:
Select Inspect:
Locating Code for Information Only in the Page Code
Right-click on the page, then select View Source to open the full-page HTML source code. The following is an example in Google Chrome to extract a Google Analytics tracking tag.
Search for the information you are looking for ("UA-" at the beginning of the Google Analytics tag in this example) using CTRL+F:
Writing the Regular Expression
To find the code you want to extract by writing a regular expression:
Determine where to begin and where to end. The regular expression should be strict enough to capture only the piece of code you want and not other similar pieces. For instance, it could begin with a section name or another fixed element not found elsewhere in the code and end with the closing tag immediately following.
In the comments example from above, the regular expression could capture the following:
To copy the piece of code you want to extract, right-click it in your browser’s Developer Tools and select Edit as HTML.
Paste your regular expression into the Regex field in the extraction rule.
💡 When using “Inspect Element” in your browser, you are viewing the fully-rendered version of your page. If Botify is not crawling your site with JavaScript crawling enabled, you should use “View Source” instead to view the static HTML.
Regex Variables
In most cases, the regular expression will allow one or several variable parts, some of which may be captured to be included in the custom field value. For example, to extract the number of comments, replace the actual number of comments with a regular expression token that allows a succession of digits, \d+, and capture that number by enclosing the token in parentheses, (\d+):
<h3 class="single-block-title comments-title">(\d+) comments</h3>
👉 This is a simple example where the number is between 1 and 999. To extract a number with a thousand separator, see "Extracting a whole number" in the HTML Extract FAQs examples.
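As a rough illustration, the pattern above can be tried outside Botify with Python's re module. This is only a sketch: the sample HTML is made up, and Botify's regex engine may not behave identically to Python's.

```python
import re

# The pattern from the example above; the parentheses capture the digits.
COMMENT_COUNT = re.compile(
    r'<h3 class="single-block-title comments-title">(\d+) comments</h3>'
)

# Hypothetical page fragment, invented for this illustration.
sample_html = (
    '<div><h3 class="single-block-title comments-title">42 comments</h3></div>'
)

match = COMMENT_COUNT.search(sample_html)
# group(1) is the captured variable part, i.e. the digits only.
comment_count = int(match.group(1)) if match else None
print(comment_count)
```

Because \d+ only accepts digits, a value written with a thousand separator (such as 1,204) does not match this pattern, which is why that case needs a different expression.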
The following are other examples where the actual text extracted is less important:
Checking the presence of a specific web analytics tag
Checking the presence of "No results" in an internal search result page
Checking the presence of "Page not found" to be able to identify soft 404s
Checking the presence of comments without extracting the number
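For presence checks like these, no capturing group is needed because only a yes/no answer matters. The following sketch assumes a made-up soft 404 page and a hypothetical helper named check_if_exists; it mimics the idea locally and is not Botify's implementation.

```python
import re

def check_if_exists(pattern: str, html: str) -> bool:
    """Return True when the pattern matches anywhere in the HTML."""
    return re.search(pattern, html) is not None

# Hypothetical pattern: "Page not found" inside an <h1>, allowing extra whitespace.
SOFT_404_PATTERN = r"<h1>\s*Page not found\s*</h1>"

# Invented sample pages for illustration.
soft_404_page = "<html><body><h1>Page not found</h1></body></html>"
normal_page = "<html><body><h1>Blue widgets</h1></body></html>"

print(check_if_exists(SOFT_404_PATTERN, soft_404_page))
print(check_if_exists(SOFT_404_PATTERN, normal_page))
```

Anchoring the text inside a specific tag (here <h1>) keeps the check from firing on pages that merely mention the phrase elsewhere.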
Creating the CSS Selector
To find the code you want to extract using a CSS selector:
Use your browser’s Developer Tools (“Inspect Element”) to copy the desired selector directly. For example, to check whether a post has an embedded Instagram post, copy the CSS selector from Chrome Developer Tools:
Copying the selector gives the following, which checks for the “instagram-embed” CSS ID:
#instagram-embed-0
Paste the selector into the CSS Selector field.
Choose the extraction method:
Inner HTML: The inner HTML of this tag (the contents of the <iframe>).
Attribute: The value of one of this tag's attributes (the URL in the SRC attribute in the example above to get the Instagram URL).
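To make the Attribute option more concrete, the sketch below uses Python's standard html.parser to find the element whose id matches the #instagram-embed-0 selector and read its src attribute. The class name, helper name, and sample URL are hypothetical, and the code handles only a plain ID lookup rather than full CSS selector syntax.

```python
from html.parser import HTMLParser

class AttributeById(HTMLParser):
    """Records one attribute of the first element that has the given id."""

    def __init__(self, element_id, attribute):
        super().__init__()
        self.element_id = element_id
        self.attribute = attribute.lower()  # the parser lowercases attribute names
        self.value = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.value is None and attr_map.get("id") == self.element_id:
            self.value = attr_map.get(self.attribute)

def extract_attribute(html, element_id, attribute):
    """Return the attribute value, or None when the element is not found."""
    parser = AttributeById(element_id, attribute)
    parser.feed(html)
    parser.close()
    return parser.value

# Invented fragment; the URL is a placeholder, not a real post.
sample_html = (
    '<div class="post"><iframe id="instagram-embed-0" '
    'src="https://www.instagram.com/p/EXAMPLE/embed/"></iframe></div>'
)

print(extract_attribute(sample_html, "instagram-embed-0", "src"))
```

A result of None here corresponds to a page without the embed, which is the case a presence-style check would report as missing.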
Choosing the Extraction Operation
Choose the "Operation" type to define what to do with the data extracted:
Extract First Item Matched: Extract one or several variable parts from the first match on the page.
Extract First 3 Item Matched: Extract one or several variable parts from each of the first three matches on the page.
Count Number of Occurrences: Count the number of matches.
Check if Exists: Check if there is a match.
Text Length: Count the number of characters in the first match found.
Specify the output format and type in the next step if extracting a variable part. Otherwise, go directly to Testing the Extraction Rule.
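The difference between the operations can be sketched with a few small Python functions applied to the same pattern. The function names and sample HTML are hypothetical, and treating Text Length as the length of the captured part is an assumption made for this illustration.

```python
import re

# Hypothetical rule: capture the alt text of each image.
ALT_PATTERN = r'<img alt="([^"]*)"'

def first_item(pattern, html):
    """Extract First Item Matched: captured part of the first match."""
    m = re.search(pattern, html)
    return m.group(1) if m else None

def first_three_items(pattern, html):
    """Extract First 3 Items Matched: captured part of up to three matches."""
    return [m.group(1) for m in re.finditer(pattern, html)][:3]

def count_occurrences(pattern, html):
    """Count Number of Occurrences: how many times the pattern matches."""
    return sum(1 for _ in re.finditer(pattern, html))

def check_if_exists(pattern, html):
    """Check if Exists: True when there is at least one match."""
    return re.search(pattern, html) is not None

def text_length(pattern, html):
    """Text Length: characters in the first match (captured part assumed)."""
    m = re.search(pattern, html)
    return len(m.group(1)) if m else 0

# Invented sample markup with four images, one with an empty alt.
sample_html = (
    '<img alt="Red shoe"><img alt="Blue shoe">'
    '<img alt=""><img alt="Green boot">'
)

print(first_item(ALT_PATTERN, sample_html))
print(first_three_items(ALT_PATTERN, sample_html))
print(count_occurrences(ALT_PATTERN, sample_html))
print(check_if_exists(ALT_PATTERN, sample_html))
print(text_length(ALT_PATTERN, sample_html))
```

Only the first two operations return a variable part, which is why they are the ones that need an output format and type.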
Selecting the Output Format and Type
The Output type indicates how the field will be stored: as a character string or as a number. For a number, you must indicate whether it is a whole number (integer) or a decimal number (float), and the output format must contain only digits (plus an optional decimal separator for a float).
The output format indicates what you want to see in the extracted field in the URL Explorer. It only applies if you choose to extract data in the Operation field (as opposed to count occurrences or check presence).
There will often be only one capturing group in the regular expression, and the output format will simply be that group's variable ($1), for instance, the vendor name in a marketplace. When you capture several groups, you must indicate how they will be displayed. For example, with pagination:
If you capture the current page number and the page number in the "Last" link, you could choose the output format: Page $1 out of $2 where $1 is the current page number and $2 is the number of pages in the list.
In this example, you could also use two extraction rules to store the current page and the number of pages in the list separately to sort your URLs based on either of these numbers (which would not be possible with Page $1 out of $2 as this result would be a character string).
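The pagination case can be sketched in Python as follows. The markup, the pattern, and the apply_output_format helper are all hypothetical; the helper only imitates how $1 and $2 placeholders could be filled from capturing groups.

```python
import re

# Hypothetical pattern: group 1 is the current page, group 2 the last page.
PAGINATION = re.compile(
    r'<span class="current">(\d+)</span>.*?'
    r'<a class="last" href="[^"]*page=(\d+)">Last</a>',
    re.DOTALL,
)

def apply_output_format(match, output_format):
    """Replace each $N placeholder with the text of capture group N."""
    return re.sub(
        r"\$(\d+)",
        lambda placeholder: match.group(int(placeholder.group(1))),
        output_format,
    )

# Invented pagination block for illustration.
sample_html = (
    '<nav><span class="current">3</span>'
    '<a href="/list?page=4">4</a>'
    '<a class="last" href="/list?page=12">Last</a></nav>'
)

match = PAGINATION.search(sample_html)
combined = apply_output_format(match, "Page $1 out of $2")  # a text value
current_page = int(apply_output_format(match, "$1"))        # sortable integer
last_page = int(apply_output_format(match, "$2"))           # sortable integer
print(combined, current_page, last_page)
```

The combined value is a character string, while the two single-group formats yield integers, which mirrors the trade-off between one combined rule and two separate rules.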
Output Formats
Text: Any character string.
Integer number: A numerical value with no decimal point (whole number) and no thousands separator.
Floating point number: A numerical value with an optional decimal separator and no thousands separator. Botify will store up to 10 decimals and round up the last one if there are more. You will be prompted to select the decimal separator.
Date: Several date formats are available.
Testing the Extraction Rule
You can test your extraction rules on sample pages to validate that they work as expected, either before or after you save them. Test your extraction rule by providing the test page URL or HTML code containing the information you want to extract.
If successful, the value of the extracted field is displayed:
If unsuccessful, the following is displayed:
Ensure you repeat the test with several pages, especially if variations in the extracted code exist. When satisfied, click Create Rule.
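The value of testing several pages can be seen in a small local sketch. The page paths and HTML fragments below are invented, and the run_tests helper is hypothetical. It shows the comment-count pattern missing a page that uses the singular "comment", and a slightly relaxed pattern that covers both variations.

```python
import re

STRICT = re.compile(
    r'<h3 class="single-block-title comments-title">(\d+) comments</h3>'
)
# Relaxed variant: the trailing "s" is optional.
RELAXED = re.compile(
    r'<h3 class="single-block-title comments-title">(\d+) comments?</h3>'
)

# Invented sample pages covering likely variations.
pages = {
    "/post-a": '<h3 class="single-block-title comments-title">12 comments</h3>',
    "/post-b": '<h3 class="single-block-title comments-title">1 comment</h3>',
    "/post-c": "<p>This post has no comment block.</p>",
}

def run_tests(pattern, sample_pages):
    """Return the extracted value (or None) for each sample page."""
    results = {}
    for url, html in sample_pages.items():
        m = pattern.search(html)
        results[url] = m.group(1) if m else None
    return results

strict_results = run_tests(STRICT, pages)
relaxed_results = run_tests(RELAXED, pages)
print(strict_results)
print(relaxed_results)
```

A page that returns None with both patterns, like the last one here, is expected when the information is genuinely absent from the page.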
Once your next analysis is complete, the number of URLs from which each custom field was extracted is displayed in the Extracted HTML Code report in SiteCrawler, and the details are available in the URL Explorer.