This is a reference for writing regular expressions for HTML extracts in Botify.
Overview
You can use regular expressions (regex) to define how to extract information from your pages to create custom fields in Botify. Use this guide as a reference when defining regular expressions in your HTML extracts.
Regular Expression Functions Supported
The following regular expression functions are supported with HTML extracts in Botify:
. : any character
\d : a digit
\D : anything but a digit
\w : a 'word' character (letter, digit or underscore)
\W : anything but a 'word' character (anything but a letter, a digit, or an underscore)
\s : a whitespace character
\S : anything but a whitespace character
\t : tab
\r : carriage return
\n : newline
[xyz] : x, y or z
[x-z] : x, y or z
[^xyz] : neither x, y nor z
x|y : x or y, prefer x
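As a quick illustration of the character classes above, here is a minimal sketch using Python's built-in re module. Python is only a stand-in for experimentation (Botify's own engine may differ in details), and the sample strings are hypothetical.

```python
import re

sample = "SKU-42_b\tprice: 19"

digit_runs = re.findall(r"\d+", sample)            # \d : digits
word_runs = re.findall(r"\w+", sample)             # \w : letters, digits, underscore
has_tab = re.search(r"\t", sample) is not None     # \t : tab
consonants = re.findall(r"[^aeiou\s]", "red car")  # [^...] : anything outside the set
in_range = re.findall(r"[x-z]", "wxyz")            # [x-z] : x, y or z

# x|y prefers the left alternative when both could match at the same position.
preferred = re.search(r"http|https", "https://example.com").group(0)

print(digit_runs, word_runs, has_tab, consonants, in_range, preferred)
```

Note that the alternation returns http rather than https: the left branch succeeds first, so the right branch is never tried.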
Defining the Number of Characters Allowed
Where x is a character, a set of allowed characters, or a group:
x+ : one or more x, take as many as possible
x* : zero or more x, take as many as possible
x? : zero or one x, take one if possible
x{n} : exactly n x
x{n,} : n or more x, take as many as possible
x{n,m} : between n and m x, take as many as possible
x+? : one or more x, take as little as possible
x*? : zero or more x, take as little as possible
x?? : zero or one x, prefer zero
x{n}? : exactly n x (the ? changes nothing here, since the count is fixed)
x{n,}? : n or more x, take as little as possible
x{n,m}? : between n and m x, take as little as possible
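The difference between "as many as possible" and "as little as possible" is easiest to see on a run of digits. The sketch below uses Python's re module as a stand-in for testing, with hypothetical input strings.

```python
import re

digits = "123456"

greedy_plus = re.search(r"\d+", digits).group(0)        # takes every digit
lazy_plus = re.search(r"\d+?", digits).group(0)         # stops after one digit
exactly_three = re.search(r"\d{3}", digits).group(0)    # exactly three digits
greedy_range = re.search(r"\d{2,4}", digits).group(0)   # up to four digits
lazy_range = re.search(r"\d{2,4}?", digits).group(0)    # settles for two digits

# x? makes the preceding character optional.
optional_s = re.findall(r"reviews?", "1 review, 2 reviews")

print(greedy_plus, lazy_plus, exactly_three, greedy_range, lazy_range, optional_s)
```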
Defining Groups of Characters to Extract
(expression) : Numbered capturing group: the first captured group will be called $1 in the output format, the second $2, etc.
(?:expression) : Non-capturing group to specify a group you do not need to capture. It is not counted in the sequence of capturing groups: If there are three groups, and the second is non-capturing, the third group will be called $2.
(?P<name>expression) : Named capturing group: The group will be called $name in the output format.
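The numbering rules can be checked with a short sketch. It assumes Python's re module, where a named group is written (?P<name>expression) and groups are read with m.group(...) rather than $1 or $name; the sample strings are hypothetical.

```python
import re

# Three groups, the second non-capturing: the third is numbered 2.
m = re.search(r"(\w+)(?:@)(\w+)", "user@example")
numbered = f"{m.group(1)}/{m.group(2)}"   # comparable to an output format of $1/$2

# Named groups are read back by name, in any order.
d = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Published 2024-05")
named = f"{d.group('month')}/{d.group('year')}"   # comparable to $month/$year

print(numbered, named)
```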
Advanced: Zero-Width Assertions
The following work as delimiters. They are markers that are not considered a character, meaning that their length is zero, as opposed to one for any "real" character, visible or not.
\A : At beginning of text (text = full HTML page)
\z : At end of text
\b : At beginning or end of a word
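A small sketch of how these assertions behave, assuming Python's re module as a test bench. Python has traditionally spelled the end-of-text assertion \Z rather than \z, so the sketch uses \Z; the page content is hypothetical.

```python
import re

page = "<!DOCTYPE html><html>cat concatenate cat</html>"

starts_with_doctype = re.search(r"\A<!DOCTYPE html>", page) is not None
ends_with_html = re.search(r"</html>\Z", page) is not None  # \Z is Python's spelling

any_cat = re.findall(r"cat", page)        # also finds the "cat" inside "concatenate"
whole_word = re.findall(r"\bcat\b", page) # only the standalone word

# The assertion itself consumes no characters: the match has length zero.
boundary = re.search(r"\b", "cat")
boundary_length = boundary.end() - boundary.start()

print(starts_with_doctype, ends_with_html, len(any_cat), len(whole_word), boundary_length)
```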
Refer to this resource for more information on regular expression syntax and functionality.
Regular Expression Examples
Here are a few examples of HTML extractions:
Extracting a Character String
Extract a rel=alternate tag for mobile devices.
Code example:
<link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/top-10-movies-from-the 80s/">
Regular expression: Place a capturing group between ( ) that starts at the media attribute and allows any character except >. This covers the media value and the rest of the tag while ensuring that nothing after the end of this tag (>) is examined:
<link rel="alternate" (media="[^>]*?")>
Operation: Extract First Item Matched
Output format: $1
Type: String
Result from our example: media="only screen and (max-width: 640px)" href="http://m.example.com/top-10-movies-from-the 80s/"
Alternatively, use two extraction rules to extract the media value and the URL separately.
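To try this extraction outside Botify, here is a minimal sketch with Python's re module (a stand-in, not Botify's engine). The first pattern is the one from the example; the two patterns for separate rules are hypothetical illustrations of the alternative, not patterns given in this guide.

```python
import re

html = ('<link rel="alternate" media="only screen and (max-width: 640px)" '
        'href="http://m.example.com/top-10-movies-from-the 80s/">')

# Pattern from the example: one group covering media and href together.
combined = re.search(r'<link rel="alternate" (media="[^>]*?")>', html).group(1)

# Hypothetical patterns for two separate extraction rules.
media_only = re.search(r'<link rel="alternate" media="([^"]*)"', html).group(1)
href_only = re.search(r'<link rel="alternate" [^>]*?href="([^"]*)"', html).group(1)

print(combined)
print(media_only)
print(href_only)
```

Using [^"]* inside the quotes keeps each capture within a single attribute value.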
Extracting a Whole Number with a Comma Separator
Extract a number of customer reviews (between 1 and 999,999).
Regular expression: Copy the line in the code that displays the number of reviews, then replace the number with two capturing groups:
<span id="CustomerReviewText" class="a-size-base">(\d+),?(\d+)? customer reviews</span>
First capturing group: (\d+) for the first set of digits found.
Optional separator: ,? (or [,\.]? if you need to allow either , or . as a separator).
Optional second capturing group: (\d+)?
Operation: Extract First Item Matched
Output format: $1$2 ($2 may be empty)
Type: Integer
Examples:
Extracting 56,278: $1=56, $2=278, the result is 56278
Extracting 23: $1=23, $2 is empty, the result is 23
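The two cases above can be reproduced with a short sketch. It assumes Python's re module as a test bench; review_count is a hypothetical helper that imitates the $1$2 output format followed by the Integer type conversion.

```python
import re

PATTERN = re.compile(
    r'<span id="CustomerReviewText" class="a-size-base">'
    r'(\d+),?(\d+)? customer reviews</span>'
)

def review_count(html):
    """Return the review count as an int, or None when the span is absent."""
    m = PATTERN.search(html)
    if m is None:
        return None
    # An unmatched optional group is None in Python; treat it as empty, like $2.
    return int(m.group(1) + (m.group(2) or ""))

TEMPLATE = '<span id="CustomerReviewText" class="a-size-base">{} customer reviews</span>'

print(review_count(TEMPLATE.format("56,278")))
print(review_count(TEMPLATE.format("23")))
```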
Larger Numbers
If the number is between 1 and 999,999,999, you can follow the same approach: add another optional separator and a third optional capturing group.
Regular expression: (\d+),?(\d+)?,?(\d+)?
Operation: Extract First Item Matched
Output format: $1$2$3 ($2 and $3 may be empty)
Type: Integer
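The same idea extends to three groups. In this sketch (Python's re module as a stand-in), to_int is a hypothetical helper that joins whichever groups matched, mirroring the $1$2$3 output format.

```python
import re

BIG_NUMBER = re.compile(r"(\d+),?(\d+)?,?(\d+)?")

def to_int(text):
    """Join the captured digit groups and convert the result to an int."""
    m = BIG_NUMBER.search(text)
    if m is None:
        return None
    # Unmatched optional groups are None; replace them with empty strings.
    return int("".join(group or "" for group in m.groups()))

print(to_int("1,234,567"), to_int("12,345"), to_int("7"))
```

In a real extract, the pattern would normally be surrounded by the page's own markup, as in the previous example, so that it does not match the first number found anywhere on the page.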
Extracting a Price
Regular expression: Copy the line that displays the price from the page's code, then replace the number after the "$" sign with three capturing groups: the first two for the digits before and after the thousands separator, as in the previous example, and a third, optional group for the decimal point and decimals. The $ is escaped because it is a special character in regular expression syntax:
<span id="priceblock_ourprice" class="a-size-medium a-color-price">\$(\d+),?(\d+)?(\.\d+)?</span>
Operation: Extract First Item Matched
Output format: $1$2$3 ($2 and / or $3 may be empty; note that when there is a decimal part, the decimal dot is included in $3)
Type: Float (as the number may include decimals)
Examples:
Extracting 1,200: $1=1, $2=200, $3 is empty, the result is 1200
Extracting 49.99: $1=49, $2 is empty, $3=.99, the result is 49.99
Extracting 5,049.99: $1=5, $2=049, $3=.99, the result is 5049.99
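The three price cases can be verified with the sketch below, which assumes Python's re module as a test bench. extract_price is a hypothetical helper imitating the $1$2$3 output format with the Float type.

```python
import re

PRICE = re.compile(
    r'<span id="priceblock_ourprice" class="a-size-medium a-color-price">'
    r'\$(\d+),?(\d+)?(\.\d+)?</span>'
)

def extract_price(html):
    """Return the price as a float, or None when the span is absent."""
    m = PRICE.search(html)
    if m is None:
        return None
    # Concatenate the groups; the decimal point travels inside group 3.
    return float("".join(group or "" for group in m.groups()))

SPAN = '<span id="priceblock_ourprice" class="a-size-medium a-color-price">{}</span>'

for shown in ("$1,200", "$49.99", "$5,049.99"):
    print(shown, extract_price(SPAN.format(shown)))
```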
Checking if Web Analytics Tags are Implemented
Regular expression: The following is a copy of the line in the code that contains the tag information sent to Google Analytics:
ga\('create', 'UA-384998461-2', 'mywebsite.com'\);
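There is no capturing group; you only want to check if this tag exists. To match this exact character string, every character is written out (no wildcards or character ranges are used) and the parentheses are escaped. Strictly speaking, the unescaped . in mywebsite.com matches any character; write mywebsite\.com if you need an exact match.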
Operation: Check if exists
Output format: Not needed, so this entry field disappears when selecting "Check if exists" as the operation. The extracted field will contain True/False for this type of operation.
Alternative: Instead of checking the presence of a particular tag value, extract the site ID and site domain from the tag. This will be required if the analysis covers two subdomains tracked under distinct IDs.
Regular expression: Contains two capturing groups, one for each of these elements:
ga\('create', '(UA-[\d-]+)', '([\w\.]+)'\);
First capturing group: (UA-[\d-]+) for 'UA-' followed by a succession of digits or -,
Second capturing group: ([\w\.]+) for a succession of word characters (letter, digit or '_') or periods (.). The . is escaped (just like the $ earlier, as it is also a special character for regular expressions).
Operation: Extract First Item Matched
Output format: $1/$2, for instance (the captured tag ID and captured site domain separated by a '/').
It could also be id:$1/domain:$2.
Type: String
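Both variants, the existence check and the two-group extraction, can be tried with the following sketch. It assumes Python's re module as a stand-in, and the script line is the one from the example.

```python
import re

script = "ga('create', 'UA-384998461-2', 'mywebsite.com');"

# "Check if exists": True when the literal tag is present, False otherwise.
tag_exists = re.search(
    r"ga\('create', 'UA-384998461-2', 'mywebsite.com'\);", script
) is not None

# Extraction with two capturing groups, formatted like $1/$2.
m = re.search(r"ga\('create', '(UA-[\d-]+)', '([\w\.]+)'\);", script)
formatted = f"{m.group(1)}/{m.group(2)}"

# The alternative output format id:$1/domain:$2.
labelled = f"id:{m.group(1)}/domain:{m.group(2)}"

print(tag_exists, formatted, labelled)
```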
Troubleshooting
The following are the most common mistakes when writing regular expressions.
Failing to Escape Characters
When you want to include a character with a specific meaning in the regular expression syntax, you must add an escape character (\) before it.
Common Examples:
. (full stop/period) : This regular expression function allows any character. To look for this specific character (.), add \ before it.
( and ) : These define groups. To look for parentheses in your page code, enter \( and \).
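The effect of a missing escape can be seen in a short sketch, assuming Python's re module as a test bench with hypothetical strings. re.escape is a Python convenience for escaping a literal string; it is not a Botify feature.

```python
import re

# An unescaped . matches any character, so this matches the wrong text.
loose_dot = re.search(r"example.com", "exampleXcom") is not None
strict_dot = re.search(r"example\.com", "exampleXcom") is not None

snippet = "ga('create')"

# Unescaped parentheses define a group, so the literal ( is never matched.
unescaped_parens = re.search(r"ga('create')", snippet) is not None
escaped_parens = re.search(r"ga\('create'\)", snippet) is not None

# Python can produce an escaped pattern from a literal string.
auto_escaped = re.search(re.escape(snippet), snippet) is not None

print(loose_dot, strict_dot, unescaped_parens, escaped_parens, auto_escaped)
```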
Capturing Too Much
Regular expressions may capture more than you expect or need in the following situations:
Using a "greedy" regular expression (take as much as possible) when you need a "lazy" one (take as little as possible). For example, if nested HTML tags exist, a greedy match will stop at the outer closing tag, while a lazy match will stop at the inner closing tag. Add ? after the quantifier (for example, .*? instead of .*) to make it lazy.
Using a regular expression that is not specific enough. For instance, a portion of the expression should only include digits, but you allowed any character, so the regex may match unwanted expressions.
Using a regular expression with too many optional parts: avoid the * quantifier, which allows "zero or more" occurrences of the preceding character, where at least one occurrence is required; use + instead.
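The first two situations can be reproduced with a small sketch, assuming Python's re module as a stand-in and hypothetical page fragments.

```python
import re

nested = "<div>a<div>b</div></div>"

# Greedy: runs to the last </div> (the outer closing tag).
greedy = re.search(r"<div>(.*)</div>", nested).group(1)
# Lazy: stops at the first </div> (the inner closing tag).
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

text = "Items: n/a | Items: 42"

# Not specific enough: \S+ accepts any non-whitespace text.
loose = re.search(r"Items: (\S+)", text).group(1)
# Specific: \d+ only accepts digits, so the first occurrence is skipped.
strict = re.search(r"Items: (\d+)", text).group(1)

print(greedy, lazy, loose, strict)
```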
Capturing Too Little
To avoid regular expressions that capture less than you expect, make sure your regular expression includes at least one required part alongside any optional parts. If everything in your regular expression is optional, it will match an empty string, so results may include empty strings even though non-empty matches exist further down the page.
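A minimal sketch of the empty-match problem, assuming Python's re module as a test bench and a hypothetical input string:

```python
import re

text = "Reviews: 56,278"

# Every part is optional, so the pattern succeeds at the very first
# position by matching nothing at all.
all_optional = re.search(r"(\d+)?,?(\d+)?", text).group(0)

# Making the first group required forces the match onto real digits.
with_required = re.search(r"(\d+),?(\d+)?", text).group(0)

print(repr(all_optional), repr(with_required))
```

The all-optional pattern returns an empty string at the start of the text and never reaches the number.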