Regular Expressions in HTML Extracts

Updated over a year ago

📘 This is a reference for writing regular expressions for HTML extracts in Botify.

Overview

You can use regular expressions (regex) to define how to extract information from your pages to create custom fields in Botify. Use this guide as a reference when defining regular expressions in your HTML extracts.

Regular Expression Functions Supported

The following regular expression functions are supported with HTML extracts in Botify:

. : any character

\d : a digit

\D : anything but a digit

\w : a 'word' character (letter, digit or underscore)

\W : anything but a 'word' character (anything but a letter, a digit, or an underscore)

\s : a whitespace character

\S : anything but a whitespace character

\t : tab

\r : carriage return

\n : newline

[xyz] : x, y or z

[x-z] : any character in the range x to z (here x, y or z)

[^xyz] : neither x, y nor z

x|y : x or y, prefer x
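
The character classes above can be tried outside Botify before defining an extract. The following is a minimal sketch that assumes Python's re module, which treats these particular constructs the same way; Botify's own engine may differ in details, and the sample strings are invented for illustration.

```python
import re

# Hypothetical sample text: a word character run, a tab, digits and a newline.
sample = "Item_42:\tprice 19 EUR\n"

digits = re.findall(r"\d+", sample)        # runs of digits
words = re.findall(r"\w+", sample)         # runs of letters, digits or underscores
non_space = re.findall(r"\S+", sample)     # runs of anything but whitespace

in_set = re.findall(r"[aeiou]", "regex")       # any one of the listed characters
not_in_set = re.findall(r"[^aeiou]", "regex")  # anything but the listed characters
in_range = re.findall(r"[x-z]", "xyzzy abc")   # any character from x to z

# x|y tries x first: on "ab", the pattern a|ab stops at "a".
preferred = re.search(r"a|ab", "ab").group(0)

print(digits, words, non_space, in_set, not_in_set, in_range, preferred)
```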

Defining the Number of Characters Allowed

Where x is a character, a set of allowed characters, or a group:

  • x+ : one or more x, take as many as possible

  • x* : zero or more x, take as many as possible

  • x? : zero or one x, take one if possible

  • x{n} : exactly n x

  • x{n,} : n or more x, take as many as possible

  • x{n,m} : between n and m x, take as many as possible

  • x+? : one or more x, take as little as possible

  • x*? : zero or more x, take as little as possible

  • x?? : zero or one x, prefer zero

  • x{n}? : exactly n x (same as x{n}, since a fixed count leaves nothing to minimize)

  • x{n,}? : n or more x, take as little as possible

  • x{n,m}? : between n and m x, take as little as possible
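
To see the difference between the "as many as possible" and "as little as possible" forms, the quantifiers above can be run against a short string. This is a sketch using Python's re module as a stand-in (an assumption; Botify's engine is not Python), with a made-up input of four a characters.

```python
import re

text = "aaaa"

# Each quantifier from the list, applied at the start of the text.
patterns = ["a+", "a*", "a?", "a{2}", "a{2,}", "a{2,3}",
            "a+?", "a*?", "a??", "a{2}?", "a{2,}?", "a{2,3}?"]

matched = {p: re.match(p, text).group(0) for p in patterns}

for pattern, value in matched.items():
    print(f"{pattern:8} -> {value!r}")
```

The greedy forms take the longest run the rule allows (a{2,3} returns three characters), while the lazy forms stop at the minimum (a{2,3}? returns two, and a*? returns an empty string).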

Defining Groups of Characters to Extract

  • (expression) : Numbered capturing group: the first captured group will be called $1 in the output format, the second $2, etc.

  • (?:expression) : Non-capturing group to specify a group you do not need to capture. It is not counted in the sequence of capturing groups: If there are three groups, and the second is non-capturing, the third group will be called $2.

  • (?P<name>expression) : Named capturing group: The group will be called $name in the output format.
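
How group numbering skips non-capturing groups can be checked with a short script. The sketch below assumes Python's re module, where groups are read with match.group(...) rather than Botify's $1 / $name output format; the sample text and the group name month are hypothetical.

```python
import re

text = "Published: 2024-05"

# Three groups in the pattern; the second is non-capturing,
# so the month is capturing group 2, not 3.
pattern = re.compile(r"(\d{4})(?:-|/)(?P<month>\d{2})")
match = pattern.search(text)

year = match.group(1)                 # first capturing group ($1 in Botify)
month_by_number = match.group(2)      # second capturing group ($2 in Botify)
month_by_name = match.group("month")  # same group, read by its name

print(year, month_by_number, month_by_name)
```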

Advanced: Zero-Width Assertions

The following work as delimiters. They are markers that are not considered a character, meaning that their length is zero, as opposed to one for any "real" character, visible or not.

  • \A : At beginning of text (text = full HTML page)

  • \z : At end of text

  • \b : At beginning or end of a word
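
The zero-width behavior can be illustrated as follows. This sketch assumes Python's re module, which writes the end-of-text assertion as \Z, so that spelling is used here in place of \z; the page content is invented.

```python
import re

page = "<!DOCTYPE html><html><body>cat concatenate cat.</body></html>"

# \A only matches at the very start of the text.
starts_with_doctype = re.search(r"\A<!DOCTYPE html>", page) is not None

# End-of-text assertion (\Z in Python, standing in for \z).
ends_with_html = re.search(r"</html>\Z", page) is not None

# \b keeps "cat" from matching inside "concatenate".
whole_words = re.findall(r"\bcat\b", page)
any_occurrence = re.findall(r"cat", page)

# The assertions add no characters to the match: it is still 3 long.
match_length = len(re.search(r"\bcat\b", page).group(0))

print(starts_with_doctype, ends_with_html, len(whole_words),
      len(any_occurrence), match_length)
```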

Refer to this resource for more information on regular expression syntax and functionality.

Regular Expression Examples

Here are a few examples of HTML extractions:

Extracting a Character String

Extract a rel=alternate tag for mobile devices.

Code example:

<link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/top-10-movies-from-the 80s/">

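Regular expression: Add a capturing group, written between ( and ), that allows any character except >. Because > cannot be matched inside the group, the capture stops at the end of this tag and nothing after the closing > is examined. The group runs from media= to the last quotation mark before >, so it also includes the href attribute: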

<link rel="alternate" (media="[^>]*?")>

Operation: Extract First Item Matched

Output format: $1

Type: String

Result from our example: media="only screen and (max-width: 640px)" href="http://m.example.com/top-10-movies-from-the 80s/"

Alternatively, use two extraction rules to extract the media value and the URL separately.
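
The single-rule extraction and the two-rule alternative can be compared with a short test. This is a sketch assuming Python's re module behaves like Botify's engine for these patterns; the two separate patterns are one possible way to write the alternative rules, not expressions taken from Botify.

```python
import re

html = ('<link rel="alternate" media="only screen and (max-width: 640px)" '
        'href="http://m.example.com/top-10-movies-from-the 80s/">')

# Single rule from the example: captures media and href together.
combined = re.search(r'<link rel="alternate" (media="[^>]*?")>', html).group(1)

# Hypothetical separate rules: [^"]* stops at the closing quote of each value.
media = re.search(r'<link rel="alternate" media="([^"]*)"', html).group(1)
href = re.search(r'<link rel="alternate" media="[^"]*" href="([^"]*)"',
                 html).group(1)

print(combined)
print(media)
print(href)
```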

Extracting a Whole Number with a Comma Separator

Extract a number of customer reviews (between 1 and 999,999).

Regular expression: The following is a copy of the line in the page code that displays the number of reviews, with the number replaced by two capturing groups.

<span id="CustomerReviewText" class="a-size-base">(\d+),?(\d+)? customer reviews</span>

First capturing group: (\d+) for the first set of digits found.

Optional separator: ,? (or [,\.]? if you need to allow either , or . as a separator).

Optional second capturing group: (\d+)?

Operation: Extract First Item Matched

Output format: $1$2 ($2 may be empty)

Type: Integer

Examples:

Extracting 56,278: $1=56, $2=278, the result is 56278

Extracting 23: $1=23, $2 is empty, the result is 23
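
The two cases above can be reproduced with a small script. The sketch assumes Python's re module; joining the groups in code stands in for Botify's $1$2 output format, and the function name review_count is hypothetical.

```python
import re

PATTERN = re.compile(
    r'<span id="CustomerReviewText" class="a-size-base">'
    r'(\d+),?(\d+)? customer reviews</span>'
)

def review_count(html):
    """Return the review count as an integer, or None if the line is absent."""
    match = PATTERN.search(html)
    if match is None:
        return None
    # An unmatched optional group is None in Python; treat it as empty.
    return int((match.group(1) or "") + (match.group(2) or ""))

with_separator = review_count(
    '<span id="CustomerReviewText" class="a-size-base">56,278 customer reviews</span>')
without_separator = review_count(
    '<span id="CustomerReviewText" class="a-size-base">23 customer reviews</span>')

print(with_separator, without_separator)
```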

Larger Numbers

If the number is between 1 and 999,999,999, you can follow the same approach: add another optional separator and a third optional capturing group.

Regular expression: (\d+),?(\d+)?,?(\d+)?

Operation: Extract First Item Matched

Output format: $1$2$3 ($2 and $3 may be empty)

Type: Integer
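
A quick check of the three-group version, again as a sketch in Python's re module (an assumption about the engine); to_integer is an illustrative helper that concatenates whichever groups matched.

```python
import re

PATTERN = re.compile(r"(\d+),?(\d+)?,?(\d+)?")

def to_integer(text):
    """Concatenate the captured digit groups, skipping the ones left empty."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return int("".join(group or "" for group in match.groups()))

millions = to_integer("1,234,567")
thousands = to_integer("56,278")
units = to_integer("23")

print(millions, thousands, units)
```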

Extracting a Price

Regular expression: The following is a copy of the line in the page code that displays the price, with the number after the "$" sign replaced by three capturing groups: the first two for the digits before and after the thousands separator, as in the previous example, and a third, optional group for the decimal point and decimals. The $ is escaped because it is a special character in regular expression syntax:

<span id="priceblock_ourprice" class="a-size-medium a-color-price">\$(\d+),?(\d+)?(\.\d+)?</span>

Operation: Extract First Item Matched

Output format: $1$2$3 ($2 and / or $3 may be empty; note that when there is a decimal part, the decimal dot is included in $3)

Type: Float (as the number may include decimals)

Examples:

Extracting 1,200: $1=1, $2=200, $3 is empty, the result is 1200

Extracting 49.99: $1=49, $2 is empty, $3=.99, the result is 49.99

Extracting 5,049.99: $1=5, $2=049, $3=.99, the result is 5049.99
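
The three price examples can be verified as follows. This is a sketch assuming Python's re module; converting the joined groups with float() stands in for Botify's Float type, and parse_price is a hypothetical name.

```python
import re

PATTERN = re.compile(
    r'<span id="priceblock_ourprice" class="a-size-medium a-color-price">'
    r'\$(\d+),?(\d+)?(\.\d+)?</span>'
)

def parse_price(html):
    """Join the captured groups ($1$2$3) and convert the result to a float."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return float("".join(group or "" for group in match.groups()))

def price_line(amount):
    # Builds the sample line of page code around a displayed amount.
    return ('<span id="priceblock_ourprice" '
            f'class="a-size-medium a-color-price">${amount}</span>')

prices = [parse_price(price_line(a)) for a in ("1,200", "49.99", "5,049.99")]
print(prices)
```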

Checking if Web Analytics Tags are Implemented

Regular expression: The following is a copy of the line in the code that contains the tag information sent to Google Analytics:

ga\('create', 'UA-384998461-2', 'mywebsite.com'\);

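There is no capturing group; you only want to check whether this tag exists. The expression spells out the tag literally, with no character classes or quantifiers, and the parentheses are escaped because they would otherwise define a group. Strictly speaking, the unescaped . in mywebsite.com matches any character; write \. to match only a dot.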

Operation: Check if exists

Output format: Not needed, so this entry field disappears when selecting "Check if exists" as the operation. The extracted field will contain True/False for this type of operation.

Alternative: Instead of checking the presence of a particular tag value, extract the site ID and site domain from the tag. This will be required if the analysis covers two subdomains tracked under distinct IDs.

Regular expression: Contains two capturing groups, one for each of these elements:

ga\('create', '(UA-[\d-]+)', '([\w\.]+)'\);

First capturing group: (UA-[\d-]+) for 'UA-' followed by a succession of digits or -,

Second capturing group: ([\w\.]+) for a succession of word characters (letter, digit or '_') or dots. The . is escaped, just like the $ earlier, as it is also a special character in regular expressions.

Operation: Extract First Item Matched

Output format: $1/$2, for instance (the captured tag ID and captured site domain separated by a '/').

It could also be id:$1/domain:$2.

Type: String
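
Both variants, the existence check and the two-group extraction, can be tried on the sample tag. The sketch below assumes Python's re module; the boolean and the formatted string stand in for Botify's "Check if exists" result and the $1/$2 output format.

```python
import re

snippet = "ga('create', 'UA-384998461-2', 'mywebsite.com');"

# "Check if exists": True when the literal tag is found.
tag_exists = re.search(
    r"ga\('create', 'UA-384998461-2', 'mywebsite.com'\);", snippet
) is not None

# Extraction with two capturing groups: tracking ID and site domain.
match = re.search(r"ga\('create', '(UA-[\d-]+)', '([\w\.]+)'\);", snippet)
simple_output = f"{match.group(1)}/{match.group(2)}"               # like $1/$2
labeled_output = f"id:{match.group(1)}/domain:{match.group(2)}"    # like id:$1/domain:$2

print(tag_exists, simple_output, labeled_output)
```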

Troubleshooting

The following are the most common mistakes when writing regular expressions.

Failing to Escape Characters

When you want to match a character that has a special meaning in regular expression syntax, you must add the escape character (\) before it.

Common Examples:

  • . (full stop/period) : In a regular expression, . matches any character. To look for a literal ., enter \.

  • ( and ) : These define groups. To look for parentheses in your page code, enter \( and \).
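
The effect of a missing escape is easy to demonstrate. This sketch assumes Python's re module; the HTML is invented, and re.escape is a Python helper (not a Botify feature) that adds the backslashes for you.

```python
import re

html = "<p>Price: 1.50 (approx)</p> <p>Ref: 1850 approx</p>"

# Unescaped . matches any character, so "1850" is matched as well.
unescaped_dot = re.findall(r"1.50", html)
escaped_dot = re.findall(r"1\.50", html)

# Unescaped parentheses define a group instead of matching ( and ).
unescaped_parens = re.findall(r"(approx)", html)
escaped_parens = re.findall(r"\(approx\)", html)

# re.escape builds a pattern that matches the literal text only.
literal = "1.50 (approx)"
literal_found = re.search(re.escape(literal), html).group(0)

print(unescaped_dot, escaped_dot, unescaped_parens, escaped_parens, literal_found)
```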

Capturing Too Much

Regular expressions may capture more than you expect or need in the following situations:

  • Using a "greedy" quantifier (take as much as possible) when you need a "lazy" one (take as little as possible). For example, if nested HTML tags exist, a greedy match runs on to the last closing tag it can find (typically that of the outer element), while a lazy match stops at the first one (that of the inner element). Add ? after the quantifier (for example, .*? instead of .*) to make it lazy.

  • Using a regular expression that is not specific enough. For instance, a portion of the expression should only include digits, but you allowed any character, so the regex may match unwanted expressions.

  • Using a regular expression with too many optional parts: be careful with the * quantifier, which allows "zero or more" occurrences of the preceding character, and prefer + when at least one occurrence is expected.
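
The first two situations can be reproduced on a small fragment. This is a sketch assuming Python's re module matches Botify's greedy and lazy behavior; the HTML fragments are made up.

```python
import re

html = ('<div class="a"><span>first</span></div>'
        '<div class="b"><span>second</span></div>')

# Greedy: .* runs to the LAST </span> on the page.
greedy = re.search(r"<span>(.*)</span>", html).group(1)

# Lazy: .*? stops at the FIRST </span>.
lazy = re.search(r"<span>(.*?)</span>", html).group(1)

# Not specific enough: .+? accepts text where only digits were intended.
counter = '<span class="count">n/a</span>'
too_loose = re.search(r'<span class="count">(.+?)</span>', counter)
digits_only = re.search(r'<span class="count">(\d+)</span>', counter)

print(repr(greedy))
print(repr(lazy))
print(too_loose.group(1), digits_only)
```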

Capturing Too Little

To avoid regular expressions that capture less than you expect, make sure the expression contains at least one required part alongside any optional parts. If everything in your regular expression is optional, it can match an empty string, so the first result may be empty even though a non-empty match exists further down the page.
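
The empty-match problem looks like this in practice. The sketch assumes Python's re module; the sample text is invented.

```python
import re

text = "Total: 42 items"

# Everything is optional: the first match is an empty string at position 0,
# even though digits appear later in the text.
all_optional = re.search(r"\d*", text)

# One required digit forces the search to move on to the real value.
with_required = re.search(r"\d+", text)

print(repr(all_optional.group(0)), all_optional.start())
print(repr(with_required.group(0)), with_required.start())
```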

