12.3 Regex Syntax

Let’s start learning the syntax.

12.3.1 Matching a Specific Sequence

If you are searching for occurrences of one specific sequence of characters, the regular expression to use is just that sequence of characters.

For example, if your regex is bet, then you’ll get a match whenever the characters “b,” “e,” and “t” occur consecutively in the text you are searching.

In the sample text below there are two matches: the word “bet” and the first three letters of “between”:

I bet you are reading between the lines. Better to read the lines themselves.

We didn’t match “Bet” in “Better” because “B” is uppercase.
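The match described above can be checked in R. The following is a minimal sketch using base R's gregexpr() and regmatches() with perl = TRUE; the variable names are illustrative, and the choice of base R (rather than a package) is an assumption made so that the example runs without extra installation.

```r
# Sample sentence from the text
text <- "I bet you are reading between the lines. Better to read the lines themselves."

# gregexpr() locates every match; regmatches() extracts the matched substrings
hits <- regmatches(text, gregexpr("bet", text, perl = TRUE))[[1]]
print(hits)

# Starting position of each match within the sentence
starts <- as.integer(gregexpr("bet", text, perl = TRUE)[[1]])
print(starts)
```

The capitalized “Bet” in “Better” does not appear among the results, because literal characters are case-sensitive.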

The characters “b,” “e,” and “t” in the regex bet are examples of literal characters. This means that they stand for exactly what they are: the “b” in the expression matches a “b” in the text we are searching, the “e” in the expression matches an “e” in the text, and so on.

There are a lot of exceptions to our specific-sequence rule. We’ll get to them soon.

12.3.2 Character Classes

Suppose you want to match either “bet” or “Bet”? One way to do this is to use a character class. A character class consists of a set of characters surrounded by square brackets, and it tells the regex engine to match any one of the characters in the class.

Consider, for example, the regex [Bb]et. It consists of the character class [Bb] followed by the pair of literal characters et. In the text below it matches “bet,” the “bet” in “between,” and the “Bet” in “Better”:

I bet you are reading between the lines. Better to read the lines themselves.

Another example: t[aeiou] matches any two-character sequence in which “t” is followed by a lowercase vowel:

Get thee to a nunnery.
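Both character-class examples can be tried out in R. This is only a sketch, assuming base R with perl = TRUE; the variable names are made up for illustration.

```r
text1 <- "I bet you are reading between the lines. Better to read the lines themselves."
text2 <- "Get thee to a nunnery."

# [Bb] accepts either case for the first letter
bets <- regmatches(text1, gregexpr("[Bb]et", text1, perl = TRUE))[[1]]
print(bets)

# "t" followed by exactly one lowercase vowel
t_vowel <- regmatches(text2, gregexpr("t[aeiou]", text2, perl = TRUE))[[1]]
print(t_vowel)
```

In the second sentence only “to” qualifies: the “t” in “Get” is followed by a space, and the “t” in “thee” is followed by “h.”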

12.3.2.1 Ranges

You can match a range of characters. Inside a character class:

a-z represents all lowercase letters from a to z;
A-Z represents all of the uppercase letters;
0-9 represents all of the decimal digits: 0, 1, 2, …, 9.
Other ranges are possible, e.g.:
- c-f denotes the lowercase letters from c to f;
- 0-3 denotes 0, 1, 2 and 3.

Thus, in order to match any letter followed by two digits, you could use [a-zA-Z][0-9][0-9]:

Your room number is B43, not C4 or #39.
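As a quick check of the range example, here is a small R sketch (base R, perl = TRUE assumed; the name rooms is illustrative):

```r
text <- "Your room number is B43, not C4 or #39."

# One letter (either case) followed by exactly two digits
rooms <- regmatches(text, gregexpr("[a-zA-Z][0-9][0-9]", text, perl = TRUE))[[1]]
print(rooms)
```

“C4” fails because only one digit follows the letter, and “#39” fails because “#” is not a letter.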

12.3.2.2 The Need to Escape

Perhaps now you can spot the problem with the specific-sequence rule: what happens if one of the characters in the sequence is, say, “[” or “]”? These characters are examples of metacharacters, which means that in the syntax of regular expressions they don’t match themselves but instead have a special role. In the case of square brackets, that role is to delimit character classes.

If you want your pattern to include a metacharacter you will have to escape it with the backslash. Thus, the correct way to match the string “[aardvark]” would be with the regular expression \[aardvark\]:

The [aardvark] appears early in the dictionary.

Actually, you only need to escape a metacharacter when it acts in its role as metacharacter. For example, if you want to match “b-b,” you are fine to use b-b. Because the expression contains no square brackets, it’s not possible for the hyphen to act in its special role to set ranges, so it does not have to be escaped.

Inside of square brackets it can matter whether you escape the hyphen. Thus:

[a-c] matches a, b, and c;
[a\-c] matches a, -, and c (but not b);
[a-] matches a and - (the machine can tell that the hyphen was not being used in a range);
[a\-] also matches just a and -, not a, \ and -. (Apparently the regex syntax takes into account the fact that some folks will worry that they might have to escape the hyphen).
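The escaping rules for brackets and hyphens can be verified with a short R sketch. Two assumptions are worth flagging: the calls use perl = TRUE, because the treatment of a backslash inside square brackets differs between regex flavors, and each backslash of the regex is written twice because R string literals have their own escaping.

```r
# Regex \[aardvark\] is written "\\[aardvark\\]" in an R string literal
has_brackets <- grepl("\\[aardvark\\]",
                      "The [aardvark] appears early in the dictionary.",
                      perl = TRUE)

chars <- c("a", "b", "c", "-")

in_range    <- grepl("[a-c]",   chars, perl = TRUE)  # range: a, b and c
in_escaped  <- grepl("[a\\-c]", chars, perl = TRUE)  # escaped hyphen: a, - and c
in_trailing <- grepl("[a-]",    chars, perl = TRUE)  # trailing hyphen: a and -

print(rbind(in_range, in_escaped, in_trailing))
```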

Of course, since the backslash plays a role in escaping metacharacters, it too acts as a metacharacter at times. Hence if you want to match a backslash you’ll have to escape it! How? By preceding the backslash with a backslash! Thus:

a\\b matches “a\b”;
a\\\\b matches “a\\b.”

On the other hand:

a\tb matches a followed by a tab followed by b, because in this case the machine recognizes \t as the control character for a tab;
a\ab matches a followed by the bell-alert followed by b, and so on for other control characters.

But keep the following in mind:

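a\eb is not portable: some flavors (PCRE and ICU among them) read \e as the escape control character, while others, such as Python’s re module, reject it as an unknown escape. Not every backslash-letter pair is meaningful in every flavor.
And yet a\wb is correct regex syntax in all of the common flavors! It turns out that the token \w is a recognized character class shortcut that means the same as [a-zA-Z0-9_] (all of the lowercase and uppercase letters, the digits, and the underscore character). We’ll get to the shortcuts soon.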
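Backslashes are easy to miscount once the regex is typed into a programming language. The sketch below assumes base R with perl = TRUE and uses ordinary (non-raw) string literals, in which every backslash of the regex must itself be doubled; the variable names are illustrative.

```r
# The three-character string a\b (the backslash is doubled in R source code)
s <- "a\\b"
n_chars <- nchar(s)

# Regex a\\b (escaped backslash) is written "a\\\\b" in R
lit_backslash <- grepl("a\\\\b", s, perl = TRUE)

# Regex a\tb: "a", then a tab, then "b"
tab_match <- grepl("a\\tb", "a\tb", perl = TRUE)

# Regex a\wb: "a", then any word character, then "b"
word_match <- grepl("a\\wb", c("a_b", "a7b", "a-b"), perl = TRUE)

print(list(n_chars, lit_backslash, tab_match, word_match))
```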

12.3.2.3 Named Character Classes

Some character classes occur so commonly that they have been granted special names. Table 12.1 gives a few that are worth remembering.

Table 12.1: A few character classes worth remembering.
Class Name	Represents
[:alpha:]	a-zA-Z
[:alnum:]	a-zA-Z0-9
[:word:]	a-zA-Z0-9_ (note the underscore)
[:space:]	white space
[:lower:]	a-z
[:upper:]	A-Z

Here’s how you would use character class names in a regular expression:

t[[:alnum:]]t matches t followed by any alphanumeric character followed by t. (The double brackets are needed since, according to the rules, t[:alnum:]t would match t followed by any one of :, a, l, n, u or m, followed by t.)
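Here is a hedged R illustration of named classes inside square brackets (base R, perl = TRUE assumed; the test strings are invented for the example):

```r
candidates <- c("tat", "t5t", "t_t", "t t")

# Any alphanumeric character between the two t's
alnum_hit <- grepl("t[[:alnum:]]t", candidates, perl = TRUE)

# Any white-space character between the two t's
space_hit <- grepl("t[[:space:]]t", candidates, perl = TRUE)

print(data.frame(candidates, alnum_hit, space_hit))
```

The underscore in “t_t” is not alphanumeric, so neither pattern matches that string.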

12.3.2.4 Character Class Shortcuts

Some character classes are so very common that they merit extra-short shortcuts: most of these shortcuts begin with a backslash. We’ll call them character class shortcuts. Some of the most common character class shortcuts are shown in Table 12.2.

Table 12.2: A few character class shortcuts worth remembering.
Type	Represents
\d	any decimal digit (0-9)
\D	anything not a decimal digit
\s	any white space character
\S	anything not a white space character
\w	any word character (same as [[:word:]])
\W	anything not a word character
.	any character except newline

The . requires special care: if you are searching for a literal dot, you’ll have to escape it with a backslash. Thus 32\.456 matches only “32.456,” whereas 32.456 matches “32.456,” “32a456,” “32b456,” and so on.
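The difference between an escaped and an unescaped dot is easy to see in practice. The following R sketch assumes base R with perl = TRUE; the sample strings are invented.

```r
samples <- c("32.456", "32a456", "32456")

# Escaped dot: only a literal "." is accepted (regex 32\.456)
escaped_dot <- grepl("32\\.456", samples, perl = TRUE)

# Bare dot: any single character is accepted in that position
bare_dot <- grepl("32.456", samples, perl = TRUE)

# \d+ (one or more digits) pulls out the runs of digits in a string
note <- "Room 12, floor 3"
digit_runs <- regmatches(note, gregexpr("\\d+", note, perl = TRUE))[[1]]

print(list(escaped_dot, bare_dot, digit_runs))
```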

12.3.2.5 Negation in a Character Class

Suppose you would like to match any sequence of three characters of this form:

t
any character EXCEPT e, x and z
t

You can accomplish this by negating within a character class: the regex to use is t[^exz]t. Here the caret ^ functions as a metacharacter, indicating that any character except those listed in the class is permitted. In order to function as a negation, the ^ must appear immediately after the opening bracket. If it appears elsewhere in the class, then it’s a literal: it just stands for itself. (Outside of a character class the ^ functions as an anchor, which we’ll get to soon, and as such has to be escaped if you want it to act as a literal.)

As another example, t[^a-z]t matches t followed by any character except a lowercase letter, followed by t.

Negating within a particular class of characters can be tricky. For example, suppose you are looking for 3-character sequences that consist of t, any letter except e or E, and then t. Rather than trying to use a ^ it’s easiest to work with ranges: t[a-df-zA-DF-Z]t.
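The three patterns from this subsection can be compared side by side. This is a sketch under the assumption of base R with perl = TRUE; the test words are made up.

```r
words <- c("tat", "tet", "tEt", "t5t")

# Anything except e, x or z in the middle
not_exz <- grepl("t[^exz]t", words, perl = TRUE)

# Anything except a lowercase letter in the middle
not_lower <- grepl("t[^a-z]t", words, perl = TRUE)

# Any letter except e or E in the middle, built from ranges
letter_not_e <- grepl("t[a-df-zA-DF-Z]t", words, perl = TRUE)

print(data.frame(words, not_exz, not_lower, letter_not_e))
```

Note that t[^exz]t accepts “tEt” and “t5t”: the negated class excludes only the three listed characters, not uppercase letters or digits.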

12.3.3 Quantification

Suppose you want to match phone numbers, where the area code is included and the groups of digits are separated by hyphens, as in: 202-456-1111. Taking advantage of the character class shortcut for digits, you could use the following regex:

\d\d\d-\d\d\d-\d\d\d\d

But that’s a bit difficult to read. And besides, what if you weren’t working with phone numbers but instead wanted to match sequences like the following (which has 10 digits in succession)?

A2356737821

It would be awful to write A\d\d\d\d\d\d\d\d\d\d.

This is where quantifiers come in. In regular expression syntax a token of the form {n} indicates that we are looking for n consecutive copies of whatever token precedes the {n}. Thus we can write the phone-number regex more concisely and more legibly as:

\d{3}-\d{3}-\d{4}

Note that the curly braces { and } may now function as metacharacters, and as such may have to be escaped if you are looking for them specifically. Thus if you want to search for occurrences of “t{3}b,” you’ll need the regex t\{3\}b. On the other hand if you want to match “t{swim}b” then it’s fine to use the regex t{swim}b: the machine sees here that the braces don’t play a role in quantification.

Quantifiers are quite flexible. Within a quantifier you can use a comma to indicate a range for the permissible number of copies of the preceding expression:

t{3,5} matches 3, 4 or 5 t’s in succession: ttt, tttt, or ttttt;
t{3,} matches three or more t’s in succession: ttt, tttt, ttttt, … .
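A short R sketch of the brace quantifiers follows. It assumes base R with perl = TRUE; the second phone number and the string of t-runs are invented test data.

```r
# Phone-number pattern \d{3}-\d{3}-\d{4}
phone <- "\\d{3}-\\d{3}-\\d{4}"
phone_hit <- grepl(phone, c("202-456-1111", "202-45-1111"), perl = TRUE)

runs <- "tt ttt tttt tttttt"

# Between three and five t's in succession
t_3_to_5 <- regmatches(runs, gregexpr("t{3,5}", runs, perl = TRUE))[[1]]

# Three or more t's in succession
t_3_plus <- regmatches(runs, gregexpr("t{3,}", runs, perl = TRUE))[[1]]

print(list(phone_hit, t_3_to_5, t_3_plus))
```

In the run of six t's, t{3,5} stops after five and the one remaining “t” is too short to start a new match.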

Because quantification is so often required, regular expression syntax provides shortcuts for special cases:

t* matches 0 or more t’s;
t+ matches one or more t’s;
t? matches 0 or 1 t.

An example: to match the beginning of a URL, use https?://:

The URL https://example.org provides authentication of the site, whereas you should not pay with a credit card on http://flybynight.com. On the other hand htto://example.org is not a valid URL—it’s probably a typo.
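The optional “s” can be seen at work in a small R sketch (base R, perl = TRUE assumed; the sentence is a shortened version of the sample text):

```r
text <- "Use https://example.org, not http://flybynight.com; htto://example.org is a typo."

# "http", then an optional "s", then "://"
url_starts <- regmatches(text, gregexpr("https?://", text, perl = TRUE))[[1]]
print(url_starts)
```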

Note that these shortcuts introduce new metacharacters: *, + and ?. You’ll probably need to escape them when you want to search for them as literals outside of a character class.

12.3.4 Greedy vs. Lazy

When it comes to the open-ended quantifiers ({n,} or the shortcuts * and +), the default behavior is to make the longest match possible. This is known as greedy behavior. Consider the matches reported in the following text for b{3,}: each of the last three runs (“bbb,” “bbbb” and “bbbbb”) is matched in full:

b bb bbb bbbb bbbbb

It is possible, however, to tell a quantifier to be lazy, meaning that it should give the shortest possible matches. The way to do this is to append a ? to the quantifier. Applied to the same text above, the regex b{3,}? with the lazy quantifier reports a different set of (sometimes shorter) matches: only the first three b’s of each of the last three runs:

b bb bbb bbbb bbbbb
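The contrast between greedy and lazy matching is easiest to see by extracting the matches. A minimal sketch, assuming base R with perl = TRUE:

```r
text <- "b bb bbb bbbb bbbbb"

# Greedy: each match takes as many b's as it can
greedy <- regmatches(text, gregexpr("b{3,}", text, perl = TRUE))[[1]]

# Lazy: each match stops as soon as three b's have been found
lazy <- regmatches(text, gregexpr("b{3,}?", text, perl = TRUE))[[1]]

print(greedy)
print(lazy)
```

After a lazy match takes three b's from “bbbb” or “bbbbb,” the leftover one or two b's are too few to form another match.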

12.3.5 Grouping

A quantifier refers to the shortest meaningful item immediately preceding it. Look at these examples:

ab+ matches “a” followed by 1 or more “b”’s;
a\d+ matches “a” followed by 1 or more digits;
a[^bc]+ matches “a” followed by 1 or more occurrences of anything other than “b” or “c.”

In the examples above, any match begins with “a.” If we want to get “a” into the scope of the quantifier, we have to group it with the item immediately preceding the quantifier. Grouping is accomplished with parentheses. Consider these examples:

(ab)+ matches one or more occurrences of “ab” in succession;
(a\d{2,})+ matches one or more occurrences of “a” followed by at least two digits. Thus it matches “a23” and “a23a773.” It won’t match “a2a3.” It will match only the “a23” in “a23a6” and it will match only the “a567” in “a4a567.”

Grouping is a great help in constructing highly complex patterns. Bear in mind that the grouping symbols ( and ) function as metacharacters and will have to be escaped outside of character classes when you are searching for them as literals. Inside of character classes they function only as literals. Thus:

[(ab)*] matches any one of the following characters: “(,” “a,” “b,” “)” and “*.”
\(yes\) matches “(yes).”
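The grouping examples above can be checked in R. In this sketch find_all() is a hypothetical helper defined here for convenience (it is not part of any package), and base R with perl = TRUE is assumed.

```r
# Hypothetical helper: return every match of `pattern` found in `text`
find_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

ab_runs <- find_all("(ab)+", "ababab abb")

# Regex (a\d{2,})+ applied to the four strings discussed in the text
g1 <- find_all("(a\\d{2,})+", "a23a773")
g2 <- find_all("(a\\d{2,})+", "a2a3")
g3 <- find_all("(a\\d{2,})+", "a23a6")
g4 <- find_all("(a\\d{2,})+", "a4a567")

# Escaped parentheses match literal parentheses
paren_hit <- grepl("\\(yes\\)", "(yes)", perl = TRUE)

print(list(ab_runs, g1, g2, g3, g4, paren_hit))
```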

12.3.6 Alternation

In regex syntax the symbol | (the vertical “pipe”) functions as a metacharacter meaning or. Thus:

a|b matches “a” and it matches “b”;
bed|bath matches “bed” and it matches “bath”;
a|b|c matches any one of “a,” “b” and “c”;
(aa|bb)+ matches one or more occurrences of “aa” or “bb.” Thus it matches “aa,” “bb,” “aaaa,” “aabb,” “bbaa,” “bbbb” and so on.
(a(a|b)b)+ matches “aab,” “abb,” “aabaab,” “aababb,” and so on.

Use of the pipe-symbol is known as alternation.

In many implementations of regular expressions, alternation between single characters is slower than a character class, so use character classes instead whenever you can. For example, [a-e] is preferred over a|b|c|d|e.
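A few of the alternation patterns, tried out in R. As before this is only a sketch: find_all() is a hypothetical helper defined for the example, base R with perl = TRUE is assumed, and the test strings are invented.

```r
# Hypothetical helper: return every match of `pattern` found in `text`
find_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

bed_or_bath <- find_all("bed|bath", "bed and bath, bedtime")
doubled     <- find_all("(aa|bb)+", "aabb ab bbaa")
nested      <- find_all("(a(a|b)b)+", "aab abb aababb")

# A character class gives the same matches as single-character alternation
same_result <- identical(find_all("[a-e]", "a1b2z"),
                         find_all("a|b|c|d|e", "a1b2z"))

print(list(bed_or_bath, doubled, nested, same_result))
```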

12.3.7 Anchors

Sometimes we are interested in patterns that occur in specific places in text, such as:

at the beginning of a string;
at the end of a string;
at the beginning or end of a word.

Anchors help us accomplish this. The most important anchors to remember are:

^, which indicates (in technical terms, asserts) the beginning of a string;
$, which asserts the end of a string;³⁰
\b, which asserts the presence of a word-boundary of any type: a position where a word character sits next to a non-word character (a space, tab, comma, semicolon, etc.).

Here are some examples:

For ^Hello:

Hello got matched, but not the next Hello.

For Hello$:

Hello did not get matched, but there is a match for the next Hello

One thing that’s important to keep in mind about anchors is that they are assertions. This means that they don’t actually count for characters in a match; they merely assert the presence of something: the beginning or end of a string, the boundary of a word, etc. Thus, there are six tokens in the regular expression ^Hello, but only five characters (the letters in “Hello”) are involved in a match. The ^ merely asserts that a match must not only involve the given characters but must also occur at the beginning of the string.

What does ^Hello$ match? The rules would say that the string must start with H, continue on with e, l, l and then o, and end there, so you might think that the only possible string containing a match of ^Hello$ is the string “Hello” itself.

But that’s not quite right:

Go to your online regex practice site (Regular Expression 101).
Enter the regex hello$.
Then in the “Flags” dropdown menu, check “multiline.”
You should now see “/gm” at the end of the regular expression. You have entered multiline mode.
In the test-text field, enter “I say hello,” press Return, and continue on the next line with “Again I say hello.”
You’ll see that both hello’s match.

The reason for this is that in multiline mode $ stands for the end of each line, not just the absolute end of the string. From time to time you may deal with strings that run over multiple lines, so remember that if you want your ending anchors to represent end-of-line rather than the absolute end of the string, you’ll need to ask R to enter multiline mode. (Later on in the chapter we’ll discuss some common modes and how to enter them in R.)
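The same experiment can be reproduced in R. The sketch below assumes base R with perl = TRUE and switches on multiline mode with the inline flag (?m), which is one way PCRE-style engines allow it; this is not necessarily the approach taken later in the chapter.

```r
# Two lines of text joined by a newline
text <- "I say hello\nAgain I say hello"

# Default mode: $ asserts the end of the whole string
default_ends <- regmatches(text, gregexpr("hello$", text, perl = TRUE))[[1]]

# Multiline mode: $ asserts the end of each line
multiline_ends <- regmatches(text, gregexpr("(?m)hello$", text, perl = TRUE))[[1]]

# ^ asserts the beginning of the string
starts_with <- grepl("^Hello", c("Hello there", "Oh, Hello"), perl = TRUE)

print(list(length(default_ends), length(multiline_ends), starts_with))
```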

The word-boundary anchor \b is quite useful. Consider the regex bed applied to the string below:

bed bedtime perturbed

There are three matches! With the regex \bbed there are just two matches:

bed bedtime perturbed

With the regex \bbed\b the only match is with the actual word “bed”:

bed bedtime perturbed

Note that the beginning and the end of a string count as word-boundaries!
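To confirm the three counts, one can ask R where each pattern matches. A sketch assuming base R with perl = TRUE; match_starts() is a hypothetical helper written for this example.

```r
text <- "bed bedtime perturbed"

# Hypothetical helper: starting positions of all matches of `pattern` in `text`
match_starts <- function(pattern) {
  as.integer(gregexpr(pattern, text, perl = TRUE)[[1]])
}

anywhere   <- match_starts("bed")         # no anchors
word_start <- match_starts("\\bbed")      # regex \bbed
whole_word <- match_starts("\\bbed\\b")   # regex \bbed\b

print(list(anywhere, word_start, whole_word))
```

The match at position 1 relies on the beginning of the string counting as a word-boundary.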

12.3.8 Captures

How would you detect whether a particular instance of a pattern is repeated? For instance, suppose you are looking for occurrences of a word repeated immediately after itself with only a space in between, for example:

“bye bye birdie”
“she said night night”

The regex \b\w+\b (word boundary followed by one or more word characters followed by a word boundary) will match words like “bye” and “night,” but if you simply repeated the pattern, say: \b\w+\b \b\w+\b, then you match strings that don’t exhibit repetition, such as “bye hello” and “day night.”

What you want is for the first part of the regex to state your pattern—a word of one or more characters—then the space, and then something that represents exactly the match that occurs for the first pattern.

A capture accomplishes this. The regex you want is:

\b(\w+) \1\b

See what it matches in the phrase below:

now it is time for bed bed, yes it is bed bedtime

Here’s how the regex works:

The leading \b requires the presence of a word-boundary, which is satisfied by the presence, in the string, of the space between “for” and “bed bed.” The second “bed bed” is also OK at this point, due to the space between “is” and the first “bed.”
\w+ matches the first “bed,” and the parentheses make it a group. By default the regex captures the contents of whatever portion of the string matches a group, and remembers those contents for later use.
The space character in the regex matches the space between the two “bed” strings.
The \1 is a back-reference: it represents precisely what was matched in the earlier group. For the regex as a whole to produce a match, \1 has to see an exact repetition of whatever string matched the first parenthesis-group in the regex, so it has to see “bed.” At this point both occurrences of “bed bed” are still in the running to be matches for the entire regex.
The final \b asserts a word-boundary. This is satisfied by the comma after the first “bed bed,” but not by the “t” after the second “bed bed.” Thus only the first “bed bed” matches the regular expression as a whole.
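The walk-through above can be confirmed with a short R sketch (base R with perl = TRUE assumed; the name repeated is illustrative):

```r
text <- "now it is time for bed bed, yes it is bed bedtime"

# Regex \b(\w+) \1\b: a word, a space, then the same word again
repeated <- regmatches(text, gregexpr("\\b(\\w+) \\1\\b", text, perl = TRUE))[[1]]
print(repeated)
```

Only the first “bed bed” is reported; the second candidate is rejected by the final \b because “bedtime” continues with a word character.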

Back-references are denoted \1, \2, and so on, and you can use several of them in the same regex. For example, if you want to match expressions such as “big boat big boat” then use:

\b(\w+) (\w+) \1 \2\b

Think about how the above regex works:

To start, it requires the presence of a word-boundary.
It sets up a capture-group consisting of one or more word characters. Since this is the first set of parentheses, the group can be referenced later on by \1.
After a literal space, it then sets up a second capture-group that may be referenced later on by \2.
We then must see a space …
… followed by the contents of the first group …
… followed by the contents of the second group …
… at a word-boundary.

If you want to match a palindrome³¹ consisting of five characters (“abcba,” “x444x,” etc.) then use:

\b(\w)(\w)\w\2\1\b
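Both multi-group patterns can be tested against a few strings. This is a sketch assuming base R with perl = TRUE; the non-matching test strings are invented.

```r
# Regex \b(\w+) (\w+) \1 \2\b: a two-word phrase repeated
phrase_twice <- grepl("\\b(\\w+) (\\w+) \\1 \\2\\b",
                      c("big boat big boat", "big boat big ship"),
                      perl = TRUE)

# Regex \b(\w)(\w)\w\2\1\b: a five-character palindrome
palindrome5 <- grepl("\\b(\\w)(\\w)\\w\\2\\1\\b",
                     c("abcba", "x444x", "abcde"),
                     perl = TRUE)

print(phrase_twice)
print(palindrome5)
```

In the palindrome pattern the back-references appear in reverse order (\2 before \1), which is what forces the second half to mirror the first.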

12.3.9 Looking Around

Suppose that you have a string containing a number of words involving “bed,” and you would like to find all occurrences of “bed” that begin a word, except for the word “bedtime.” With the tools we have so far this is a difficult task. Fortunately there are look-aheads to simplify our work.

The regex \bbed(?!time\b) will do the job. Here’s how it works:

It begins by asserting a word-boundary.
It continues with the characters to match “bed.”
It concludes with a look-ahead group. The parentheses mark out the group. The initial ? indicates that we plan to look ahead. The ! may be thought of as “not equals”; it means that if we find the pattern that follows the ! we will not have a match.

Note the matches in the text below: the “bed” in “bedrock,” “bedrocking,” “bedsheets” and “bedding,” but not the “bed” in “bedtime” or “embedding”:

bedtime bedrock bedrocking bedsheets bedding embedding

Note that only “bed” is included in the match. Just like the anchor \b, the look-ahead is an assertion: it does not add any characters to the match.

If we want only the occurrences of “bed” where the word begins in “bed” and ends in either “rock” or “time,” then we could use the regex:

\bbed(?=rock\b|time\b)

Note the matches in the following text: only the “bed” in “bedtime” and the “bed” in “bedrock”:

bedtime bedrock bedrocking bedsheets bedding embedding

Of course we could locate the same occurrences with \bbed(rock|time)\b, but the matches would be the entire words, not just the “bed” portion.
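The look-ahead examples can be compared in R by looking at where each pattern matches. A sketch, assuming base R with perl = TRUE (look-arounds are not available in R's default regex engine); variable names are illustrative.

```r
text <- "bedtime bedrock bedrocking bedsheets bedding embedding"

# Negative look-ahead: word-initial "bed", unless the word is "bedtime"
neg <- gregexpr("\\bbed(?!time\\b)", text, perl = TRUE)[[1]]
neg_starts  <- as.integer(neg)
neg_lengths <- as.integer(attr(neg, "match.length"))

# Positive look-ahead: word-initial "bed" only in "bedrock" or "bedtime"
pos_starts <- as.integer(gregexpr("\\bbed(?=rock\\b|time\\b)", text, perl = TRUE)[[1]])

# Ordinary group: same places, but the whole word is part of the match
whole_words <- regmatches(text, gregexpr("\\bbed(rock|time)\\b", text, perl = TRUE))[[1]]

print(list(neg_starts, neg_lengths, pos_starts, whole_words))
```

Every look-ahead match has length 3, confirming that the assertion contributes no characters of its own.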

There are four types of look-around groups:

Positive look-ahead: regex1(?=regex2). Match when you find an instance of regex2 right after an instance of regex1.
Negative look-ahead: regex1(?!regex2). Match EXCEPT when you find an instance of regex2 right after an instance of regex1.
Positive look-behind: (?<=regex2)regex1. Match when you find an instance of regex2 right before an instance of regex1.
Negative look-behind: (?<!regex2)regex1. Match EXCEPT when you find an instance of regex2 right before an instance of regex1.

The regex2 expression in look-aheads can be any regex at all. For look-behinds, though, there are some important limitations. The precise restrictions differ from one flavor of regular expressions to another, but roughly the rule is that the machine has to be able to figure out in advance how many characters it might have to look behind. A group like (?<=time\w*sheets), for instance, would not be permitted, as the quantifier * allows matching strings of arbitrary length.
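Look-behinds with a fixed-length pattern work as described. The sketch below reuses the sample text from the look-ahead examples and assumes base R with perl = TRUE; the patterns (?<=em)bed and (?<!em)bed are illustrations chosen for this example rather than patterns from the text.

```r
text <- "bedtime bedrock bedrocking bedsheets bedding embedding"

# Positive look-behind: "bed" only when "em" comes right before it
after_em <- as.integer(gregexpr("(?<=em)bed", text, perl = TRUE)[[1]])

# Negative look-behind: "bed" except when "em" comes right before it
not_after_em <- as.integer(gregexpr("(?<!em)bed", text, perl = TRUE)[[1]])

print(after_em)
print(not_after_em)
```

Here the engine always knows it has to look back exactly two characters, which is why these look-behinds are permitted.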

12.3.10 More to Learn

We have not come near to exhausting the syntax of regular expressions. Readers who would like to delve into the subject more deeply should next consult online tutorials on the topics of non-capture groups, conditionals and more. It would also be good to look at the brief overview of the ICU regex engine provided in stringi-search-regex in the documentation for the stringi package on which stringr is based. However, we now have enough background to express some fairly complex patterns quite concisely, so it is now time to learn how to work with them in R.

³⁰ Well, actually it asserts the end of the string when we are not in “multiline mode.” In multiline mode it asserts the end of a line within a multiline string. This will be explained shortly.
³¹ Recall that a palindrome is a word that is the same when spelled backwards.