12.1 Motivation

Suppose you wish to determine how many times the string “ab” appears within some given string. You could write a function to perform this task.

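# count occurrences of "ab" in string (uses functions from the stringr package)
occurrences <- function(string) {
  count <- 0
  # seq_len() gives an empty loop for an empty string, unlike 1:0
  for (i in seq_len(str_length(string))) {
    # two-character window starting at position i
    stringPart <- str_sub(string, i, i + 1)
    if (stringPart == "ab") {
      count <- count + 1
    }
  }
  count
}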

Let’s try it out:

occurrences("yabbadabbadoo!")

## [1] 2

This looks right, as there are indeed exactly two occurrences of “ab”: one beginning at the second character and another beginning at the seventh character.
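One way to double-check these positions, assuming the stringr package is loaded, is to ask str_locate_all() where each match begins. This is only a sketch: fixed() requests literal (non-pattern) matching, and the variable name positions is ours.

```r
library(stringr)

# Locate every literal "ab"; the result is a list with one matrix per input string
positions <- str_locate_all("yabbadabbadoo!", fixed("ab"))[[1]]

# The "start" column holds the position where each match begins
positions[, "start"]
```

The two starting positions reported should be 2 and 7, in agreement with the count returned by occurrences().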

Suppose instead that we are interested in counting occurrences, in some arbitrary given string, of any of the following three strings:

“ab”
“Ab”
“foo”

How might we handle this task? Again we could write a function. This time we will generalize it a bit, allowing the user to input, along with the string to be searched, a vector of the sub-strings of interest.

# function to count occurrences of substrings in string.
# the substrings are given as a character vector of patterns
occurrences2 <- function(string, patterns) {
  count <- 0
  for (i in seq_len(str_length(string))) {
    for (j in seq_along(patterns)) {
      pattern <- patterns[j]
      len <- str_length(pattern)
      # window of the same length as the pattern, starting at position i
      stringPart <- str_sub(string, i, i + len - 1)
      if (stringPart == pattern) {
        count <- count + 1
      }
    }
  }
  count
}

We try out our function on the string “This Labrador is a fool, Abba.”, which matches each of our patterns exactly once, for a total of three matches:

occurrences2("This Labrador is a fool, Abba.",
           patterns = c("ab", "Ab", "foo"))

## [1] 3

Well and good, but the code is beginning to get a bit complex. What if we were searching instead for, say, sub-strings that resemble a phone number with an area code, i.e., strings of the form:

ddd-ddd-dddd

(Here the d’s represent digits from 0 to 9.)

There are \(10^{10}\) patterns of interest!²⁹ How would we go about describing them all to R?

Fortunately, regular expressions are there to help us out. A regular expression is a sequence of characters that represents a pattern that might or might not be present in any given string. A computer relies on a regular expression engine (a specific implementation of a system of regular expressions) to search text for matches to the pattern that a given regular expression represents.

In practice, regular expressions are like a miniature programming language within a programming language. They are a feature of most major programming languages, including R. With regular expressions we can describe complex string patterns concisely, and can search rapidly for these patterns in a given body of text.

The rules for regular expressions vary a bit from one language to another, but the general idea is essentially the same for all of them. In the remainder of this chapter we’ll learn enough of the principles of regular expressions to describe basic, useful patterns, and we’ll also study R functions that make use of them.

First of all, here’s a quick example to show the power of regular expressions. The work done by occurrences2() may also be done in one line with the str_count() function, as follows:

str_count(string = "This Labrador is a fool, Abba.",
         pattern = "[Aa]b|foo")

## [1] 3

Wow, that was quick. But what in the world is that "[Aa]b|foo" argument to the pattern parameter?

It’s a regular expression! It tells R to look for sub-strings that EITHER:

consist of either “A” or “a” followed by “b”, OR
consist of “foo”
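To see which sub-strings the expression actually picks out, and not just how many, we can sketch the same search with str_extract_all(). This assumes the stringr package is loaded; the variable names sentence and matches are ours.

```r
library(stringr)

sentence <- "This Labrador is a fool, Abba."

# Extract every sub-string matching the regular expression, in order of appearance
matches <- str_extract_all(sentence, "[Aa]b|foo")[[1]]
matches
```

The result should contain "ab" (from “Labrador”), "foo" (from “fool”) and "Ab" (from “Abba”), one for each of the three matches counted above.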

Clearly it is high time that we learn a bit of regular-expression syntax.
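As a preview of that syntax, here is a rough sketch of how the ten billion phone-number patterns of the form ddd-ddd-dddd might be described in a single regular expression. The sample text and the variable names are invented for illustration, and the pattern is deliberately simple: it would also match inside a longer run of digits and hyphens.

```r
library(stringr)

# Hypothetical sample text, made up for this example
text <- "Call 502-555-0137 or 859-555-0199; extension 4421 is not a phone number."

# \\d stands for any one digit; {3} and {4} repeat the preceding item 3 or 4 times
phone_pattern <- "\\d{3}-\\d{3}-\\d{4}"

# How many sub-strings have the form ddd-ddd-dddd?
str_count(text, phone_pattern)

# Which sub-strings are they?
str_extract_all(text, phone_pattern)[[1]]
```

Here the count should be 2, and the extracted sub-strings should be the two phone-number-like strings in the sample text.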

²⁹ There are 10 digits in a phone number, each of which could be chosen in 10 different ways. This results in \(10^{10}\), or ten billion, possibilities.