12.4 Entering a Regex in R

We now return to the R language and consider how to apply regular expressions within it.

12.4.1 String to Regex

Regular expressions actually play a role in one of the functions you already know, namely the function str_split(). Recall that you can use the pattern parameter to specify the sub-string that separates the strings you want to split up. It works like this:

"hello there Mary Poppins" %>% 
  str_split(pattern = " ") %>% 
  unlist()
## [1] "hello"   "there"   "Mary"    "Poppins"

The task of splitting would appear to be quite challenging if the words are separated in more complex ways, with any amount of white-space. Consider, for example:

myString <- "hello\t\tthere\n\nMary  \t Poppins"
cat(myString)
## hello        there
## 
## Mary      Poppins

But really it’s not any more difficult, because the pattern parameter actually takes the string it is given and converts it to a regular expression, splitting on anything that matches. Watch this:

myString %>% 
  str_split(pattern = "\\s+") %>% 
  unlist()
## [1] "hello"   "there"   "Mary"    "Poppins"

We can almost see how this works. Recall from the last section that the regex \s is a character class shortcut for any white-space character, so \s+ stands for one or more white-spaces in succession: precisely the mixtures of tab, spaces and newlines that separated the words in our string. str_split() must be splitting on matches to the regex \s+.

So why did we set pattern = "\\s+"? What’s with the extra backslash?

The reason is that the argument passed with pattern is a string, not a regular expression object. It starts out life, as it were, as a string, and R converts it to a regular expression, then hands the regex over to its regular expression engine to locate the matches in myString that in turn determine how myString is to be split up. Since in R’s string-world “\s” is not a recognized escape sequence in the way that newline (“\n”), tab (“\t”) and other control-characters are, R won’t accept “\s+” as a valid string. Try it for yourself:

myString %>% 
  str_split(pattern = "\s+")
## Error: '\s' is an unrecognized escape in character string starting ""\s"

It follows that when you enter regular expressions as strings in R, you’ll have to remember to escape the back-slashed tokens that are used in a regular expression. Table 12.3 gives several examples of this.

Table 12.3: Examples of entry of regular expressions as strings, in R.
Regular Expression    Entered as String
\s+                   "\\s+"
find\.dot             "find\\.dot"
^\w*\d{1,3}$          "^\\w*\\d{1,3}$"

Keeping in mind the need for an occasional additional escape, it should not be too difficult for you to enter regular expressions in R.
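One way to check what the regex engine will actually receive is to print the string with writeLines(), which shows the characters stored in the string rather than the way you typed them. The sketch below also assumes R 4.0 or later, where a raw string of the form r"(...)" lets you type the regex without doubling the backslash. The variable names are only for illustration.

```r
library(stringr)

# The regex \s+ as typed in an ordinary R string, with the backslash escaped.
typed <- "\\s+"

# writeLines() shows the characters actually stored in the string: \s+
writeLines(typed)

# The stored string has three characters: a backslash, an s and a plus.
nchar(typed)
## [1] 3

# Assumes R >= 4.0: a raw string needs no extra backslash.
rawPattern <- r"(\s+)"
identical(typed, rawPattern)
## [1] TRUE

# Either form splits the string in the same way.
unlist(str_split("hello\t\tthere\n\nMary  \t Poppins", pattern = rawPattern))
## [1] "hello"   "there"   "Mary"    "Poppins"
```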

12.4.2 Substitution

One of the most useful applications of regular expressions is in substitution. Suppose that we have a vector of dates:

dates <- c("3 - 14 - 1963", "4/13/ 2005",
           "12-1-1997", "11 / 11 / 1918")

It seems that the folks who entered the dates were not consistent in how to format them. In order to make analysis easier, it would be better if all the dates had exactly the same format. With the function str_replace_all() and regular expressions, this is not difficult:

dates %>% 
  str_replace_all(pattern = "[- /]+",
                  replacement = "/")
## [1] "3/14/1963"  "4/13/2005"  "12/1/1997"  "11/11/1918"

Here:

  • string (not explicitly seen above, due to the piping) is the text in which the substitution occurs;
  • pattern is the regex for the type of sub-string we want to replace;
  • replacement is what we want to replace matches of the pattern with.
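To see why the quantifier + matters in the pattern [- /]+, it may help to compare the result with what happens when the + is left out. The following is a small sketch using the same dates vector; the names noPlus and withPlus are made up for the comparison.

```r
library(stringr)

dates <- c("3 - 14 - 1963", "4/13/ 2005",
           "12-1-1997", "11 / 11 / 1918")

# Without the +, every separator character is replaced one at a time,
# so a run such as " - " turns into three slashes.
noPlus <- str_replace_all(dates, pattern = "[- /]", replacement = "/")
noPlus
## [1] "3///14///1963"  "4/13//2005"     "12/1/1997"      "11///11///1918"

# With the +, each run of separator characters is replaced as a single unit.
withPlus <- str_replace_all(dates, pattern = "[- /]+", replacement = "/")
withPlus
## [1] "3/14/1963"  "4/13/2005"  "12/1/1997"  "11/11/1918"
```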

The all in the name of the function means that we want to replace all occurrences of the pattern with the replacement text. There is also a str_replace() function that replaces only the first match (if any) that it finds in each string:

dates %>% 
  str_replace(pattern = "[- /]+",
              replacement = "/")
## [1] "3/14 - 1963"  "4/13/ 2005"   "12/1-1997"    "11/11 / 1918"

In our application, that’s certainly NOT what we need. However, in cases where you happen to know that there will be at most one match, str_replace() gets the job done faster than str_replace_all(), which is forced to search through the entire string.

12.4.3 Patterned Replacement

In the dates example from Section 12.4.2 the replacement string (the argument for the parameter replacement) was constant: no matter what sort of match we found for the pattern [- /]+, we replaced it with the string “/”. It is important to note, however, that the argument provided for replacement can cause the replacement to vary depending upon the match found. In particular:

  • It can include the back-references \1, \2, …, \9.
  • It can be a defined function of the match.

Let’s look at an example. Here is a function that, given a string, will double all of the vowels that it finds:

doubleVowels <- function(str) {
  str %>% 
    str_replace_all(pattern = "([aeiou])", 
                    replacement = "\\1\\1")
}
doubleVowels("Far and away the best!")
## [1] "Faar aand aawaay thee beest!"

Note that the pattern [aeiou] for vowels had to be enclosed in parentheses so that it could be captured and referred to by the back-reference \1. Also note that the replacement is entered as an R string, so the backslash in each back-reference must itself be escaped, just as when a regular expression is entered as a string.
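Back-references become more useful still when the pattern has more than one capture group, because the replacement can then rearrange the captured pieces. Here is a minimal sketch that turns names entered as "Last, First" into "First Last". The vector fullNames is invented for the illustration.

```r
library(stringr)

# Hypothetical names, entered as "Last, First".
fullNames <- c("Poppins, Mary", "Banks, George")

# Capture 1 is the last name and capture 2 is the first name;
# the replacement writes them back in the opposite order.
swapped <- str_replace(fullNames,
                       pattern = "(\\w+),\\s*(\\w+)",
                       replacement = "\\2 \\1")
swapped
## [1] "Mary Poppins" "George Banks"
```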

Here is a function to capitalize every vowel found:

capVowels <- function(str) {
  str %>% 
    str_replace_all(pattern = "[aeiou]", 
                    replacement = function(x) str_to_upper(x))
}
capVowels("Far and away the best!")
## [1] "FAr And AwAy thE bEst!"

Here is another function that searches for repeated words and encloses each pair in asterisks:

starRepeats <- function(str) {
  str %>% 
    str_replace_all(pattern = "\\b(\\w+) \\1\\b",
                    replacement = function(x) {
                      str_c("*", x, "*")
                    }
                    )
}
starRepeats("I have a boo boo on my knee knee.")
## [1] "I have a *boo boo* on my *knee knee*."
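A replacement function is handy whenever the new text has to be computed from the match, not merely rearranged. As a further sketch, here is a hypothetical function doubleNumbers() that finds each run of digits and replaces it with twice its value. The function passed as replacement receives the matched text as a string and should return a string, which is why the result is converted with as.character().

```r
library(stringr)

# Hypothetical helper: double every whole number found in a string.
doubleNumbers <- function(str) {
  str_replace_all(
    str,
    pattern = "\\d+",
    # x holds the matched digits as text; convert, double, convert back.
    replacement = function(x) as.character(2 * as.numeric(x))
  )
}

doubleNumbers("3 apples and 12 pears")
## [1] "6 apples and 24 pears"
```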

12.4.4 Detecting Matches

If you have many strings—in a character-vector, say—and you want to select those that contain a match to a particular pattern, then you want to use str_subset().

Consider, for example, the vector of strings:

sentences <- c("My name is Tom, Sir",
               "And I'm Tulip!",
               "Whereas my name is Lester.")

If we would like to find the strings that contain a word beginning with capital T, we could proceed as follows:

sentences %>% 
  str_subset(pattern = "\\bT\\w*\\b")
## [1] "My name is Tom, Sir" "And I'm Tulip!"

str_subset() returns a vector consisting of the elements of the vector sentences where the string contains at least one word beginning with “T”.

A related function is str_detect():

sentences %>% 
  str_detect(pattern = "\\bT\\w*\\b")
## [1]  TRUE  TRUE FALSE

str_detect() returns a logical vector with TRUE where sentences has a capital-T word, FALSE otherwise.

Finally, str_locate() gives the positions in each string where the first match begins and ends:

sentences %>% 
  str_locate(pattern = "\\bT\\w*\\b")
##      start end
## [1,]    12  14
## [2,]     9  13
## [3,]    NA  NA
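The three functions fit together in a natural way. The sketch below, which uses the same sentences vector, shows that subsetting with the logical vector from str_detect() gives the same result as str_subset(), and that the positions reported by str_locate() can be handed to str_sub() to pull out the matched text. The names tWord, locs and tWords are chosen only for this example.

```r
library(stringr)

sentences <- c("My name is Tom, Sir",
               "And I'm Tulip!",
               "Whereas my name is Lester.")
tWord <- "\\bT\\w*\\b"

# Subsetting with str_detect() is equivalent to calling str_subset().
identical(sentences[str_detect(sentences, tWord)],
          str_subset(sentences, tWord))
## [1] TRUE

# The start and end columns from str_locate() can be passed to str_sub().
locs <- str_locate(sentences, tWord)
tWords <- str_sub(sentences, start = locs[, "start"], end = locs[, "end"])
tWords
## [1] "Tom"   "Tulip" NA
```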

12.4.5 Extracting Matches

If you require what is actually matched within each string you are processing, then you should look into str_extract() and str_extract_all().

As an example, let’s extract pairs of words beginning with the same letter in sentences2 defined below:

sentences2 <- c("The big bad wolf is walking warily to the cottage.",
                "He huffs and he puffs peevishly.",
                "He wears gnarly gargantuan bell bottoms!")
sentences2 %>% 
  str_extract(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
## [1] "big bad"           "puffs peevishly"   "gnarly gargantuan"

The results are returned as a character vector, in which each element is the first matching pair in the corresponding sentence.

If we want all of the matches in each sentence, then we use str_extract_all():

sentences2 %>% 
  str_extract_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
## [[1]]
## [1] "big bad"        "walking warily" "to the"        
## 
## [[2]]
## [1] "puffs peevishly"
## 
## [[3]]
## [1] "gnarly gargantuan" "bell bottoms"
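Because str_extract_all() returns a list, it is often convenient to summarize it. The sketch below counts the matches per sentence with lengths(), and also uses the simplify argument of str_extract_all(), which returns a character matrix padded with empty strings instead of a list. The names pairPattern, pairs and pairMatrix are made up for the example.

```r
library(stringr)

sentences2 <- c("The big bad wolf is walking warily to the cottage.",
                "He huffs and he puffs peevishly.",
                "He wears gnarly gargantuan bell bottoms!")
pairPattern <- "\\b(\\w)\\w*\\W+\\1\\w*"

# One list element per sentence; lengths() counts the matches in each.
pairs <- str_extract_all(sentences2, pattern = pairPattern)
lengths(pairs)
## [1] 3 1 2

# simplify = TRUE gives a matrix with one row per sentence;
# sentences with fewer matches are padded with "".
pairMatrix <- str_extract_all(sentences2, pattern = pairPattern,
                              simplify = TRUE)
dim(pairMatrix)
## [1] 3 3
pairMatrix[2, ]
## [1] "puffs peevishly" ""                ""
```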

Sometimes we want even more information. Suppose, for example, that we want not only the first matching word-pair, but also the repeated initial letter that permitted the match in the first place. In that case we need str_match():

sentences2 %>% 
  str_match(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
##      [,1]                [,2]
## [1,] "big bad"           "b" 
## [2,] "puffs peevishly"   "p" 
## [3,] "gnarly gargantuan" "g"

str_match() returns a matrix, each row of which corresponds to an element of sentences2. The first column gives the value of the entire match, and the second column gives the value of the capture-group in the regular expression. If the regular expression had used more capture groups, then the matrix would have had additional columns showing the values of the captures, in order.

If you want an analysis of all the matches in a string, then use str_match_all():

sentences2 %>% 
  str_match_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
## [[1]]
##      [,1]             [,2]
## [1,] "big bad"        "b" 
## [2,] "walking warily" "w" 
## [3,] "to the"         "t" 
## 
## [[2]]
##      [,1]              [,2]
## [1,] "puffs peevishly" "p" 
## 
## [[3]]
##      [,1]                [,2]
## [1,] "gnarly gargantuan" "g" 
## [2,] "bell bottoms"      "b"

The returned structure is a list and hence more complex, but you can query it for the values you need.
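As one illustration of such a query, the following sketch uses lapply() and sapply() to pull the captured initial letters out of each matrix and to count the matching pairs in each sentence. The names matches, initials and pairCounts are hypothetical.

```r
library(stringr)

sentences2 <- c("The big bad wolf is walking warily to the cottage.",
                "He huffs and he puffs peevishly.",
                "He wears gnarly gargantuan bell bottoms!")
matches <- str_match_all(sentences2, pattern = "\\b(\\w)\\w*\\W+\\1\\w*")

# Column 2 of each matrix holds the captured initial letter.
initials <- lapply(matches, function(m) m[, 2])
initials[[1]]
## [1] "b" "w" "t"

# The number of rows in each matrix is the number of matches found.
pairCounts <- sapply(matches, nrow)
pairCounts
## [1] 3 1 2
```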

12.4.6 Extraction in Data Frames

Quite often you will want to manipulate strings in the context of working with a data frame. For this the regex functions we have examined so far will be quite useful, but you should also know about the extract() function from the tidyr package, which is among the packages attached by the tidyverse.

Imagine a data table that contains some names and phone numbers:

people <- data.frame(
  name = c("Lauf, Bettina", "Bachchan, Abhishek", "Jones,  Jenna"),
  phone = c("(202) 415-3785", "4133372100", "310-231-4453")
)

Each person has a standard ten-digit phone number, consisting of:

  • the three-digit area code;
  • the three-digit central office number;
  • the four-digit line number.

Suppose we would like to create three new variables in the data table, one for each of the three components of the phone number. For this, tidyr::extract() comes in handy:

people %>% 
  tidyr::extract(col = phone,
          into = c("area", "office", "line"),
          regex = "(?x)        # for comments
                  .*           # in case of opening paren, etc.
                  (\\d{3})     # capture 1:  area code
                  .*           # possible separators
                  (\\d{3})     # capture 2:  central office
                  .*           # possible separators
                  (\\d{4})     # capture 3:  line number
                  ")
##                 name area office line
## 1      Lauf, Bettina  202    415 3785
## 2 Bachchan, Abhishek  413    337 2100
## 3      Jones,  Jenna  310    231 4453

By default extract() removes the original column, but you can preserve it with remove = FALSE. (For the format of the regular expression in the above call, see Section 12.4.8 on regex modes.)
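Here is a sketch of the same extraction with remove = FALSE. As a variation that is not part of the original call, the separators are matched with \D* (any run of non-digit characters) in place of .*, which states more directly that only non-digits may sit between the three groups of digits. The name peopleSplit is made up for the example.

```r
library(tidyr)

people <- data.frame(
  name = c("Lauf, Bettina", "Bachchan, Abhishek", "Jones,  Jenna"),
  phone = c("(202) 415-3785", "4133372100", "310-231-4453")
)

# \D* allows any non-digit separators (or none) around each group of digits.
# remove = FALSE keeps the original phone column alongside the new ones.
peopleSplit <- tidyr::extract(
  people,
  col = phone,
  into = c("area", "office", "line"),
  regex = "\\D*(\\d{3})\\D*(\\d{3})\\D*(\\d{4})",
  remove = FALSE
)

names(peopleSplit)
## [1] "name"   "phone"  "area"   "office" "line"
peopleSplit$area
## [1] "202" "413" "310"
```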

12.4.7 Counting Matches

The function str_count() provides a very convenient way to tally up the number of matches that a given regex has in a string. Here we use it to count the number of words in each string that begin with a lower-case or upper-case p.

strings <- c("Mary Poppins is practically perfect in every way!",
             "The best-laid plans of mice and men gang oft astray.",
             "Peter Piper picked a peck of pickled peppers.")
strings %>%
  str_count(pattern = "\\b[Pp]\\w*\\b")
## [1] 3 1 6

How might we find the words in a string that contain three or more occurrences of the same letter? In this case str_count() would not be useful. However, we could try something like this:

"In Patagonia, the peerless Peter Piper picked a peck of pickled peppers." %>% 
  str_split("\\W+") %>% 
  unlist() %>% 
  str_subset(pattern = "([[:alpha:]]).*\\1.*\\1")
## [1] "Patagonia" "peerless"  "peppers"
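The splitting step matters here. Since .* can match spaces as well as letters, the pattern applied to an unsplit sentence can find its three repeated letters in three different words. The following sketch, with a made-up phrase, shows the difference.

```r
library(stringr)

threeSame <- "([[:alpha:]]).*\\1.*\\1"
phrase <- "a cat sat"  # hypothetical phrase: no single word has three a's

# On the whole phrase the pattern matches, because the three a's
# are spread over three different words.
wholeMatch <- str_detect(phrase, pattern = threeSame)
wholeMatch
## [1] TRUE

# After splitting into words, no individual word qualifies.
words <- unlist(str_split(phrase, "\\W+"))
perWord <- str_subset(words, pattern = threeSame)
perWord
## character(0)
```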

12.4.8 Regex Modes

If you have been practicing consistently with an online regex site, you will have noticed by now that a regex can be accompanied by various options. In most implementations they appear as letters after the closing regex delimiter, like this:

/regex/gm

Some of the most popular options are:

  • g: “global”, looking for all possible matches in the string;
  • i: “case-insensitive” mode, so that letter-characters in the regex match both their upper and lower-case versions;
  • m: “multiline” mode, so that the anchors ^ and $ match at the beginning and end of each line within the string, rather than only at the absolute beginning and end of the string;
  • x: “white-space” mode, where white-spaces in the regex are ignored unless they are escaped (useful for lining out the regex and inserting comments to explain its operation).

Since stringr has both global and non-global versions of regex functions you probably will not bother with g, but the other options—known technically as modes—can sometimes be useful.

If you would like to set modes to apply to your entire regex, insert it (or them) like this at the beginning of the expression:

(?im)t[aeiou]{1,3}$

In the example above, we are in both case-insensitive and multiline mode, and we are looking for t or T followed by 1, 2 or 3 vowels (upper or lower) at the end of any line in a (possibly) multiline string.
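To watch the two modes at work, you can try the regex on a small multiline string. The sketch below uses an invented three-line string and compares the result with what happens when one of the modes is dropped.

```r
library(stringr)

# A hypothetical string with three lines.
lines3 <- "Hello Tia\ncup of tea\nthe end"

# Both modes: t or T plus one to three vowels at the end of any line.
both <- str_extract_all(lines3, "(?im)t[aeiou]{1,3}$")[[1]]
both
## [1] "Tia" "tea"

# Multiline only: the capital T in "Tia" no longer matches.
multilineOnly <- str_extract_all(lines3, "(?m)t[aeiou]{1,3}$")[[1]]
multilineOnly
## [1] "tea"

# Case-insensitive only: $ now refers to the end of the whole string,
# and the string ends in "end", so nothing matches.
caseOnly <- str_extract_all(lines3, "(?i)t[aeiou]{1,3}$")[[1]]
caseOnly
## character(0)
```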

The following example combines the mode that ignores white-space with the mode that ignores case:

myPattern <-
  "(?xi)       # ignore whitespace (x) and ignore case (i)
  \\b          # assert a word-boundary
  (\\w)        # capture the first letter of the first word
  \\w*         # rest of the first word
  \\W+         # one or more non-word characters
  \\1          # repeat the letter captured previously
  \\w*         # rest of the second word
  "
sentences2 %>% 
  str_match_all(pattern = myPattern)
## [[1]]
##      [,1]             [,2]
## [1,] "big bad"        "b" 
## [2,] "walking warily" "w" 
## [3,] "to the"         "t" 
## 
## [[2]]
##      [,1]              [,2]
## [1,] "He huffs"        "H" 
## [2,] "puffs peevishly" "p" 
## 
## [[3]]
##      [,1]                [,2]
## [1,] "gnarly gargantuan" "g" 
## [2,] "bell bottoms"      "b"

Due to the presence of the x-flag at the very beginning of the regex, the regex engine knows to ignore white-space throughout, and it will also ignore hash marks (#) and whatever comes after them on a line. This permits the placement of comments within the regular expression. The i-flag directs the regex engine to ignore case when looking for matches. Accordingly, in the example we pick up the extra match “He huffs”.

Some people prefer to control regular-expression modes by means of stringr’s regex() function:

myPattern <- regex(
  pattern = "
  \\b          # assert a word-boundary
  (\\w)        # capture the first letter of the first word
  \\w*         # rest of the first word
  \\W+         # one or more non-word characters
  \\1          # repeat the letter captured previously
  \\w*         # rest of the second word
  ",
  comments = TRUE,
  ignore_case = TRUE
)
sentences2 %>% 
  str_match_all(pattern = myPattern)
## [[1]]
##      [,1]             [,2]
## [1,] "big bad"        "b" 
## [2,] "walking warily" "w" 
## [3,] "to the"         "t" 
## 
## [[2]]
##      [,1]              [,2]
## [1,] "He huffs"        "H" 
## [2,] "puffs peevishly" "p" 
## 
## [[3]]
##      [,1]                [,2]
## [1,] "gnarly gargantuan" "g" 
## [2,] "bell bottoms"      "b"

12.4.9 Practice Exercises

The stringr package comes with fruit, a character vector giving the names of 80 fruits.

  1. Determine how many fruit-names consist of exactly two words.

  2. Find the two-word fruit-names.

  3. Find the indices of the two-word fruit names.

  4. Find the one-word fruit-names that end in “berry”.

  5. Find the fruit-names that contain more than three vowels.

  6. In the word “banana” the string “an” appears twice in succession, as does the string “na”. Find the fruit-names containing at least one string of length two or more that appears twice in succession.

  7. To the people data frame from this section, add two new variables: first for the first name and last for the last name. The original name variable should be removed.

12.4.10 Solutions to the Practice Exercises

  1. Try this:

    wordCount <- fruit %>% str_count("\\w+")
    sum(wordCount == 2)
    ## [1] 11
  2. Try this:

    fruit %>% .[str_count(., "\\w+") == 2]
    ##  [1] "bell pepper"       "blood orange"      "canary melon"     
    ##  [4] "chili pepper"      "goji berry"        "kiwi fruit"       
    ##  [7] "purple mangosteen" "rock melon"        "salal berry"      
    ## [10] "star fruit"        "ugli fruit"

    Another way is as follows:

    fruit %>% 
      str_subset("^\\w+\\s+\\w+$")
    ##  [1] "bell pepper"       "blood orange"      "canary melon"     
    ##  [4] "chili pepper"      "goji berry"        "kiwi fruit"       
    ##  [7] "purple mangosteen" "rock melon"        "salal berry"      
    ## [10] "star fruit"        "ugli fruit"
  3. Try this:

    fruit %>% 
      str_detect("^\\w+\\s+\\w+$") %>% 
      which()
    ##  [1]  5  9 13 17 32 42 66 72 73 75 79
  4. Try this:

    fruit %>% 
      str_subset("^\\w+berry$")
    ##  [1] "bilberry"    "blackberry"  "blueberry"   "boysenberry" "cloudberry" 
    ##  [6] "cranberry"   "elderberry"  "gooseberry"  "huckleberry" "mulberry"   
    ## [11] "raspberry"   "strawberry"
  5. Try this:

    vowelCount <- 
      fruit %>% 
      str_count("[aeiou]")
    fruit[vowelCount > 3]
    ##  [1] "avocado"           "blood orange"      "breadfruit"       
    ##  [4] "canary melon"      "cantaloupe"        "cherimoya"        
    ##  [7] "chili pepper"      "clementine"        "dragonfruit"      
    ## [10] "feijoa"            "gooseberry"        "grapefruit"       
    ## [13] "kiwi fruit"        "mandarine"         "nectarine"        
    ## [16] "passionfruit"      "pineapple"         "pomegranate"      
    ## [19] "purple mangosteen" "tamarillo"         "tangerine"        
    ## [22] "ugli fruit"        "watermelon"
  6. Try this:

    fruit %>% 
      str_subset("(\\w{2,})\\1")
    ## [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"     
    ## [6] "salal berry"
  7. Try this:

    people %>% 
      tidyr::extract(col = name,
              into = c("last", "first"),
              regex = "(\\w+)\\W+(\\w+)")
    ##       last    first          phone
    ## 1     Lauf  Bettina (202) 415-3785
    ## 2 Bachchan Abhishek     4133372100
    ## 3    Jones    Jenna   310-231-4453