12.4 Entering a Regex in R
We now return to the R language and consider how to apply regular expressions within it.
12.4.1 String to Regex
Regular expressions actually play a role in one of the functions you already know, namely the function str_split(). Recall that you can use the pattern parameter to specify the sub-string that separates the strings you want to split up. It works like this:
"hello there Mary Poppins" %>%
str_split(pattern = " ") %>%
unlist()
## [1] "hello" "there" "Mary" "Poppins"
The task of splitting would appear to be quite challenging if the words are separated in more complex ways, with any amount of white-space. Consider, for example:
myString <- "hello\t\tthere\n\nMary \t Poppins"
cat(myString)
## hello there
##
## Mary Poppins
But really it's not any more difficult, because the pattern parameter actually takes the string it is given and converts it to a regular expression, splitting on anything that matches. Watch this:
myString %>%
  str_split(pattern = "\\s+") %>%
  unlist()
## [1] "hello" "there" "Mary" "Poppins"
We can almost see how this works. Recall from the last section that the regex \s is a character class shortcut for any white-space character, so \s+ stands for one or more white-spaces in succession: precisely the mixtures of tabs, spaces and newlines that separated the words in our string. str_split() must be splitting on matches to the regex \s+.
So why did we set pattern = "\\s+"? What's with the extra backslash?
The reason is that the argument passed with pattern is a string, not a regular expression object. It starts out life, as it were, as a string, and R converts it to a regular expression, then hands the regex over to its regular expression engine to locate the matches in myString that in turn determine how myString is to be split up. Since in R's string-world "\s" is not a recognized escape sequence in the way that newline ("\n"), tab ("\t") and other control-characters are, R won't accept "\s+" as a valid string. Try it for yourself:
myString %>%
  str_split(pattern = "\s+")
## Error: '\s' is an unrecognized escape in character string starting ""\s"
It follows that when you enter regular expressions as strings in R, you’ll have to remember to escape the back-slashed tokens that are used in a regular expression. Table 12.3 gives several examples of this.
Regular Expression | Entered as String |
---|---|
\s+ | "\\s+" |
find\.dot | "find\\.dot" |
^\w*\d{1,3}$ | "^\\w*\\d{1,3}$" |
Keeping in mind the need for an occasional additional escape, it should not be too difficult for you to enter regular expressions in R.
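To see what the extra backslash does, it can help to inspect the string that R actually stores. The following is a small sketch in base R; the variable names pattern and rawPattern are ours, and the raw-string syntax at the end assumes R 4.0.0 or later.

```r
# The literal "\\s+" is stored as three characters: a backslash, "s" and "+".
pattern <- "\\s+"
nchar(pattern)
## [1] 3

# writeLines() shows the characters as the regex engine will receive them.
writeLines(pattern)
## \s+

# Assumption: R >= 4.0.0, where a raw string lets you skip the doubling.
rawPattern <- r"(\s+)"
identical(pattern, rawPattern)
## [1] TRUE
```

The doubled backslash exists only in the source code you type; the regex engine sees the single-backslash form shown in the left column of the table.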
12.4.2 Substitution
One of the most useful applications of regular expressions is in substitution. Suppose that we have a vector of dates:
dates <- c("3 - 14 - 1963", "4/13/ 2005",
           "12-1-1997", "11 / 11 / 1918")
It seems that the folks who entered the dates were not consistent in how to format them. In order to make analysis easier, it would be better if all the dates had exactly the same format. With the function str_replace_all() and regular expressions, this is not difficult:
dates %>%
  str_replace_all(pattern = "[- /]+",
                  replacement = "/")
## [1] "3/14/1963" "4/13/2005" "12/1/1997" "11/11/1918"
Here:
- string (not explicitly seen above, due to the piping) is the text in which the substitution occurs;
- pattern is the regex for the type of sub-string we want to replace;
- replacement is what we want to replace matches of the pattern with.
The all in the name of the function means that we want to replace all occurrences of the pattern with the replacement text. There is also a str_replace() function that performs replacement only with the first match (if any) that it finds:
dates %>%
  str_replace(pattern = "[- /]+",
              replacement = "/")
## [1] "3/14 - 1963" "4/13/ 2005" "12/1-1997" "11/11 / 1918"
In our application, that's certainly NOT what we need. However, in cases where you happen to know that there will be at most one match, str_replace() gets the job done faster than str_replace_all(), which is forced to search through the entire string.
12.4.3 Patterned Replacement
In the dates example from Section 12.4.2 the replacement string (the argument for the parameter replacement) was constant: no matter what sort of match we found for the pattern [- /]+, we replaced it with the string "/". It is important to note, however, that the argument provided for replacement can cause the replacement to vary depending upon the match found. In particular:
- It can include the back-references \1, \2, …, \9.
- It can be a defined function of the match.
Let’s look at an example. Here is a function that, given a string, will double all of the vowels that it finds:
doubleVowels <- function(str) {
  str %>%
    str_replace_all(pattern = "([aeiou])",
                    replacement = "\\1\\1")
}
doubleVowels("Far and away the best!")
## [1] "Faar aand aawaay thee beest!"
Note that the pattern [aeiou] for vowels had to be enclosed in parentheses so that it could be captured and referred to by the back-reference \1. Also note that, since the replacement is entered as an R string, the backslash of each back-reference must itself be escaped, just as in regular expressions.
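Back-references become more interesting when there is more than one capture group. As a minimal sketch (the vector fullNames and the result flipped are names invented for this illustration), the following swaps the two captured words:

```r
library(stringr)

# Hypothetical sample data: names written as "Last, First".
fullNames <- c("Lauf, Bettina", "Jones, Jenna")

# Capture group 1 is the last name, group 2 the first name;
# the replacement emits them in the opposite order.
flipped <- str_replace(fullNames,
                       pattern = "(\\w+), (\\w+)",
                       replacement = "\\2 \\1")
flipped
## [1] "Bettina Lauf" "Jenna Jones"
```

The comma and space in the pattern sit outside the parentheses, so they are matched and replaced but not captured.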
Here is a function to capitalize every vowel found:
capVowels <- function(str) {
  str %>%
    str_replace_all(pattern = "[aeiou]",
                    replacement = function(x) str_to_upper(x))
}
capVowels("Far and away the best!")
## [1] "FAr And AwAy thE bEst!"
Here is another function that searches for repeated words and encloses each pair in asterisks:
starRepeats <- function(str) {
  str %>%
    str_replace_all(pattern = "\\b(\\w+) \\1\\b",
                    replacement = function(x) {
                      str_c("*", x, "*")
                    }
    )
}
starRepeats("I have a boo boo on my knee knee.")
## [1] "I have a *boo boo* on my *knee knee*."
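A replacement function is not limited to string operations: it may compute with the match before handing back text. The sketch below uses a hypothetical helper, doubleNumbers(), and assumes stringr 1.5.0 or later, where a function is accepted as the replacement.

```r
library(stringr)

# Assumption: stringr >= 1.5.0, where replacement may be a function.
# The function is passed a character vector of matches and should
# return a character vector of the same length.
doubleNumbers <- function(str) {
  str %>%
    str_replace_all(pattern = "\\d+",
                    replacement = function(x) as.character(2 * as.numeric(x)))
}
doubleNumbers("3 apples and 12 pears")
## [1] "6 apples and 24 pears"
```

The conversion back with as.character() matters, because the result has to be spliced into a string.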
12.4.4 Detecting Matches
If you have many strings (in a character-vector, say) and you want to select those that contain a match to a particular pattern, then you want to use str_subset().
Consider, for example, the vector of strings:
sentences <- c("My name is Tom, Sir",
               "And I'm Tulip!",
               "Whereas my name is Lester.")
If we would like to find the strings that contain a word beginning with capital T, we could proceed as follows:
sentences %>%
  str_subset(pattern = "\\bT\\w*\\b")
## [1] "My name is Tom, Sir" "And I'm Tulip!"
str_subset() returns a vector consisting of the elements of the vector sentences where the string contains at least one word beginning with "T".
A related function is str_detect():
sentences %>%
  str_detect(pattern = "\\bT\\w*\\b")
## [1] TRUE TRUE FALSE
str_detect() returns a logical vector with TRUE where sentences has a capital-T word, FALSE otherwise.
Finally, str_locate() gives the positions in each string where the first match begins and ends:
sentences %>%
  str_locate(pattern = "\\bT\\w*\\b")
## start end
## [1,] 12 14
## [2,] 9 13
## [3,] NA NA
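The three functions are closely related, as the following sketch tries to show. The vector is a copy of the one above under the hypothetical name greetings (so that it does not mask the sentences data set that ships with stringr), and capT is our own name for the pattern.

```r
library(stringr)

greetings <- c("My name is Tom, Sir",
               "And I'm Tulip!",
               "Whereas my name is Lester.")
capT <- "\\bT\\w*\\b"

# Indexing by the logical vector from str_detect() reproduces str_subset().
identical(greetings[str_detect(greetings, capT)],
          str_subset(greetings, capT))
## [1] TRUE

# str_which() gives the positions of the matching elements.
str_which(greetings, capT)
## [1] 1 2

# The start and end columns from str_locate() can be passed to str_sub()
# to pull out the matched text itself.
loc <- str_locate(greetings, capT)
str_sub(greetings[1:2], loc[1:2, "start"], loc[1:2, "end"])
## [1] "Tom"   "Tulip"
```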
12.4.5 Extracting Matches
If you require what is actually matched within each string you are processing, then you should look into str_extract() and str_extract_all().
As an example, let's extract pairs of words beginning with the same letter in sentences2 defined below:
sentences2 <- c("The big bad wolf is walking warily to the cottage.",
                "He huffs and he puffs peevishly.",
                "He wears gnarly gargantuan bell bottoms!")
sentences2 %>%
  str_extract(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
## [1] "big bad" "puffs peevishly" "gnarly gargantuan"
The results are returned as a character vector, in which each element is the first matching pair in the corresponding sentence.
If we want all of the matches in each sentence, then we use str_extract_all():
sentences2 %>%
  str_extract_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
## [[1]]
## [1] "big bad" "walking warily" "to the"
##
## [[2]]
## [1] "puffs peevishly"
##
## [[3]]
## [1] "gnarly gargantuan" "bell bottoms"
Sometimes we want even more information. Suppose, for example, that we want not only the first matching word-pair, but also the repeated initial letter that permitted the match in the first place. In that case we need str_match():
sentences2 %>%
  str_match(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
## [,1] [,2]
## [1,] "big bad" "b"
## [2,] "puffs peevishly" "p"
## [3,] "gnarly gargantuan" "g"
str_match() returns a matrix, each row of which corresponds to an element of sentences2. The first column gives the value of the entire match, and the second column gives the value of the capture-group in the regular expression. If the regular expression had used more capture groups, then the matrix would have had additional columns showing the values of the captures, in order.
If you want an analysis of all the matches in a string, then use str_match_all():
sentences2 %>%
  str_match_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
## [[1]]
## [,1] [,2]
## [1,] "big bad" "b"
## [2,] "walking warily" "w"
## [3,] "to the" "t"
##
## [[2]]
## [,1] [,2]
## [1,] "puffs peevishly" "p"
##
## [[3]]
## [,1] [,2]
## [1,] "gnarly gargantuan" "g"
## [2,] "bell bottoms" "b"
The returned structure is a list and hence more complex, but you can query it for the values you need.
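One way to query such a list is to apply a small function to each matrix in turn. The sketch below uses base R's lapply() and sapply() on a shortened, hypothetical input (texts) so that the results are easy to check by eye.

```r
library(stringr)

# Hypothetical shortened input.
texts <- c("The big bad wolf is walking warily.",
           "He wears bell bottoms!")
matches <- str_match_all(texts, pattern = "\\b(\\w)\\w*\\W+\\1\\w*")

# Column 2 of each matrix holds the captured initial letter.
initials <- lapply(matches, function(m) m[, 2])
initials[[1]]
## [1] "b" "w"
initials[[2]]
## [1] "b"

# The number of rows in each matrix is the number of matches per string.
sapply(matches, nrow)
## [1] 2 1
```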
12.4.6 Extraction in Data Frames
Quite often you will want to manipulate strings in the context of working with a data frame. For this, the regex functions we have examined so far will be quite useful, but you should also know about the extract() function from the tidyr package, which is among the packages attached by the tidyverse.
Imagine a data table that contains some names and phone numbers:
people <- data.frame(
  name = c("Lauf, Bettina", "Bachchan, Abhishek", "Jones, Jenna"),
  phone = c("(202) 415-3785", "4133372100", "310-231-4453")
)
Each person has a standard ten-digit phone number, consisting of:
- the three-digit area code;
- the three-digit central office number;
- the four-digit line number.
Suppose we would like to create three new variables in the data table, one for each of the three components of the phone number. For this, tidyr::extract() comes in handy:
people %>%
  tidyr::extract(col = phone,
                 into = c("area", "office", "line"),
                 regex = "(?x)       # for comments
                          .*         # in case of opening paren, etc.
                          (\\d{3})   # capture 1: area code
                          .*         # possible separators
                          (\\d{3})   # capture 2: central office
                          .*         # possible separators
                          (\\d{4})   # capture 3: line number
                 ")
## name area office line
## 1 Lauf, Bettina 202 415 3785
## 2 Bachchan, Abhishek 413 337 2100
## 3 Jones, Jenna 310 231 4453
By default extract() removes the original column, but you can preserve it with remove = FALSE. (For the format of the regular expression in the above call, see Section 12.4.8.)
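For comparison, the same three components can be pulled out with str_match() alone. This is only a sketch of an alternative, not the approach taken above: the names phone, parts and phoneParts are ours, and the pattern uses \D* (any run of non-digits) between the groups in place of .*.

```r
library(stringr)

phone <- c("(202) 415-3785", "4133372100", "310-231-4453")

# \\D* allows any run of non-digits (or none) between the digit groups.
parts <- str_match(phone, pattern = "(\\d{3})\\D*(\\d{3})\\D*(\\d{4})")

# Column 1 is the whole match; columns 2 to 4 are the three capture groups.
phoneParts <- data.frame(area = parts[, 2],
                         office = parts[, 3],
                         line = parts[, 4])
phoneParts
##   area office line
## 1  202    415 3785
## 2  413    337 2100
## 3  310    231 4453
```

The tidyr version is more convenient inside a pipeline, since it names the new columns and drops the old one in a single call.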
12.4.7 Counting Matches
The function str_count() provides a very convenient way to tally up the number of matches that a given regex has in a string. Here we use it to count the number of words in a string that begin with a lower or uppercase p.
strings <- c("Mary Poppins is practically perfect in every way!",
             "The best-laid plans of mice and men gang oft astray.",
             "Peter Piper picked a peck of pickled peppers.")
strings %>%
  str_count(pattern = "\\b[Pp]\\w*\\b")
## [1] 3 1 6
How might we find the words in a string that contain three or more of the same letter? In this case str_count() would not be useful. However we could try something like this:
"In Patagonia, the peerless Peter Piper picked a peck of pickled peppers." %>%
str_split("\\W+") %>%
unlist() %>%
str_subset(pattern = "([[:alpha:]]).*\\1.*\\1")
## [1] "Patagonia" "peerless" "peppers"
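When using str_count(), keep in mind that matches are found one after another and do not overlap. A short sketch with made-up inputs:

```r
library(stringr)

# Three separate matches of "a".
str_count("banana", pattern = "a")
## [1] 3

# "ana" occurs starting at positions 2 and 4, but the two occurrences
# share a letter, so only the first one is counted.
str_count("banana", pattern = "ana")
## [1] 1

# Like the other stringr functions, str_count() is vectorized.
str_count(c("banana", "kiwi"), pattern = "[aeiou]")
## [1] 3 2
```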
12.4.8 Regex Modes
If you have been practicing consistently with an online regex site, you will have noticed by now that a regex can be accompanied by various options. In most implementations they appear as letters after the closing regex delimiter, like this:
/regex/gm
Some of the most popular options are:
- g: "global," looking for all possible matches in the string;
- i: "case-insensitive" mode, so that letter-characters in the regex match both their upper and lower-case versions;
- m: "multiline" mode, so that the anchors ^ and $ are attached to newlines within the string rather than to the absolute beginning and end of the string;
- x: "white-space" mode, where white-spaces in the regex are ignored unless they are escaped (useful for lining out the regex and inserting comments to explain its operation).
Since stringr has both global and non-global versions of regex functions you probably will not bother with g, but the other options, known technically as modes, can sometimes be useful.
If you would like to set modes to apply to your entire regex, insert it (or them) like this at the beginning of the expression:
(?im)t[aeiou]{1,3}$
In the example above, we are in both case-insensitive and multiline mode, and we are looking for t or T followed by 1, 2 or 3 vowels (upper or lower) at the end of any line in a (possibly) multiline string.
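To watch the two modes at work, here is a minimal sketch on a hypothetical three-line string (the names poem, withModes and caseOnly are ours):

```r
library(stringr)

# Hypothetical three-line string; "\n" separates the lines.
poem <- "Hello tea\nsome TOE\nlast line"

# With (?im): case is ignored and $ can match at the end of every line.
withModes <- str_extract_all(poem, pattern = "(?im)t[aeiou]{1,3}$")
withModes[[1]]
## [1] "tea" "TOE"

# With (?i) alone, $ matches only at the very end of the string,
# where "line" does not fit the pattern.
caseOnly <- str_extract_all(poem, pattern = "(?i)t[aeiou]{1,3}$")
length(caseOnly[[1]])
## [1] 0
```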
Following is an example of the mode to ignore white-space and to ignore case:
myPattern <-
  "(?xi)   # ignore whitespace (x) and ignore case (i)
  \\b      # assert a word-boundary
  (\\w)    # capture the first letter of the first word
  \\w*     # rest of the first word
  \\W+     # one or more non-word characters
  \\1      # repeat the letter captured previously
  \\w*     # rest of the second word
  "
sentences2 %>%
  str_match_all(pattern = myPattern)
## [[1]]
## [,1] [,2]
## [1,] "big bad" "b"
## [2,] "walking warily" "w"
## [3,] "to the" "t"
##
## [[2]]
## [,1] [,2]
## [1,] "He huffs" "H"
## [2,] "puffs peevishly" "p"
##
## [[3]]
## [,1] [,2]
## [1,] "gnarly gargantuan" "g"
## [2,] "bell bottoms" "b"
Due to the presence of the x-flag at the very beginning of the regex, the regex engine knows to ignore white-space throughout, and it will also ignore hash marks (#) and whatever comes after them on a line. This permits the placement of comments within the regular expression. The i-flag directs the regex engine to ignore case when looking for matches. Accordingly, in the example we pick up the extra match "He huffs".
Some people prefer to control regular-expression modes by means of stringr's regex() function:
myPattern <- regex(
  pattern = "
  \\b      # assert a word-boundary
  (\\w)    # capture the first letter of the first word
  \\w*     # rest of the first word
  \\W+     # one or more non-word characters
  \\1      # repeat the letter captured previously
  \\w*     # rest of the second word
  ",
  comments = TRUE,
  ignore_case = TRUE
)
sentences2 %>%
  str_match_all(pattern = myPattern)
## [[1]]
## [,1] [,2]
## [1,] "big bad" "b"
## [2,] "walking warily" "w"
## [3,] "to the" "t"
##
## [[2]]
## [,1] [,2]
## [1,] "He huffs" "H"
## [2,] "puffs peevishly" "p"
##
## [[3]]
## [,1] [,2]
## [1,] "gnarly gargantuan" "g"
## [2,] "bell bottoms" "b"
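The other modes have corresponding arguments in regex() as well. As a brief sketch on a hypothetical two-line string (basket), the following compares a plain pattern, regex() with ignore_case and multiline, and the inline form:

```r
library(stringr)

# Hypothetical two-line string.
basket <- "apple\nBanana"

# Plain pattern: ^ and $ refer to the whole string, and case matters.
str_detect(basket, pattern = "^banana$")
## [1] FALSE

# regex() with ignore_case and multiline behaves like a leading (?im).
str_detect(basket, pattern = regex("^banana$",
                                   ignore_case = TRUE,
                                   multiline = TRUE))
## [1] TRUE

str_detect(basket, pattern = "(?im)^banana$")
## [1] TRUE
```

Which form to use is largely a matter of taste: the inline flags travel with the pattern string, while the regex() arguments spell out the intent by name.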
12.4.9 Practice Exercises
The stringr package comes with fruit, a character-vector giving the names of 80 fruits.
Determine how many fruit-names consist of exactly two words.
Find the two-word fruit-names.
Find the indices of the two-word fruit names.
Find the one-word fruit-names that end in “berry.”
Find the fruit-names that contain more than three vowels.
In the word “banana” the string “an” appears twice in succession, as does the string “na.” Find the fruit-names containing at least one string of length two or more that appears twice in succession.
To the people data frame from this section, add two new variables: first for the first name and last for the last name. The original name variable should be removed.
12.4.10 Solutions to the Practice Exercises
Try this:
wordCount <- fruit %>%
  str_count("\\w+")
sum(wordCount == 2)
## [1] 11
Try this:
fruit %>%
  .[str_count(., "\\w+") == 2]
## [1] "bell pepper" "blood orange" "canary melon" "chili pepper"
## [5] "goji berry" "kiwi fruit" "purple mangosteen" "rock melon"
## [9] "salal berry" "star fruit" "ugli fruit"
Another way is as follows:
fruit %>%
  str_subset("^\\w+\\s+\\w+$")
## [1] "bell pepper" "blood orange" "canary melon" "chili pepper"
## [5] "goji berry" "kiwi fruit" "purple mangosteen" "rock melon"
## [9] "salal berry" "star fruit" "ugli fruit"
Try this:
fruit %>%
  str_detect("^\\w+\\s+\\w+$") %>%
  which()
## [1] 5 9 13 17 32 42 66 72 73 75 79
Try this:
fruit %>%
  str_subset("\\w+berry$")
## [1] "bilberry" "blackberry" "blueberry" "boysenberry" "cloudberry" "cranberry"
## [7] "elderberry" "gooseberry" "huckleberry" "mulberry" "raspberry" "strawberry"
Try this:
vowelCount <- fruit %>%
  str_count("[aeiou]")
fruit[vowelCount > 3]
## [1] "avocado" "blood orange" "breadfruit" "canary melon"
## [5] "cantaloupe" "cherimoya" "chili pepper" "clementine"
## [9] "dragonfruit" "feijoa" "gooseberry" "grapefruit"
## [13] "kiwi fruit" "mandarine" "nectarine" "passionfruit"
## [17] "pineapple" "pomegranate" "purple mangosteen" "tamarillo"
## [21] "tangerine" "ugli fruit" "watermelon"
Try this:
fruit %>%
  str_subset("(\\w{2,})\\1")
## [1] "banana" "coconut" "cucumber" "jujube" "papaya" "salal berry"
Try this:
people %>%
  tidyr::extract(col = name,
                 into = c("last", "first"),
                 regex = "(\\w+)\\W+(\\w+)")
## last first phone
## 1 Lauf Bettina (202) 415-3785
## 2 Bachchan Abhishek 4133372100
## 3 Jones Jenna 310-231-4453