11.3 Basic String Operations

We now introduce a few basic operations for examining, splitting and combining strings. Some of the functions we discuss come from the stringr package, which belongs to the core tidyverse and is therefore attached whenever you run library(tidyverse). If you have not attached the tidyverse, attach stringr directly when you work with strings:

library(stringr)

11.3.1 Is and As

Recall from Chapter 2 that as.character() coerces other data types into strings:

as.character(3.14)
## [1] "3.14"
as.character(FALSE)
## [1] "FALSE"
as.character(NULL)
## character(0)

Also, is.character() tests whether an object is a character vector:

is.character(3.14)
## [1] FALSE
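
As a quick check, the two functions can be combined: a value that is not a character vector becomes one after coercion. The sketch below uses only base R, and the variable name x is ours:

```r
x <- 3.14
is.character(x)
## [1] FALSE
is.character(as.character(x))
## [1] TRUE
# coercion is vectorized: each element becomes its own string
as.character(c(1, 2, 3))
## [1] "1" "2" "3"
```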

11.3.2 Number of Characters

How many characters are in the word “hello”? Let’s try:

length("hello")
## [1] 1

Oh, right, strings don’t exist alone: "hello" is actually a character vector of length 1, and length() counts elements, not characters. Instead we should use the str_length() function:

str_length("hello")
## [1] 5

11.3.3 Substrings and Trimming

We can pull out pieces of a string with the str_sub() function:

poppins <- "Supercalifragilisticexpialidocious"
str_sub(poppins, start = 10, end = 20)
## [1] "fragilistic"

One can also use str_sub() to replace part of a string with some other string:

str_sub(poppins, start = 10, end = 20) <- "ABCDEFGHIJK"
poppins
## [1] "SupercaliABCDEFGHIJKexpialidocious"

Don’t forget: “vector-in, vector-out” usually applies:

words <- c("Mary", "Poppins", "practically", "perfect")
str_length(words)
## [1]  4  7 11  7
str_sub(words, 1, 3)
## [1] "Mar" "Pop" "pra" "per"
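
str_sub() also accepts negative positions, which count backward from the end of the string. Here is a small sketch; the variable name word is ours, chosen so that poppins is not overwritten:

```r
library(stringr)

word <- "Supercalifragilisticexpialidocious"
# -1 is the last character, -5 the fifth from the end
str_sub(word, start = -5, end = -1)
## [1] "cious"
# omitting end runs to the end of the string
str_sub(word, start = 21)
## [1] "expialidocious"
```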

In practical data-analysis situations you’ll often have to work with strings that include unexpected non-printed characters at the beginning or the end, especially if the string once occurred at the end of a line in a text file. For example, consider:

lastWord <- "farewell\r\n"
str_length(lastWord)
## [1] 10
cat(lastWord)
## farewell

From its display on the console, you might infer that lastWord consists of only the eight characters: f, a, r, e, w, e, l, and l. (You can’t see the carriage return followed by the newline.) But str_length() clearly shows that it’s got two characters after the final “l,” even if you can’t see them.

If you think your strings might contain unnecessary leading or trailing whitespace, you can remove it with str_trim():

str_trim(lastWord)
## [1] "farewell"
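
By default str_trim() strips both ends of the string. Its side argument restricts the trimming to one end. The sketch below uses a made-up string, padded, that has whitespace on both sides:

```r
library(stringr)

padded <- "  farewell\r\n"        # hypothetical example string
str_trim(padded, side = "left")   # keeps the trailing \r\n
## [1] "farewell\r\n"
str_trim(padded, side = "right")  # keeps the two leading spaces
## [1] "  farewell"
str_length(str_trim(padded))      # both ends trimmed
## [1] 8
```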

11.3.4 Changing Cases

You can make all of the letters in a string lowercase:

str_to_lower("My name is Rhonda.")
## [1] "my name is rhonda."

You can make them all uppercase:

str_to_upper("It makes me wanna holler!")
## [1] "IT MAKES ME WANNA HOLLER!"
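
One common use of case conversion is comparing strings without regard to case. The sketch below also shows str_to_title(), another stringr function, which capitalizes the first letter of each word. The vector answers is made-up example data:

```r
library(stringr)

answers <- c("Yes", "YES", "yes", "No")   # hypothetical survey responses
# convert to a common case before comparing
str_to_lower(answers) == "yes"
## [1]  TRUE  TRUE  TRUE FALSE
str_to_title("my name is rhonda.")
## [1] "My Name Is Rhonda."
```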

11.3.5 Splitting Strings

Consider the following character vector that records several dates:

dates <- c("3-14-1963", "04-01-1965", "12-2-1983")

You might want to print them out in some uniform way, using the full name of the month, perhaps. Then you would need to gain access to the elements of each date separately, so that you could transform month-numbers to month-names.

str_split() will do the job for you:

str_split(dates, pattern = "-")
## [[1]]
## [1] "3"    "14"   "1963"
## 
## [[2]]
## [1] "04"   "01"   "1965"
## 
## [[3]]
## [1] "12"   "2"    "1983"

The result is a list with one element for each date in dates. Each element of the list is a character vector containing the elements—month-number, day-number and year—that were demarcated in the original strings by the hyphen -, the value given to the pattern parameter.

If we wish, we may now access the elements of the list and process them in any way we like. We might report the months, for example:

dates %>% 
  str_split(pattern = "-") %>% 
  unlist() %>% 
  .[c(1, 4, 7)] %>% 
  as.numeric() %>% 
  month.name[.]
## [1] "March"    "April"    "December"

(Note the use in the code above of the month.name constant provided by R.)
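
The positions c(1, 4, 7) work only because every date splits into exactly three pieces. A less brittle sketch takes the first piece of each list element instead. It relies on base R's vapply(), and the names parts and monthNumbers are ours:

```r
library(stringr)

dates <- c("3-14-1963", "04-01-1965", "12-2-1983")
parts <- str_split(dates, pattern = "-")
# take the first piece (the month) from each date
monthNumbers <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
month.name[monthNumbers]
## [1] "March"    "April"    "December"
```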

Sometimes it’s handy to split a string word-by-word:

"you have won the lottery" %>% 
  str_split(pattern = " ") %>% 
  unlist()
## [1] "you"     "have"    "won"     "the"     "lottery"

Of course, splitting on a single space does not work as intended if some of the words are separated by more than one space:

"you have won the  lottery" %>% # two spaces between 'the' and 'lottery'
  str_split(pattern = " ") %>% 
  unlist()
## [1] "you"     "have"    "won"     "the"     ""        "lottery"

We’ll address this problem soon.

In order to split a string into its constituent characters, split on the string with no characters:

"aardvark" %>% 
  str_split(pattern = "") %>% 
  unlist()
## [1] "a" "a" "r" "d" "v" "a" "r" "k"

This would be useful if you wanted to, say, count the number of occurrences of “a” in a word:

"aardvark" %>% 
  str_split(pattern = "") %>% 
  unlist() %>% 
  .[. == "a"] %>% 
  length()
## [1] 3
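
Since comparing a vector with "a" yields a logical vector, the same count can be had by summing the TRUE values. A sketch, in which the name chars is ours:

```r
library(stringr)

chars <- unlist(str_split("aardvark", pattern = ""))
sum(chars == "a")   # TRUE counts as 1, FALSE as 0
## [1] 3
# tally every letter at once
table(chars)
```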

11.3.6 Pasting and Joining Strings

We are already familiar with paste(), which allows us to paste together the arguments that are passed to it:

paste("Mary", "Poppins")
## [1] "Mary Poppins"

By default paste() separates the input strings with a space, but you can control this with the sep parameter:

paste("Mary", "Poppins", sep = ":")
## [1] "Mary:Poppins"
paste("Yabba","dabba","doo!", sep = "")
## [1] "Yabbadabbadoo!"

If you want the separator to be the empty string by default, then you could use paste0():

paste0("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"

The stringr version of paste0() is str_c():

str_c("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"

What if you had a character-vector whose elements you wanted to paste together? For example, consider:

poppins <- c("practically", "perfect", "in",
             "every", "way")

Now suppose you want to paste the elements of poppins together into one string where the words are separated by spaces. str_c() will do the job for you, if you use its collapse parameter:

poppins %>% 
  str_c(collapse = " ")
## [1] "practically perfect in every way"

We’ll call this process joining.
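
The sep and collapse parameters do different jobs: sep goes between corresponding elements of the arguments, while collapse then joins the resulting vector into a single string. The sketch below uses two made-up vectors, first and last:

```r
library(stringr)

first <- c("Mary", "Bert")        # hypothetical example data
last  <- c("Poppins", "Alfred")
# element-wise: the result is still a vector of length 2
str_c(first, last, sep = " ")
## [1] "Mary Poppins" "Bert Alfred"
# collapse then joins those two strings into one
str_c(first, last, sep = " ", collapse = ", ")
## [1] "Mary Poppins, Bert Alfred"
# base R's paste() accepts collapse as well
paste(first, last, collapse = ", ")
## [1] "Mary Poppins, Bert Alfred"
```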

In an atomic vector all of the elements have to be of the same data type (all character, all numerical, etc.). What if you want to join objects of different types? If there are only a few, feel free to type them in as separate arguments to str_c():

str_c("March", 14, 1963, sep = " ")
## [1] "March 14 1963"

If the objects are many, then you could arrange for them to appear as the elements of a list:

list("Mary", 343, "Poppins", FALSE) %>% 
  str_c(collapse = " ")
## [1] "Mary 343 Poppins FALSE"

Joining appears to be the opposite of splitting, but in R that’s not quite so. Suppose, for instance, that you have dates where the month, day and year are separated by hyphens and you want to replace the hyphens with forward slashes:

3-14-1963  # you have this
3/14/1963  # you want this

You could try this:

"3-14-1963" %>% 
  str_split(pattern = "-") %>% 
  str_c(collapse = "/")
## [1] "c(\"3\", \"14\", \"1963\")"

That’s not what we want. We have to remember that the result of applying str_split() is a list:

"3-14-1963" %>% 
  str_split(pattern = "-")
## [[1]]
## [1] "3"    "14"   "1963"

We need to unlist prior to the join. The correct procedure is:

"3-14-1963" %>% 
  str_split(pattern = "-") %>% 
  unlist() %>% 
  str_c(collapse = "/")
## [1] "3/14/1963"

Now all is well. Soon, though, we’ll learn a superior method for performing substitutions in strings.
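
The procedure above handles a single string. For a whole vector of dates, unlist() would merge all of the pieces into one vector, so one option is to join each list element separately. Here is a sketch using base R's vapply(); the names parts and slashed are ours:

```r
library(stringr)

dates <- c("3-14-1963", "04-01-1965", "12-2-1983")
parts <- str_split(dates, pattern = "-")
# join the pieces of each date with a forward slash
slashed <- vapply(parts, str_c, character(1), collapse = "/")
slashed
## [1] "3/14/1963"  "04/01/1965" "12/2/1983"
```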