11 Strings

Figure 11.1: RTL, by xkcd.

In this chapter we will take a closer look at character-vectors, and in particular at character vectors of length one, which are commonly called “strings.” The ability to manipulate strings is the foundation for all text-processing in computer programming.

11.1 Character Vectors: Strings

Computers work at least as much with text as they do with numbers. In computer science the values that refer to text are called strings.

In R, as in most other programming languages, we use quotes as delimiters, meaning that they mark the beginning and the end of strings. Recall that in R, strings are of type character. For example:

greeting <- "hello"
typeof(greeting)
## [1] "character"

Of course, a single string does not exist on its own in R. Instead it exists as the only element of a character-vector of length 1.

is.vector(greeting)
## [1] TRUE
length(greeting)
## [1] 1

To make strings we can use double quotes or single quotes. Since the string-value does not include the quotes themselves but only what appears between them, it does not make any difference which type of quotes we use:

greeting1 <- "hello"
greeting2 <- 'hello'
greeting1 == greeting2
## [1] TRUE

When we make a character vector of length greater than one, we can even use both single and double quotes:

politeWords <- c("Please?", 'Thank you!')
politeWords
## [1] "Please?"    "Thank you!"

Notice that when R prints politeWords to the console it uses double-quotes. Indeed, double-quoting is the recommended and most common way to construct strings in R.

11.2 Characters and Special Characters

Strings are made up of characters: that’s why R calls them “character vectors.” From your point of view as a speaker of the English language, characters would seem to be the things you would have entered on a typewriter, and which can be entered from your computer keyboard as well:

  • the lower-case letters a-z;
  • the upper case letters A-Z;
  • the digits 0,1, …, 9 (0-9);
  • the punctuation characters: ., -, ?, !, ;, :, etc. (and of course the comma, too!)
  • a few other special-use characters: ~, @, #, $, %, _, +, =, and so on;
  • and the space, too!

All of the above can be part of a string.

But quote-marks (used in quotation and as apostrophes) can also be part of a string:

"Welcome", she said, "the coffee's on me!"

Since quote-marks are used to delimit strings but can also be part of them, designers of programming languages have to think carefully about how to manage quote-marks. Here’s how it works in R:

  • If you choose to delimit a string with double-quotes, then you can put single-quotes anywhere you like within the string and they will be treated by the computer as literal single-quotes, not as string-delimiters. Here is an example:

    cat("'Hello', she said.")
    ## 'Hello', she said.
  • If you delimit with double-quotes and you want to place a double-quote in your string, then you have to escape that double-quote with the backslash character \:

    cat("\"Hello\", she said.")
    ## "Hello", she said.
  • If you choose to delimit a string with single-quotes, then you can put double-quotes anywhere you like within the string and they will be treated by the computer as literal double-quotes, not as string-delimiters.

    cat('"Hello", she said.')
    ## "Hello", she said.
  • If you delimit with single-quotes and you want to place a single-quote in your string, then you have to escape that single-quote:

    cat('\'Hello\', she said.')
    ## 'Hello', she said.

In R and in many other programming languages the backslash \ permits the following character to “escape” any special meaning that is otherwise assigned to it by the language. When we write \" we say that we are “escaping” the double-quote; more precisely, we are escaping the special role of the double-quote as a delimiter for strings.

Of course the foregoing implies that the backslash character has a special role in the language: as an escaping-device. So what can we do if we want a literal backslash in our string? Well, we simply escape it by preceding it with a backslash:

cat("up\\down")
## up\down

Another example:

cat("C:\\\\Inetpub\\\\vhosts\\\\example.com")
## C:\\Inetpub\\vhosts\\example.com
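One way to convince yourself that a doubled backslash in the source stands for a single stored character is to count characters and to compare print() with cat(). The sketch below uses only base R (nchar() is the base counterpart of the str_length() function introduced later in this chapter). The raw-string syntax in the last step assumes R version 4.0.0 or later, and the variable names are made up for illustration.

```r
# "up\\down" is typed with two backslashes but stores only one.
s <- "up\\down"
nchar(s)      # counts the stored characters: u, p, \, d, o, w, n
## [1] 7

# print() shows the string as you would have to type it (escapes included),
# whereas cat() shows the characters it actually contains.
print(s)
## [1] "up\\down"
cat(s, "\n", sep = "")
## up\down

# Assumes R >= 4.0.0: a raw string, written r"(...)", needs no escaping.
rawVersion <- r"(up\down)"
identical(rawVersion, s)
## [1] TRUE
```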

So much for “ordinary” characters. But there are special characters, too, sometimes called control characters, that do not represent written symbols. We have seen a couple of them already; the newline character \n is one:

bye <- "Farewell!\n\n"
cat(bye)
## Farewell!  # first \n moves us to a new line ...
##            # .. which is empty due to the next \n

We have also seen the tab-character \t:

cat("First Name\tLast Name")
## First Name   Last Name

Notice that the backslash character is used here to allow the n and t to escape their customary roles as the letters “n” and “t” respectively.

If you ask R (try help(Quotes)), you will learn that there are several control characters, including:

Table 11.1: Some control characters.

Character   Meaning
\n          newline
\r          carriage return
\t          tab
\b          backspace
\a          alert (bell)
\f          form feed
\v          vertical tab

It is worth exploring their effects. Here are a couple of examples:

cat("Hell\to")
## Hell o
cat("Hell\ro")
## Hell
o

A number of other non-control characters can be generated with the backslash. Unicode characters, for instance, are generated by \u{nnnn}, where the n’s represent hexadecimal digits. Try the following in your console, and see what you get:

cat("\u{2603}")  # the Snowman
## ☃

Or, for something zanier:

cat("Hello\u{202e}there, Friend!")
## Hello‮there, Friend!

11.3 Basic String Operations

We now introduce a few basic operations for examining, splitting and combining strings. Some of the functions we discuss come from the stringr package, which is installed along with the tidyverse but is not automatically attached. Make sure you attach it when you work with strings:

library(stringr)

11.3.1 Is and As

Recall from Chapter 2 that as.character() coerces other data types into strings:

as.character(3.14)
## [1] "3.14"
as.character(FALSE)
## [1] "FALSE"
as.character(NULL)
## character(0)

Also, is.character() tests whether an object is a character-vector:

is.character(3.14)
## [1] FALSE

11.3.2 Number of Characters

How many characters are in the word “hello”? Let’s try:

length("hello")
## [1] 1

Oh, right, strings don’t exist alone: "hello" is actually a character-vector of length 1. Instead we should use the str_length() function:

str_length("hello")
## [1] 5

11.3.3 Substrings and Trimming

We can pull out pieces of a string with the str_sub() function:

poppins <- "Supercalifragilisticexpialidocious"
str_sub(poppins, start = 10, end = 20)
## [1] "fragilistic"

One can also use str_sub() to replace part of a string with some other string:

str_sub(poppins, start = 10, end = 20) <- "ABCDEFGHIJK"
poppins
## [1] "SupercaliABCDEFGHIJKexpialidocious"

Don’t forget: “vector-in, vector-out” usually applies:

words <- c("Mary", "Poppins", "practically", "perfect")
str_length(words)
## [1]  4  7 11  7
str_sub(words, 1, 3)
## [1] "Mar" "Pop" "pra" "per"

In practical data-analysis situations you’ll often have to work with strings that include unexpected non-printed characters at the beginning or the end, especially if the string once occurred at the end of a line in a text file. For example, consider:

lastWord <- "farewell\r\n"
str_length(lastWord)
## [1] 10
cat(lastWord)
## farewell

From its display on the console, you might infer that lastWord consists of only the eight characters: f, a, r, e, w, e, l, and l. (You can’t see the carriage return followed by the newline.) But str_length() clearly shows that it’s got two characters after the final “l,” even if you can’t see them.

If you think your strings might contain unnecessary leading or trailing white-space, you can remove it with str_trim():

str_trim(lastWord)
## [1] "farewell"
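If stringr is not available, base R offers trimws(), which by default also strips spaces, tabs, carriage returns and newlines from both ends. The following is a small sketch; the string padded is made up for illustration. stringr's str_trim() has a similar side argument for trimming only one end.

```r
lastWord <- "farewell\r\n"
trimmed <- trimws(lastWord)     # remove white-space from both ends
nchar(trimmed)
## [1] 8

# The 'which' argument restricts trimming to one side.
padded <- "  hello  "           # hypothetical example string
trimws(padded, which = "left")
## [1] "hello  "
```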

11.3.4 Changing Cases

You can make all of the letters in a string lowercase:

str_to_lower("My name is Rhonda.")
## [1] "my name is rhonda."

You can make them all uppercase:

str_to_upper("It makes me wanna holler!")
## [1] "IT MAKES ME WANNA HOLLER!"

11.3.5 Splitting Strings

Consider the following character vector that records several dates:

dates <- c("3-14-1963", "04-01-1965", "12-2-1983")

You might want to print them out in some uniform way, using the full name of the month, perhaps. Then you would need to gain access to the elements of each date separately, so that you could transform month-numbers to month-names.

str_split() will do the job for you:

str_split(dates, pattern = "-")
## [[1]]
## [1] "3"    "14"   "1963"
## 
## [[2]]
## [1] "04"   "01"   "1965"
## 
## [[3]]
## [1] "12"   "2"    "1983"

The result is a list with one element for each date in dates. Each element of the list is a character vector containing the elements—month-number, day-number and year—that were demarcated in the original strings by the hyphen -, the value given to the pattern parameter.

If we wish, we may now access the elements of the list and process them in any way we like. We might report the months, for example:

dates %>% 
  str_split(pattern = "-") %>% 
  unlist() %>% 
  .[c(1, 4, 7)] %>% 
  as.numeric() %>% 
  month.name[.]
## [1] "March"    "April"    "December"

(Note the use in the code above of the month.name constant provided by R.)
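The index vector c(1, 4, 7) works only because every date splits into exactly three pieces. A sketch of an alternative, which takes the first piece of each list element however many dates there are, is shown below. It uses base R's strsplit() and vapply() so that it runs without any packages; str_split(dates, pattern = "-") should produce the same list.

```r
dates <- c("3-14-1963", "04-01-1965", "12-2-1983")

# One character vector per date; fixed = TRUE treats "-" literally.
parts <- strsplit(dates, split = "-", fixed = TRUE)

# Take the first piece (the month) of each element and convert it to a number.
monthNumbers <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
month.name[monthNumbers]
## [1] "March"    "April"    "December"
```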

Sometimes it’s handy to split a string word-by-word:

"you have won the lottery" %>% 
  str_split(pattern = " ") %>% 
  unlist()
## [1] "you"     "have"    "won"     "the"     "lottery"

Of course splitting on the space would not have worked if some of the words had been separated by more than one space:

"you have won the  lottery" %>% # two spaces between 'the' and 'lottery'
  str_split(pattern = " ") %>% 
  unlist()
## [1] "you"     "have"    "won"     "the"     ""        "lottery"

We’ll address this problem soon.
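In the meantime, one stop-gap that needs no new tools is to drop the empty strings after splitting. This is only a sketch: it copes with repeated spaces but not with tabs or other white-space, and it uses base R's strsplit() in place of str_split() so that it runs without any packages.

```r
sentence <- "you have won the  lottery"   # two spaces before 'lottery'
pieces <- unlist(strsplit(sentence, split = " ", fixed = TRUE))
words <- pieces[pieces != ""]             # keep only the non-empty pieces
words
## [1] "you"     "have"    "won"     "the"     "lottery"
```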

In order to split a string into its constituent characters, split on the string with no characters:

"aardvark" %>% 
  str_split(pattern = "") %>% 
  unlist()
## [1] "a" "a" "r" "d" "v" "a" "r" "k"

This would be useful if you wanted to, say, count the number of occurrences of “a” in a word:

"aardvark" %>% 
  str_split(pattern = "") %>% 
  unlist() %>% 
  .[. == "a"] %>% 
  length()
## [1] 3
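Comparing a vector with "a" gives a logical vector, and sum() counts its TRUE values, so the subsetting step can be replaced with a sum. A base-R sketch of the same count (strsplit() with an empty split string stands in for str_split() here):

```r
chars <- unlist(strsplit("aardvark", split = ""))
chars == "a"
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
sum(chars == "a")    # TRUE counts as 1, FALSE as 0
## [1] 3
```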

11.3.6 Pasting and Joining Strings

We are already familiar with paste(), which allows us to paste together the arguments that are passed to it:

paste("Mary", "Poppins")
## [1] "Mary Poppins"

By default paste() separates the input strings with a space, but you can control this with the sep parameter:

paste("Mary", "Poppins", sep = ":")
## [1] "Mary:Poppins"
paste("Yabba","dabba","doo!", sep = "")
## [1] "Yabbadabbadoo!"

If you want the separator to be the empty string by default, then you could use paste0():

paste0("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"

The stringr version of paste0() is str_c():

str_c("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"

What if you had a character-vector whose elements you wanted to paste together? For example, consider:

poppins <- c("practically", "perfect", "in",
             "every", "way")

Now suppose you want to paste the elements of poppins together into one string where the words are separated by spaces. str_c() will do the job for you, if you use its collapse parameter:

poppins %>% 
  str_c(collapse = " ")
## [1] "practically perfect in every way"

We’ll call this process joining.

In an atomic vector all of the elements have to be of the same data type (all character, all numerical, etc.). What if you want to join objects of different types? If there are only a few, feel free to type them in as separate arguments to str_c():

str_c("March", 14, 1963, sep = " ")
## [1] "March 14 1963"

If the objects are many, then you could arrange for them to appear as the elements of a list:

list("Mary", 343, "Poppins", FALSE) %>% 
  str_c(collapse = " ")
## [1] "Mary 343 Poppins FALSE"

Joining appears to be the opposite of splitting, but in R that’s not quite so. Suppose, for instance, that you have dates where the month, day and year are separated by hyphens and you want to replace the hyphens with forward slashes:

3-14-1963  # you have this
3/14/1963  # you want this

You could try this:

"3-14-1963" %>% 
  str_split(pattern = "-") %>% 
  str_c(collapse = "/")
## [1] "c(\"3\", \"14\", \"1963\")"

That’s not what we want. We have to remember that the result of applying str_split() is a list:

"3-14-1963" %>% 
  str_split(pattern = "-")
## [[1]]
## [1] "3"    "14"   "1963"

We need to unlist prior to the join. The correct procedure is:

"3-14-1963" %>% 
  str_split(pattern = "-") %>% 
  unlist() %>% 
  str_c(collapse = "/")
## [1] "3/14/1963"

Now all is well. Soon, though, we’ll learn a superior method for performing substitutions in strings.
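The split-unlist-join procedure handles one date at a time. For a whole vector of dates, one possibility is to join the pieces of each list element separately. The sketch below uses base R's strsplit(), vapply() and paste() (counterparts of str_split() and str_c()), so it runs without any packages; the name slashDates is made up for illustration.

```r
dates <- c("3-14-1963", "04-01-1965", "12-2-1983")
parts <- strsplit(dates, split = "-", fixed = TRUE)

# Join the pieces of each date with "/", returning one string per date.
slashDates <- vapply(parts, paste, character(1), collapse = "/")
slashDates
## [1] "3/14/1963"  "04/01/1965" "12/2/1983"
```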

11.4 Formatted Printing

Quite often when we are printing out to the console we want each line to follow some uniform format. This can be accomplished with the sprintf() function. Let’s begin with an example:

first <- "Mary"
last <- "Poppins"
sprintf(fmt = "%10s%20s", first, last)
## [1] "      Mary             Poppins"

sprintf() builds a string from the strings first and last that were passed to it. The fmt parameter is a string that encodes the format of the result. In this example, the command comes down to:

  • create a string of width 10, consisting of six spaces followed by the four characters of “Mary”;
  • create a string of width 20, consisting of 13 spaces followed by the seven characters of “Poppins”;
  • join these two strings, which are called fields, with nothing between them.

Here is the result, cated out:

sprintf(fmt = "%10s%20s", first, last) %>% 
  cat()
##       Mary             Poppins

The “s” in the fmt argument is called a conversion character. It tells sprintf() to expect a string. Each percent sign indicates the beginning of a new field. For each field, the desired field-width should appear between the percent-sign and the conversion character for the field.

In the text above, the names are right-justified, meaning that they appear at the end of their respective fields. If you want a field to be left-justified, insert a hyphen immediately after the percent sign, before the field-width, like so:

# left-justify both fields:
sprintf(fmt = "%-10s%-20s", first, last) %>% cat()
## Mary      Poppins

Other common conversion characters are:

  • d: an integer
  • f: a decimal number (default is 6 digits precision)
  • g: a decimal number where the default precision is determined by the number of significant figures in the given number

Here is another example:

sprintf(fmt = "%-10s%-10d%-10f", "Mary", 1955, 3.2) %>% cat()
## Mary      1955      3.200000

The following example is the same as above, except that we retain only the significant figures in the 3.2:

sprintf(fmt = "%-10s%-10d%-10g", "Mary", 1955, 3.2) %>% cat()
## Mary      1955      3.2

When you are creating a field for a decimal number, you can specify both the total field-width and the precision together if you separate them with a period (.). Thus, if you want the number 234.5647 to appear left-justified in a field of width 10, rounded to three decimal places, then try:

sprintf(fmt = "%-10s%-10d%-10.3f", "Mary", 1955, 234.5647) %>% 
  cat()
## Mary      1955      234.565
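To see how width, precision and justification interact, it can help to format a single number several ways. In this sketch the vertical bars are added only to make the edges of each field visible. The zero-padding flag in the last call is a standard sprintf() feature that this section does not otherwise cover.

```r
x <- 234.5647
sprintf("|%10.3f|", x)     # right-justified, width 10, 3 decimal places
## [1] "|   234.565|"
sprintf("|%-10.3f|", x)    # left-justified in the same field
## [1] "|234.565   |"
sprintf("|%.1f|", x)       # precision only: no minimum width
## [1] "|234.6|"
sprintf("|%05d|", 42L)     # pad an integer with leading zeros to width 5
## [1] "|00042|"
```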

sprintf() comes in handy when you want your output to appear in nicely-aligned, tabular fashion. Consider this example:

# information for three people:
firstName <- c("Donald", "Gina", "Rohini")
lastName <- c("Duck", "Gentorious", "Lancaster")
age <- c(17, 19, 20)
gpa <- c(3.7, 3.9, 3.823)
for (i in 1:3) {
  sprintf("%-15s%-20s%-5d%-5.2f\n", 
          firstName[i], lastName[i], age[i], gpa[i]) %>% 
    cat()
}
## Donald         Duck                17   3.70 
## Gina           Gentorious          19   3.90 
## Rohini         Lancaster           20   3.82

Note the use of “\n” in the fmt argument to ensure that the output appears on separate lines.

You could take advantage of vectorization to avoid the loop:

sprintf("%-15s%-20s%-5d%-5.2f\n", 
        firstName, lastName, age, gpa) %>% 
  cat()
## Donald         Duck                17   3.70 
##  Gina           Gentorious          19   3.90 
##  Rohini         Lancaster           20   3.82

Well, that’s not quite right: the second and third lines begin with a space. This happens because cat() separates its input with a space by default. You can prevent this, however, with the sep parameter of cat():

sprintf("%-15s%-20s%-5d%-5.2f\n", 
        firstName, lastName, age, gpa) %>% 
  cat(sep = "")
## Donald         Duck                17   3.70 
## Gina           Gentorious          19   3.90 
## Rohini         Lancaster           20   3.82
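The same field-widths can be reused with the s conversion character to print a header row above the table. A sketch, in which the column titles and the names header and rows are made up for illustration:

```r
firstName <- c("Donald", "Gina", "Rohini")
lastName <- c("Duck", "Gentorious", "Lancaster")
age <- c(17, 19, 20)
gpa <- c(3.7, 3.9, 3.823)

# The header uses the same widths as the data rows, with %s in every field.
header <- sprintf("%-15s%-20s%-5s%-5s\n", "First", "Last", "Age", "GPA")
rows <- sprintf("%-15s%-20s%-5d%-5.2f\n", firstName, lastName, age, gpa)

# Every line has the same number of characters, so the columns line up.
nchar(c(header, rows))
## [1] 46 46 46 46
cat(header, rows, sep = "")
```

Because each field is padded to a fixed width, the header titles should sit directly above the left edge of their columns.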

Glossary

String

A sequence of characters.

Control Character

A member of a character set that does not represent a written symbol.

Unicode

A computing-industry standard for the consistent encoding of text in most of the world’s written languages.

Exercises

  1. Write a function called revStr() that reverses the characters of any string that it is given. The function should take a single parameter:

    • str: a character-vector of length 1 (a single string).

    Typical examples of use should look like this:

    revStr(str = "goodbye")
    ## [1] "eybdoog"

    Hint: Let’s think about how to solve the reversal problem for a specific string, e.g.:

    str <- "goodbye"

    First, we could turn the string into a list whose only element is the vector of the characters of the string, as follows:

    splitString <- str_split(str, pattern = "")
    splitString
    ## [[1]]
    ## [1] "g" "o" "o" "d" "b" "y" "e"

    This could be turned into just the desired vector with the unlist() function:

    unlist(splitString)
    ## [1] "g" "o" "o" "d" "b" "y" "e"

    Next, recall that R has a function rev() that, when given a vector, returns a vector with the elements in reverse order:

    rev(unlist(splitString))
    ## [1] "e" "y" "b" "d" "o" "o" "g"

    Finally, we would need to convert the reversed vector back into a single string. You have learned a stringr function that will accomplish this.

    After you have solved the problem for the specific vector str, encapsulate your work into the function revStr().

  2. A string is said to be a palindrome if it is the same no matter whether it is spelled backwards or forwards. Write a function called palindromeStr() that determines whether or not a given string is a palindrome. The function should take a single parameter:

    • str: a character-vector of length 1 (a single string).

    It should return TRUE if str is a palindrome, and return FALSE otherwise. Typical examples of use should look like this:

    palindromeStr(str = "abba")
    ## [1] TRUE
    palindromeStr("hello")
    ## [1] FALSE

    Hint: Again, you should begin by solving the problem on a specific vector, and only then encapsulate your work into a function. To solve the specific problem, you might use the function revStr() from the previous problem. Another possibility is to use rev() along with the function all() that you met in Chapter 2.

  3. Write a function called subStrings() that returns a vector of the substrings of a given string that have at least a given number of characters. The function should take two arguments:

    • str: a character-vector of length 1 (a single string);
    • n: the minimum number of characters a substring should have in order to be included in the vector.

    Validate the input: if the argument for n is less than 1 or greater than the number of characters in str, then the function should advise the user and cease execution. Typical examples of use should look like this (although it is OK if your output-vector contains the sub-strings in a different order):

    subStrings("hello", 3)
    ## [1] "hello" "hell"  "ello"  "hel"   "ell"   "llo"
    subStrings("hello", 6)
    ## n should be at least 1 and no more than the number
    ## of characters in str.

    Hint: Begin by writing a function, called perhaps subStringFixed(), that when given a string and a specific number \(n\), returns all of the substrings of length exactly \(n\). It might work like this:

    subStringFixed(str = "yabbadabbadoo!", n = 6)
    ## [1] "yabbad" "abbada" "bbadab" "badabb" "adabba" "dabbad" "abbado" "bbadoo" "badoo!"
    subStringFixed(str = "yabbadabbadoo!", n = 0)
    ## n should be at least 1 and no more than the number
    ## of characters in str.
  4. Write a function called subPalindrome() that, for any given string and specified number \(n\), returns a character vector of all the substrings of the string having at least \(n\) characters that are also palindromes. The function should take two arguments:

    • str: a character-vector of length 1 (a single string);
    • n: the minimum number of characters a substring should have in order to be included in the vector.

    Validate the input: if the argument for n is less than 1 or greater than the number of characters in str, then the function should advise the user and cease execution. Typical examples of use should look like this (although it is OK if your output-vector contains the palindromes in a different order):

    ## Note that palindrome substrings are repeated as many times
    ## as they occur in the given string:
    subPalindrome("yabbadabbadoo!", 2)
    ##  [1] "abbadabba" "bbadabb"   "dabbad"    "badab"     "abba"      "abba"      "ada"      
    ##  [8] "bb"        "bb"        "oo"
    subPalindrome("yabbadabbadoo!", 10)
    ## character(0)
    subPalindrome("yabbadabbadoo!", 0)
    ## n should be at least 1 and no more than the number
    ## of characters in str.
  5. Write a function called m111Report() that performs formatted printing from the data frame m111survey in the bcscr package. Given a vector of row numbers, the function will print out the sex, feeling about weight, and GPA of the corresponding individuals. Thus each row in the printout will correspond to an individual in the study. Each row will consist of three fields:

    • The first field is 10 characters wide, and contains either “male” or “female,” followed by the appropriate number of spaces.
    • The second field is 15 characters wide, and contains either “underweight” or “about right” or “overweight,” followed by the appropriate number of spaces.
    • The third field is 5 characters wide, and contains an appropriate number of spaces followed by the grade-point average showing only the first two decimal places. Thus, if a person’s GPA is recorded as 2.714 then the field will be " 2.71". (Note that, with the space and the decimal point, the total number of characters is 5, as required.)

    A typical example of use is as follows:

    m111Report(c(2, 10, 15))
    ## male      about right     2.50
    ## female    overweight        NA
    ## male      underweight     3.20

    Note that you will have to re-code the feelings about weight.