11 Strings
In this chapter we will take a closer look at character-vectors, and in particular at character vectors of length one, which are commonly called “strings.” The ability to manipulate strings is the foundation for all text-processing in computer programming.
11.1 Character Vectors: Strings
Computers work at least as much with text as they do with numbers. In computer science the values that refer to text are called strings.
In R, as in most other programming languages, we use quotes as delimiters, meaning that they mark the beginning and the end of strings. Recall that in R, strings are of type character. For example:
greeting <- "hello"
typeof(greeting)
## [1] "character"
Of course, a single string does not exist on its own in R. Instead it exists as the only element of a character-vector of length 1.
is.vector(greeting)
## [1] TRUE
length(greeting)
## [1] 1
To make strings we can use double quotes or single quotes. Since the string-value does not include the quotes themselves but only what appears between them, it does not make any difference which type of quotes we use:
greeting1 <- "hello"
greeting2 <- 'hello'
greeting1 == greeting2
## [1] TRUE
When we make a character vector of length greater than one, we can even use both single and double quotes:
politeWords <- c("Please?", 'Thank you!')
politeWords
## [1] "Please?" "Thank you!"
Notice that when R prints politeWords to the console it uses double-quotes. Indeed, double-quoting is the recommended and most common way to construct strings in R.
11.2 Characters and Special Characters
Strings are made up of characters: that’s why R calls them “character vectors.” From your point of view as a speaker of the English language, characters would seem to be the things you would have entered on a typewriter, and which can be entered from your computer keyboard as well:
 the lowercase letters a-z;
 the uppercase letters A-Z;
 the digits 0, 1, …, 9 (0-9);
 the punctuation characters: ., -, ?, !, ;, :, etc. (and of course the comma, too!)
 a few other special-use characters: ~, @, #, $, %, _, +, =, and so on;
 and the space, too!
All of the above can be part of a string.
But quote-marks (used in quotation and as apostrophes) can also be part of a string:
"Welcome", she said, "the coffee's on me!"
Since quote-marks are used to delimit strings but can also be part of them, designers of programming languages have to think carefully about how to manage quote-marks. Here’s how it works in R:

If you choose to delimit a string with double-quotes, then you can put single-quotes anywhere you like within the string and they will be treated by the computer as literal single-quotes, not as string-delimiters. Here is an example:
cat("'Hello', she said.")
## 'Hello', she said.

If you delimit with double-quotes and you want to place a double-quote in your string, then you have to escape that double-quote with the backslash character \:
cat("\"Hello\", she said.")
## "Hello", she said.

If you choose to delimit a string with single-quotes, then you can put double-quotes anywhere you like within the string and they will be treated by the computer as literal double-quotes, not as string-delimiters.
cat('"Hello", she said.')
## "Hello", she said.

If you delimit with single-quotes and you want to place a single-quote in your string, then you have to escape that single-quote:
cat('\'Hello\', she said.')
## 'Hello', she said.
In R and in many other programming languages the backslash \ permits the following character to “escape” any special meaning that is otherwise assigned to it by the language. When we write \" we say that we are “escaping” the double-quote; more precisely, we are escaping the special role of the double-quote as a delimiter for strings.
Of course the foregoing implies that the backslash character has a special role in the language: as an escaping-device. So what can we do if we want a literal backslash in our string? Well, we simply escape it by preceding it with a backslash:
cat("up\\down")
## up\down
Another example:
cat("C:\\\\Inetpub\\\\vhosts\\\\example.com")
## C:\\Inetpub\\vhosts\\example.com
So much for “ordinary” characters. But there are special characters, too, sometimes called control characters, that do not represent written symbols. We have seen a couple of them already; the newline character \n is one:
bye <- "Farewell!\n\n"
cat(bye)
## Farewell! # first \n moves us to a new line ...
## # ... which is empty due to the next \n
We have also seen the tab-character \t:
cat("First Name\tLast Name")
## First Name Last Name
Notice that the backslash character is used here to allow the n and t to escape their customary roles as the letters “n” and “t” respectively.
If you ask R (try help(Quotes)), you will learn that there are several control characters, including:
Character  Meaning 

\n  newline 
\r  carriage return 
\t  tab 
\b  backspace 
\a  alert (bell) 
\f  form feed 
\v  vertical tab 
It is worth exploring their effects. Here are a couple of examples^{27}:
cat("Hell\to")
## Hell o
cat("Hell\ro")
## Hell
o
A number of other non-control characters can be generated with the backslash. Unicode characters, for instance, are generated by \u{nnnn}, where the n’s represent hexadecimal digits. Try the following in your console, and see what you get:
cat("\u{2603}") # the Snowman
## ☃
Or, for something zanier:
cat("Hello\u{202e}there, Friend!")
## Hellothere, Friend!
11.3 Basic String Operations
We now introduce a few basic operations for examining, splitting and combining strings. Some of the functions we discuss come from the stringr package, which is one of the packages attached when you load the tidyverse with library().
11.3.1 is. and as.
Recall from Chapter 2 that as.character() coerces other data types into strings:
as.character(3.14)
## [1] "3.14"
as.character(FALSE)
## [1] "FALSE"
as.character(NULL)
## character(0)
Also, is.character() tests whether an object is a character-vector:
is.character(3.14)
## [1] FALSE
11.3.2 Number of Characters
How many characters are in the word “hello”? Let’s try:
length("hello")
## [1] 1
Oh, right, strings don’t exist alone: "hello" is actually a character-vector of length 1. Instead we should use the str_length() function:
str_length("hello")
## [1] 5
11.3.3 Substrings and Trimming
We can pull out pieces of a string with the str_sub() function:
poppins <- "Supercalifragilisticexpialidocious"
str_sub(poppins, start = 10, end = 20)
## [1] "fragilistic"
One can also use str_sub() to replace part of a string with some other string:
str_sub(poppins, start = 10, end = 20) <- "ABCDEFGHIJK"
poppins
## [1] "SupercaliABCDEFGHIJKexpialidocious"
Don’t forget: vectorization frequently applies:
words <- c("Mary", "Poppins", "practically", "perfect")
str_length(words)
## [1] 4 7 11 7
str_sub(words, 1, 3)
## [1] "Mar" "Pop" "pra" "per"
In practical data-analysis situations you’ll often have to work with strings that include unexpected non-printed characters at the beginning or the end, especially if the string once occurred at the end of a line in a text file. For example, consider:
lastWord <- "farewell\r\n"
str_length(lastWord)
## [1] 10
cat(lastWord)
## farewell
From its display on the console, you might infer that lastWord consists of only the eight characters: f, a, r, e, w, e, l, and l. (You can’t see the carriage return followed by the newline.) But str_length() clearly shows that it’s got two characters after the final “l”, even if you can’t see them.
If you think your strings might contain unnecessary leading or trailing whitespace, you can remove it with str_trim():
str_trim(lastWord)
## [1] "farewell"
11.3.4 Changing Cases
You can make all of the letters in a string lowercase:
str_to_lower("My name is Rhonda.")
## [1] "my name is rhonda."
You can make them all uppercase:
str_to_upper("It makes me wanna holler!")
## [1] "IT MAKES ME WANNA HOLLER!"
11.3.5 Splitting Strings
Consider the following character vector that records several dates:
dates <- c("3-14-1963", "04-01-1965", "12-2-1983")
You might want to print them out in some uniform way, using the full name of the month, perhaps. You would then need to access the elements of each date separately, so that you could transform month-numbers to month-names.
str_split() will do the job for you:
str_split(dates, pattern = "-")
## [[1]]
## [1] "3" "14" "1963"
##
## [[2]]
## [1] "04" "01" "1965"
##
## [[3]]
## [1] "12" "2" "1983"
The result is a list with one element for each date in dates. Each element of the list is a character vector containing the elements (month-number, day-number and year) that were demarcated in the original strings by the hyphen -, the value given to the pattern parameter.
If we wish, we may now access the elements of the list and process them in any way we like. We might report the months, for example:
dates %>%
  str_split(pattern = "-") %>%
  unlist() %>%
  .[c(1, 4, 7)] %>%
  as.numeric() %>%
  month.name[.]
## [1] "March" "April" "December"
(Note the use in the code above of the month.name constant provided by R.)
Sometimes it’s handy to split a string word-by-word:
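The call that produced the output below is missing from this copy. A minimal reconstruction, assuming the input sentence implied by that output and that stringr is loaded (the variable names are illustrative only):

```r
library(stringr)

# Assumed input: the sentence implied by the output that follows.
sentence <- "you have won the lottery"

# Split on a single space, then flatten the one-element list to a vector.
sentence_words <- unlist(str_split(sentence, pattern = " "))
sentence_words
```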
## [1] "you" "have" "won" "the" "lottery"
Of course splitting on the space would not have worked if some of the words had been separated by more than one space:
"you have won the  lottery" %>% # two spaces between 'the' and 'lottery'
str_split(pattern = " ") %>%
unlist()
## [1] "you" "have" "won" "the" "" "lottery"
We’ll address this issue soon.
In order to split a string into its constituent characters, split on the string with no characters:
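The original call is not shown here. A sketch that yields the output below, assuming the word being split is "aardvark" (inferred from that output):

```r
library(stringr)

# An empty pattern splits between every pair of characters.
chars <- unlist(str_split("aardvark", pattern = ""))
chars
```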
## [1] "a" "a" "r" "d" "v" "a" "r" "k"
This would be useful if you wanted to, say, count the number of occurrences of “a” in a word:
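The counting code is also missing from this copy. One plausible way to do it, offered as a sketch rather than as the original code, is to compare each character to "a" and sum the resulting logical vector:

```r
library(stringr)

chars <- unlist(str_split("aardvark", pattern = ""))

# chars == "a" is TRUE wherever the character is "a"; sum() counts the TRUEs.
a_count <- sum(chars == "a")
a_count
```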
## [1] 3
But stringr has a function to count occurrences for you:
str_count("aardvark", pattern = "a")
## [1] 3
11.3.6 Pasting and Joining Strings
We are already familiar with paste(), which allows us to paste together the arguments that are passed to it:
paste("Mary", "Poppins")
## [1] "Mary Poppins"
By default paste() separates the input strings with a space, but you can control this with the sep parameter:
paste("Mary", "Poppins", sep = ":")
## [1] "Mary:Poppins"
paste("Yabba","dabba","doo!", sep = "")
## [1] "Yabbadabbadoo!"
If you want the separator to be the empty string by default, then you could use paste0():
paste0("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"
The stringr version of paste0() is str_c():
str_c("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"
What if you had a character-vector whose elements you wanted to paste together? For example, consider:
poppins <- c(
  "practically", "perfect",
  "in", "every", "way"
)
Now suppose you want to paste the elements of poppins together into one string where the words are separated by spaces. str_c() will do the job for you, if you use its collapse parameter:
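The call itself is missing from this copy; a minimal version consistent with the description above and the output below would be:

```r
library(stringr)

poppins <- c("practically", "perfect", "in", "every", "way")

# collapse = " " joins the elements of a single vector, separated by spaces.
joined <- str_c(poppins, collapse = " ")
joined
```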
## [1] "practically perfect in every way"
We’ll call this process joining.
In an atomic vector all of the elements have to be of the same data type (all character, all numerical, etc.). What if you want to join objects of different types? If there are only a few, feel free to type them in as separate arguments to str_c():
str_c("March", 14, 1963, sep = " ")
## [1] "March 14 1963"
If the objects are many, then you could arrange for them to appear as the elements of a list:
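The code for this step did not survive. The sketch below is one way to get the output shown next; it assumes the list contents implied by that output, and it flattens the list with unlist() before joining, which may differ from what the original code did:

```r
library(stringr)

# Assumed contents, inferred from the output that follows.
items <- list("Mary", 343, "Poppins", FALSE)

# unlist() coerces the mixed elements to a character vector,
# which str_c() can then collapse into one string.
joined_items <- str_c(unlist(items), collapse = " ")
joined_items
```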
## [1] "Mary 343 Poppins FALSE"
Joining appears to be the opposite of splitting, but in R that’s not quite so. Suppose, for instance, that you have dates where the month, day and year are separated by hyphens and you want to replace the hyphens with forward slashes:
3-14-1963 # you have this
3/14/1963 # you want this
You could try this:
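The attempted code is missing from this copy. The sketch below reproduces the same pitfall; note that it joins with base R's paste() rather than str_c(), which is a substitution made here so the result does not depend on the stringr version:

```r
library(stringr)

# str_split() returns a list, even for a single input string.
pieces <- str_split("3-14-1963", pattern = "-")

# Joining the list directly deparses its one element instead of
# joining the three strings inside it.
attempt <- paste(pieces, collapse = "/")
attempt
```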
## [1] "c(\"3\", \"14\", \"1963\")"
That’s not what we want. We have to remember that the result of applying str_split() is a list:
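A call along these lines (reconstructed, since the original is not shown) displays the list structure printed below:

```r
library(stringr)

pieces <- str_split("3-14-1963", pattern = "-")
pieces  # a list of length 1 whose only element is a character vector
```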
## [[1]]
## [1] "3" "14" "1963"
We need to unlist prior to the join. The correct procedure is:
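The original code is missing here as well. A sketch of the procedure just described, split, then unlist, then join:

```r
library(stringr)

# Flatten the list returned by str_split() before joining.
parts <- unlist(str_split("3-14-1963", pattern = "-"))
slashed <- str_c(parts, collapse = "/")
slashed
```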
## [1] "3/14/1963"
Now all is well. However, in a subsequent chapter we’ll learn a superior method for performing substitutions in strings.
11.4 Application: Base Conversion
The usual way to represent numbers is in base 10: every digit in the representation of the number stands for a multiple of a power of 10, and these multiples are summed to make the number itself.
For example, consider the number represented as 1046.
 The rightmost digit, the 6 in the so-called “ones” place, represents \(6 \times 10^0 = 6\).
 The digit to its left, the 4 that is in the “tens” place, represents \(4 \times 10^1 = 40\).
 The 0 in the “hundreds” place represents \(0 \times 10^2 = 0\).
 The 1 in the “thousands” place represents \(1 \times 10^3 = 1000\).
The sum gives us the number itself:
\[1 \times 10^3 + 0 \times 10^2 + 4 \times 10^1 + 6 \times 10^0 = 1046.\]
In base-10, the only allowed digits are 0, 1, …, 9, representing respectively the numbers 0, 1, …, 9. In base-10, no digit is allowed to represent a number larger than \(10 - 1 = 9\).
Other bases work the same way. For example, in base-5 the allowed digits are 0, 1, 2, 3, and 4. The number represented in base-5 as 2143 is:
\[2 \times 5^3 + 1 \times 5^2 + 4 \times 5^1 + 3 \times 5^0 = 298.\]
Bases higher than 10 are quite possible. A common such base is base-16, known as hexadecimal. Its digits are:
0, 1, …, 9, a, b, c, d, e, and f.
The number represented by 321 is:
\[3 \times 16^2 + 2 \times 16^1 + 1 \times 16^0 = 801.\]
The number represented by 88fb is:
8 * 16^3 + 8 * 16^2 + 15 * 16^1 + 11 * 16^0
## [1] 35067
This is because the digit f represents 15 and the digit b represents 11.
Because bases higher than 10 require some characters other than 0 through 9 to represent their digits, we will use strings to represent numbers.
We could write a function to find the value of a number represented in base-16:
from_16 <- function(str) {
  # map each hexadecimal digit (used as a name) to its numerical value
  values <- 0:15
  names(values) <- c(0:9, letters[1:6])
  # split the numeral into its individual digits
  digits <- str_split(str, pattern = "") %>% unlist()
  # the leftmost digit goes with the highest power of 16
  powers <- rev(0:(str_length(str) - 1))
  sum(values[digits] * 16^powers)
}
Applying this to the base-16 number 321, we get:
from_16("321")
## [1] 801
We also get:
from_16("88fb")
## [1] 35067
We can generalize this function to arbitrary bases. For simplicity, let’s only work with base 16 or lower:
from_base_rep <- function(n, base) {
  if (base > 16) stop("use base 16 or less")
  digit_list <- c(0:9, letters[1:6])[1:base]
  values <- 0:(base - 1)
  names(values) <- digit_list
  digits <- str_split(n, pattern = "") %>% unlist()
  powers <- rev(0:(str_length(n) - 1))
  sum(values[digits] * base^powers)
}
The value of the base-5 number 2143 is found as:
from_base_rep("2143", 5)
## [1] 298
Going the other way—from a number to its representation in a given base—is a bit more involved, but we can get the idea if we work with an example.
Suppose that we wish to represent the number 298 (or, to be precise, the number represented in base-10 by 298) in base-5.
We begin by finding the ones place. We do this by dividing 298 by 5, getting the quotient and the remainder. The quotient is:
quot <- 298 %/% 5
quot
## [1] 59
There are 59 fives in 298. The remainder is:
rem <- 298 %% 5
rem
## [1] 3
The remainder is our immediate concern, as it tells us that 3 should go into the ones place.
Now we think about what goes into the fives place. Well, there is \(298 - 3 = 295\) left to represent. But since we took away the remainder from 298, we know that what is left must be a multiple of 5, and in fact it is:
295 / 5
## [1] 59
This is the same as the quotient we found earlier, so there are 59 fives left to account for. Dividing 59 by 5, we get a new quotient and a new remainder:
quot <- 59 %/% 5
quot
## [1] 11
rem <- 59 %% 5
rem
## [1] 4
That’s 4 fives, and 11 25s. Putting a 4 in the fives place will account for the remainder of 4, leaving 11 25s to be taken care of by the 25-place and higher.
So far, our base-5 representation is 43.
We are dealing with 11 25s. Notice that 11 is the quotient when you divide 55 by 5.
What should we put in the 25s place? Since we have 11 25s, it is tempting to put 4 of them (the highest number allowed) in the 25s place. The problem is that this does not leave a whole multiple of 125, the next place. But if we follow the pattern of the previous digits, and put the remainder after dividing 11 by 5, then we put 1 in the 25s place and are left with 10 25s, which is the same as 2 125s. This would give us the representation 2143.
Summarizing:
 298 divided by 5 gives a quotient of 59 and a remainder of 3. The first digit of the base-5 representation was set to 3.
 59 divided by 5 gives a quotient of 11 and a remainder of 4. The second digit of the base-5 representation was set to 4.
 11 divided by 5 gives a quotient of 2 and a remainder of 1. The third digit was set to 1.
 2 divided by 5 gives a quotient of 0 and a remainder of 2. The fourth digit was set to 2. No further division is possible.
The pattern seems to be: as long as your new quotient is at least 5, divide by 5 and put the remainder as the next digit. When the new quotient is less than 5, just put it as the final digit.
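We can code this with a while loop:
to_5 <- function(n) {
  digits <- 0:4
  numeral <- ""
  curr <- n
  while (curr >= 5) {
    quot <- curr %/% 5
    rem <- curr %% 5
    # prepend the digit for this remainder to the numeral built so far
    numeral <- str_c(digits[rem + 1], numeral, sep = "")
    curr <- quot
  }
  # the last quotient (less than 5) is the leading digit
  str_c(digits[curr + 1], numeral, sep = "")
}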
Try it out:
to_5(298)
## [1] "2143"
We can generalize to any base. For convenience we will deal only with bases 36 and lower, so that our digits can be drawn from 0 through 9 and the letters a through z:
to_base <- function(b, n) {
  if (b > 36) stop("choose a lower base")
  digits <- c(0:9, letters)[1:b]
  numeral <- ""
  curr <- n
  while (curr >= b) {
    quot <- curr %/% b
    rem <- curr %% b
    numeral <- str_c(digits[rem + 1], numeral, sep = "")
    curr <- quot
  }
  str_c(digits[curr + 1], numeral, sep = "")
}
Try it out:
## 35067 to hexadecimal:
to_base(16, 35067)
## [1] "88fb"
11.5 Formatted Printing
Quite often when we are printing out to the console we want each line to follow some uniform format. This can be accomplished with the sprintf() function.^{28} Let’s begin with an example:
first <- "Mary"
last <- "Poppins"
sprintf(fmt = "%10s%20s", first, last)
## [1] "      Mary             Poppins"
sprintf() builds a string from the strings first and last that were passed to it. The fmt parameter is a string that encodes the format of the result. In this example, the command comes down to:
 create a string of width 10, consisting of six spaces followed by the four characters of “Mary”
 create a string of width 20, consisting of 13 spaces followed by the seven characters of “Poppins”
 The preceding two strings are called fields. We then join the fields, with nothing between them.
Here is the result, cat-ed out:
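The call is missing from this copy; it would presumably pass the same formatted string to cat(), along these lines:

```r
first <- "Mary"
last <- "Poppins"

# Build the two right-justified fields, then print without quotes or [1].
formatted <- sprintf(fmt = "%10s%20s", first, last)
cat(formatted)
```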
##       Mary             Poppins
The “s” in the fmt argument is called a conversion character. It tells sprintf() to expect a string. Each percent sign indicates the beginning of a new field. For each field, the desired field-width should appear between the percent-sign and the conversion character for the field.
In the text above, the names are right-justified, meaning that they appear at the end of their respective fields. If you want a field to be left-justified, insert a hyphen between the percent sign and the field-width, like so:
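The example code is missing here. A sketch, assuming the same two field widths as before with a hyphen added to each:

```r
first <- "Mary"
last <- "Poppins"

# "%-10s": the hyphen flag pads on the right instead of the left.
left_aligned <- sprintf(fmt = "%-10s%-20s", first, last)
cat(left_aligned)
```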
## Mary Poppins
Other common conversion characters are:
 d: an integer
 f: a decimal number (default is 6 digits precision)
 g: a decimal number where the default precision is determined by the number of significant figures in the given number
Here is another example:
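The code for this example is not shown in this copy. The field widths below are assumptions; only the values (Mary, 1955 and 3.2) come from the output that follows:

```r
first <- "Mary"

# %d formats a whole number; %f shows six decimal places by default.
line_f <- sprintf(fmt = "%-10s%-10d%f", first, 1955, 3.2)
cat(line_f)
```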
## Mary 1955 3.200000
The following example is the same as above, except that we retain only the significant figures in the 3.2:
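Again the code is missing; a sketch with the same assumed field widths, swapping the f conversion for g:

```r
first <- "Mary"

# %g drops the trailing zeros that %f would print.
line_g <- sprintf(fmt = "%-10s%-10d%g", first, 1955, 3.2)
cat(line_g)
```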
## Mary 1955 3.2
When you are creating a field for a decimal number, you can specify both the total field-width and the precision together if you separate them with a period (.). Thus, if you want the number 234.5647 to appear right-justified in a field of width 10, showing only the first three decimal places, then try:
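The call is missing from this copy. In the sketch below the first two field widths are assumptions; the %10.3f field is the one described above:

```r
first <- "Mary"

# %10.3f: total width 10, three digits after the decimal point.
line_w <- sprintf(fmt = "%-10s%-10d%10.3f", first, 1955, 234.5647)
cat(line_w)
```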
## Mary 1955 234.565
sprintf() comes in handy when you want your output to appear in nicely-aligned, tabular fashion. Consider this example:
# information for three people:
firstName <- c("Donald", "Gina", "Rohini")
lastName <- c("Duck", "Gentorious", "Lancaster")
age <- c(17, 19, 20)
gpa <- c(3.7, 3.9, 3.823)
for (i in 1:3) {
sprintf(
"%15s%20s%5d%5.2f\n",
firstName[i], lastName[i],
age[i], gpa[i]
) %>%
cat()
}
## Donald Duck 17 3.70
## Gina Gentorious 19 3.90
## Rohini Lancaster 20 3.82
Note the use of “\n” in the fmt argument to ensure that the output appears on separate lines.
You could take advantage of vectorization to avoid the loop:
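The vectorized call is missing from this copy. A sketch, reusing the format string from the loop exactly as it appears here and passing whole vectors to sprintf():

```r
firstName <- c("Donald", "Gina", "Rohini")
lastName <- c("Duck", "Gentorious", "Lancaster")
age <- c(17, 19, 20)
gpa <- c(3.7, 3.9, 3.823)

# sprintf() is vectorized: it returns one formatted string per person.
rows <- sprintf("%15s%20s%5d%5.2f\n", firstName, lastName, age, gpa)
cat(rows)
```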
## Donald Duck 17 3.70
## Gina Gentorious 19 3.90
## Rohini Lancaster 20 3.82
Well, that’s not quite right: the second and third lines begin with a space. This happens because cat() separates its input with a space by default. You can prevent this, however, with the sep parameter of cat():
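The corrected call is likewise missing; a sketch under the same assumptions as the vectorized version, with the separator set to the empty string:

```r
firstName <- c("Donald", "Gina", "Rohini")
lastName <- c("Duck", "Gentorious", "Lancaster")
age <- c(17, 19, 20)
gpa <- c(3.7, 3.9, 3.823)

rows <- sprintf("%15s%20s%5d%5.2f\n", firstName, lastName, age, gpa)

# sep = "" stops cat() from inserting a space between the rows.
cat(rows, sep = "")
```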
## Donald Duck 17 3.70
## Gina Gentorious 19 3.90
## Rohini Lancaster 20 3.82
Glossary
 String

A sequence of characters.
 Control Character

A member of a character set that does not represent a written symbol.
 Unicode

A computing-industry standard for the consistent encoding of text in most of the world’s written languages.
Exercises

Write a function called revStr() that reverses the characters of any string that it is given. The function should take a single parameter:
 str: a character-vector of length 1 (a single string).
Typical examples of use should look like this:
revStr(str = "goodbye")
## [1] "eybdoog"
Hint: Let’s think about how to solve the reversal problem for a specific string, e.g.:
str <- "goodbye"
First, we could turn the string into a list whose only element is the vector of the characters of the string, as follows:
splitString <- str_split(str, pattern = "")
splitString
## [[1]]
## [1] "g" "o" "o" "d" "b" "y" "e"
This could be turned into just the desired vector with the unlist() function:
unlist(splitString)
## [1] "g" "o" "o" "d" "b" "y" "e"
Next, recall that R has a function rev() that, when given a vector, returns a vector with the elements in reverse order:
## [1] "e" "y" "b" "d" "o" "o" "g"
Finally, we would need to convert the reversed vector back into a single string. You have learned a stringr function that will accomplish this.
After you have solved the problem for the specific vector str, encapsulate your work into the function revStr().

A string is said to be a palindrome if it is the same no matter whether it is spelled backwards or forwards. Write a function called palindromeStr() that determines whether or not a given string is a palindrome. The function should take a single parameter:
 str: a character-vector of length 1 (a single string).
It should return TRUE if str is a palindrome, and return FALSE otherwise. Typical examples of use should look like this:
palindromeStr(str = "abba")
## [1] TRUE
palindromeStr("hello")
## [1] FALSE
Hint: Again, you should begin by solving the problem on a specific vector, and only then encapsulate your work into a function. To solve the specific problem, you might use the function revStr() from the previous problem. Another possibility is to use rev() along with the function all() that you met in Chapter 2.

Write a function called subStrings() that returns a vector of the substrings of a given string that have at least a given number of characters. The function should take two arguments:
 str: a character-vector of length 1 (a single string);
 n: the minimum number of characters a substring should have in order to be included in the vector.
Validate the input: if the argument for n is less than 1 or greater than the number of characters in str, then the function should advise the user and cease execution. Typical examples of use should look like this (although it is OK if your output-vector contains the substrings in a different order):
subStrings("hello", 3)
## [1] "hello" "hell" "ello" "hel" "ell" "llo"
subStrings("hello", 6)
## n should be at least 1 and no more than the number
## of characters in str.
Hint: Begin by writing a function, called perhaps subStringFixed(), that when given a string and a specific number \(n\), returns all of the substrings of length exactly \(n\). It might work like this:
subStringFixed(str = "yabbadabbadoo!", n = 6)
## [1] "yabbad" "abbada" "bbadab" "badabb" "adabba" "dabbad" "abbado"
## [8] "bbadoo" "badoo!"
subStringFixed(str = "yabbadabbadoo!", n = 0)
## n should be at least 1 and no more than the number
## of characters in str.


Write a function called subPalindrome() that, for any given string and specified number \(n\), returns a character vector of all the substrings of the string having at least \(n\) characters that are also palindromes. The function should take two arguments:
 str: a character-vector of length 1 (a single string);
 n: the minimum number of characters a substring should have in order to be included in the vector.
Validate the input: if the argument for n is less than 1 or greater than the number of characters in str, then the function should advise the user and cease execution. Typical examples of use should look like this (although it is OK if your output-vector contains the palindromes in a different order):
## Note that palindrome substrings are repeated as many times
## as they occur in the given string:
subPalindrome("yabbadabbadoo!", 2)
## [1] "abbadabba" "bbadabb" "dabbad" "badab" "abba"
## [6] "abba" "ada" "bb" "bb" "oo"
subPalindrome("yabbadabbadoo!", 10)
## character(0)
subPalindrome("yabbadabbadoo!", 0)
## n should be at least 1 and no more than the number
## of characters in str.


Write a function called m111Report() that performs formatted printing from the data frame m111survey in the bcscr package. Given a vector of row numbers, the function will print out the sex, feeling about weight, and GPA of the corresponding individuals. Thus each row in the printout will correspond to an individual in the study. Each row will consist of three fields:
 The first field is 10 characters wide, and contains either “male” or “female”, followed by the appropriate number of spaces.
 The second field is 15 characters wide, and contains either “underweight” or “about right” or “overweight”, followed by the appropriate number of spaces.
 The third field is 5 characters wide, and contains an appropriate number of spaces followed by the grade-point average showing only the first two decimal places. Thus, if a person’s GPA is recorded as 2.714 then the field will be “ 2.71”. (Note that, with the space and the decimal point, the total number of characters is 5, as required.)
A typical example of use is as follows:
m111Report(c(2, 10, 15))
## male about right 2.50
## female overweight NA
## male underweight 3.20
Note that you will have to recode the feelings about weight.