11 Strings
In this chapter we will take a closer look at character-vectors, and in particular at character vectors of length one, which are commonly called “strings.” The ability to manipulate strings is the foundation for all text-processing in computer programming.
11.1 Character Vectors: Strings
Computers work at least as much with text as they do with numbers. In computer science the values that refer to text are called strings.
In R, as in most other programming languages, we use quotes as delimiters, meaning that they mark the beginning and the end of strings. Recall that in R, strings are of type character. For example:
greeting <- "hello"
typeof(greeting)
## [1] "character"
Of course, a single string does not exist on its own in R. Instead it exists as the only element of a character-vector of length 1.
is.vector(greeting)
## [1] TRUE
length(greeting)
## [1] 1
To make strings we can use double quotes or single quotes. Since the string-value does not include the quotes themselves but only what appears between them, it does not make any difference which type of quotes we use:
greeting1 <- "hello"
greeting2 <- 'hello'
greeting1 == greeting2
## [1] TRUE
When we make a character vector of length greater than one, we can even use both single and double quotes:
politeWords <- c("Please?", 'Thank you!')
politeWords
## [1] "Please?" "Thank you!"
Notice that when R prints politeWords to the console it uses double-quotes. Indeed, double-quoting is the recommended and most common way to construct strings in R.
11.2 Characters and Special Characters
Strings are made up of characters: that’s why R calls them “character vectors.” From your point of view as a speaker of the English language, characters would seem to be the things you would have entered on a typewriter, and which can be entered from your computer keyboard as well:
- the lower-case letters a-z;
- the upper case letters A-Z;
- the digits 0,1, …, 9 (0-9);
- the punctuation characters: ., -, ?, !, ;, :, etc. (and of course the comma, too!)
- a few other special-use characters: ~, @, #, $, %, _, +, =, and so on;
- and the space, too!
All of the above can be part of a string.
But quote-marks (used in quotation and as apostrophes) can also be part of a string:
"Welcome", she said, "the coffee's on me!"
Since quote-marks are used to delimit strings but can also be part of them, designers of programming languages have to think carefully about how to manage quote-marks. Here’s how it works in R:
- If you choose to delimit a string with double-quotes, then you can put single-quotes anywhere you like within the string and they will be treated by the computer as literal single-quotes, not as string-delimiters. Here is an example:
cat("'Hello', she said.")
## 'Hello', she said.
- If you delimit with double-quotes and you want to place a double-quote in your string, then you have to escape that double-quote with the backslash character \:
cat("\"Hello\", she said.")
## "Hello", she said.
- If you choose to delimit a string with single-quotes, then you can put double-quotes anywhere you like within the string and they will be treated by the computer as literal double-quotes, not as string-delimiters.
cat('"Hello", she said.')
## "Hello", she said.
- If you delimit with single-quotes and you want to place a single-quote in your string, then you have to escape that single-quote:
cat('\'Hello\', she said.')
## 'Hello', she said.
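As a quick check that the choice of delimiter does not change the string itself, the following sketch (base R only; the variable names are just for illustration) compares the two ways of writing each of the strings above:

```r
# The same string, written with double-quote and with single-quote delimiters
dq <- "\"Hello\", she said."
sq <- '"Hello", she said.'
identical(dq, sq)    # the two spellings store the same characters

# Likewise for the string that contains literal single-quotes
dq2 <- "'Hello', she said."
sq2 <- '\'Hello\', she said.'
identical(dq2, sq2)
```

Both comparisons should report TRUE, since the delimiters and the escaping backslashes are not part of the stored string.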
In R and in many other programming languages the backslash \ permits the following character to “escape” any special meaning that is otherwise assigned to it by the language. When we write \" we say that we are “escaping” the double-quote; more precisely, we are escaping the special role of the double-quote as a delimiter for strings.
Of course the foregoing implies that the backslash character has a special role in the language: as an escaping-device. So what can we do if we want a literal backslash in our string? Well, we simply escape it by preceding it with a backslash:
cat("up\\down")
## up\down
Another example:
cat("C:\\\\Inetpub\\\\vhosts\\\\example.com")
## C:\\Inetpub\\vhosts\\example.com
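It can help to see that the escaping backslash belongs to how we type the string, not to the string itself. Here is a minimal sketch in base R (the name x is only for illustration):

```r
x <- "up\\down"
nchar(x)        # counts the stored characters: u, p, \, d, o, w, n
print(x)        # print() displays the escaped form, with two backslashes
cat(x, "\n")    # cat() displays the string itself, with one backslash
```

This is also why the console shows doubled backslashes when a string is printed rather than cat-ed.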
So much for “ordinary” characters. But there are special characters, too, sometimes called control characters, that do not represent written symbols. We have seen a couple of them already; the newline character \n is one:
bye <- "Farewell!\n\n"
cat(bye)
## Farewell! # first \n moves us to a new line ...
## # .. which is empty due the next \n
We have also seen the tab-character \t:
cat("First Name\tLast Name")
## First Name Last Name
Notice that the backslash character is used here to allow the n and t to escape their customary roles as the letters “n” and “t” respectively.
If you ask R (try help(Quotes)), you will learn that there are several control characters, including:
Character | Meaning |
---|---|
\n | newline |
\r | carriage return |
\t | tab |
\b | backspace |
\a | alert (bell) |
\f | form feed |
\v | vertical tab |
It is worth exploring their effects. Here are a couple of examples:
cat("Hell\to")
## Hell o
cat("Hell\ro")
## Hello
A number of other non-control characters can be generated with the backslash. Unicode characters, for instance, are generated by \u{nnnn}, where the n’s represent hexadecimal digits. Try the following in your console, and see what you get:
cat("\u{2603}") # the Snowman
## ☃
Or, for something zanier:
cat("Hello\u{202e}there, Friend!")
## Hellothere, Friend!
11.3 Basic String Operations
We now introduce a few basic operations for examining, splitting and combining strings. Some of the functions we discuss come from the stringr package, which is one of the packages that are attached when you load the tidyverse with library().
11.3.1 is. and as.
Recall from Chapter 2 that as.character() coerces other data types into strings:
as.character(3.14)
## [1] "3.14"
as.character(FALSE)
## [1] "FALSE"
as.character(NULL)
## character(0)
Also, is.character() tests whether an object is a character-vector:
is.character(3.14)
## [1] FALSE
11.3.2 Number of Characters
How many characters are in the word “hello”? Let’s try:
length("hello")
## [1] 1
Oh, right, strings don’t exist alone: "hello" is actually a character-vector of length 1. Instead we should use the str_length() function:
str_length("hello")
## [1] 5
11.3.3 Substrings and Trimming
We can pull out pieces of a string with the str_sub() function:
poppins <- "Supercalifragilisticexpialidocious"
str_sub(poppins, start = 10, end = 20)
## [1] "fragilistic"
One can also use str_sub() to replace part of a string with some other string:
str_sub(poppins, start = 10, end = 20) <- "ABCDEFGHIJK"
poppins
## [1] "SupercaliABCDEFGHIJKexpialidocious"
Don’t forget: vectorization frequently applies:
words <- c("Mary", "Poppins", "practically", "perfect")
str_length(words)
## [1] 4 7 11 7
str_sub(words, 1, 3)
## [1] "Mar" "Pop" "pra" "per"
In practical data-analysis situations you’ll often have to work with strings that include unexpected non-printed characters at the beginning or the end, especially if the string once occurred at the end of a line in a text file. For example, consider:
lastWord <- "farewell\r\n"
str_length(lastWord)
## [1] 10
cat(lastWord)
## farewell
From its display on the console, you might infer that lastWord consists of only the eight characters: f, a, r, e, w, e, l, and l. (You can’t see the carriage return followed by the newline.) But str_length() clearly shows that it’s got two characters after the final “l”, even if you can’t see them.
If you think your strings might contain unnecessary leading or trailing white-space, you can remove it with str_trim():
str_trim(lastWord)
## [1] "farewell"
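If only one end of the string needs cleaning, str_trim() also accepts a side argument. The sketch below assumes the stringr package is installed and attached; the string padded is a made-up example:

```r
library(stringr)

padded <- "  farewell\r\n"
str_trim(padded, side = "left")     # removes only the two leading spaces
str_trim(padded, side = "right")    # removes only the trailing \r\n
str_trim(padded)                    # default side = "both"
```

Trimming one side at a time is handy when leading indentation is meaningful but line endings are not.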
11.3.4 Changing Cases
You can make all of the letters in a string lowercase:
str_to_lower("My name is Rhonda.")
## [1] "my name is rhonda."
You can make them all uppercase:
str_to_upper("It makes me wanna holler!")
## [1] "IT MAKES ME WANNA HOLLER!"
11.3.5 Splitting Strings
Consider the following character vector that records several dates:
dates <- c("3-14-1963", "04-01-1965", "12-2-1983")
You might want to print them out in some uniform way, using the full name of the month, perhaps. You would then need to access the elements of each date separately, so that you could transform month-numbers to month-names.
str_split() will do the job for you:
str_split(dates, pattern = "-")
## [[1]]
## [1] "3" "14" "1963"
##
## [[2]]
## [1] "04" "01" "1965"
##
## [[3]]
## [1] "12" "2" "1983"
The result is a list with one element for each date in dates. Each element of the list is a character vector containing the elements (month-number, day-number and year) that were demarcated in the original strings by the hyphen -, the value given to the pattern parameter.
If we wish, we may now access the elements of the list and process them in any way we like. We might report the months, for example:
dates %>%
str_split(pattern = "-") %>%
unlist() %>%
.[c(1, 4, 7)] %>%
as.numeric() %>%
month.name[.]
## [1] "March" "April" "December"
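The indices 1, 4 and 7 work only because every date splits into exactly three pieces. An alternative sketch, assuming stringr is attached, takes the first piece of each list element instead (the names parts and month_numbers are illustrative):

```r
library(stringr)

dates <- c("3-14-1963", "04-01-1965", "12-2-1983")
parts <- str_split(dates, pattern = "-")

# Pull the first piece (the month) out of each element of the list
month_numbers <- as.numeric(sapply(parts, function(p) p[1]))
month.name[month_numbers]
```

This version does not depend on how many dates there are or on hard-coded positions in the unlisted vector.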
(Note the use in the code above of the month.name constant provided by R.)
Sometimes it’s handy to split a string word-by-word:
## [1] "you" "have" "won" "the" "lottery"
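A call along the following lines yields a word-by-word split like the one above. It is a sketch that assumes stringr is attached and that the words are separated by single spaces:

```r
library(stringr)

# Split on the space, then flatten the one-element list to a vector
words <- unlist(str_split("you have won the lottery", pattern = " "))
words
```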
Of course splitting on the space would not have worked if some of the words had been separated by more than one space:
"you have won the lottery" %>% # two spaces between 'the' and 'lottery'
str_split(pattern = " ") %>%
unlist()
## [1] "you" "have" "won" "the" "" "lottery"
We’ll address this issue soon.
In order to split a string into its constituent characters, split on the string with no characters:
## [1] "a" "a" "r" "d" "v" "a" "r" "k"
This would be useful if you wanted to, say, count the number of occurrences of “a” in a word:
## [1] 3
But stringr has a function to count occurrences for you:
str_count("aardvark", pattern = "a")
## [1] 3
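The two approaches can be compared side by side. This is a sketch assuming stringr is attached; summing a logical vector is one of several ways to do the manual count:

```r
library(stringr)

# Split on the empty string to get one element per character
chars <- unlist(str_split("aardvark", pattern = ""))
chars

# Count by hand: TRUE counts as 1 when summed
sum(chars == "a")

# The same count from stringr directly
str_count("aardvark", pattern = "a")
```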
11.3.6 Pasting and Joining Strings
We are already familiar with paste(), which allows us to paste together the arguments that are passed to it:
paste("Mary", "Poppins")
## [1] "Mary Poppins"
By default paste() separates the input strings with a space, but you can control this with the sep parameter:
paste("Mary", "Poppins", sep = ":")
## [1] "Mary:Poppins"
paste("Yabba","dabba","doo!", sep = "")
## [1] "Yabbadabbadoo!"
If you want the separator to be the empty string by default, then you could use paste0():
paste0("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"
The stringr version of paste0() is str_c():
str_c("Yabba","dabba","doo!")
## [1] "Yabbadabbadoo!"
What if you had a character-vector whose elements you wanted to paste together? For example, consider:
poppins <- c(
"practically", "perfect",
"in", "every", "way"
)
Now suppose you want to paste the elements of poppins together into one string where the words are separated by spaces. str_c() will do the job for you, if you use its collapse parameter:
## [1] "practically perfect in every way"
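A minimal sketch of such a call, assuming stringr is attached (the name joined is illustrative):

```r
library(stringr)

poppins <- c("practically", "perfect", "in", "every", "way")

# collapse = " " glues the elements of one vector into a single string
joined <- str_c(poppins, collapse = " ")
joined
```

Note the difference from sep, which goes between separate arguments, whereas collapse goes between the elements of a single vector.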
We’ll call this process joining.
In an atomic vector all of the elements have to be of the same data type (all character, all numerical, etc.). What if you want to join objects of different types? If there are only a few, feel free to type them in as separate arguments to str_c():
str_c("March", 14, 1963, sep = " ")
## [1] "March 14 1963"
If the objects are many, then you could arrange for them to appear as the elements of a list:
## [1] "Mary 343 Poppins FALSE"
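One cautious way to join the elements of a list is to convert each element to a string explicitly before collapsing. The sketch below assumes stringr is attached and that every list element has length 1; it is not necessarily the call that produced the output above:

```r
library(stringr)

items <- list("Mary", 343, "Poppins", FALSE)

# Convert each element to character, giving an ordinary character vector
as_strings <- sapply(items, as.character)

str_c(as_strings, collapse = " ")
```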
Joining appears to be the opposite of splitting, but in R that’s not quite so. Suppose, for instance, that you have dates where the month, day and year are separated by hyphens and you want to replace the hyphens with forward slashes:
3-14-1963 # you have this
3/14/1963 # you want this
You could try this:
## [1] "c(\"3\", \"14\", \"1963\")"
That’s not what we want. We have to remember that the result of applying str_split() is a list:
## [[1]]
## [1] "3" "14" "1963"
We need to unlist prior to the join. The correct procedure is:
## [1] "3/14/1963"
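The split, unlist and join steps can be sketched as follows, assuming stringr is attached (the name pieces is illustrative):

```r
library(stringr)

pieces <- str_split("3-14-1963", pattern = "-")
is.list(pieces)     # str_split() hands back a list, even for one string

# Flatten to a character vector, then join with forward slashes
str_c(unlist(pieces), collapse = "/")
```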
Now all is well. However, in a subsequent chapter we’ll learn a superior method for performing substitutions in strings.
11.4 Application: Base Conversion
The usual way to represent numbers is in base 10: every digit in the representation of the number stands for a multiple of a power of 10, and these multiples are summed to make the number itself.
For example, consider the number represented as 1046.
- The right-most digit, the 6 in the so-called “ones” place, represents \(6 \times 10^0 = 6\).
- The digit to its left, the 4 that is in the “tens” place, represents \(4 \times 10^1 = 40\).
- The 0 in the “hundreds” place represents \(0 \times 10^2 = 0\).
- The 1 in the “thousands” place represents \(1 \times 10^3 = 1000\).
- The sum gives us the number itself:
\[1 \times 10^3 + 0 \times 10^2 + 4 \times 10^1 + 6 \times 10^0 = 1046.\]
In base-10, the only allowed digits are 0, 1, …, 9, representing respectively the numbers 0, 1, …, 9. In base-10, no digit is allowed to represent a number larger than \(10 - 1 = 9\).
Other bases work the same way. For example, in base-5 the allowed digits are 0, 1, 2, 3, and 4. The number represented in base-5 as 2143 is:
\[2 \times 5^3 + 1 \times 5^2 + 4 \times 5^1 + 3 \times 5^0 = 298.\]
Bases higher than 10 are quite possible. A common such base is base-16, known as hexadecimal. Its digits are: 0, 1, …, 9, a, b, c, d, e, and f.
The number represented in base-16 by 321 is:
\[3 \times 16^2 + 2 \times 16^1 + 1 \times 16^0 = 801.\]
The number represented by 88fb is:
8 * 16^3 + 8 * 16^2 + 15 * 16^1 + 11 * 16^0
## [1] 35067
This is because the digit f represents 15 and the digit b represents 11.
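The positional sums above can be checked directly in R. The sketch below also cross-checks them with strtoi(), a base-R function that converts a string of digits in a given base (from 2 to 36) to an integer; it is used here only as an independent check:

```r
# Digits times descending powers of the base, summed
sum(c(2, 1, 4, 3) * 5^(3:0))         # base-5 numeral 2143
sum(c(3, 2, 1) * 16^(2:0))           # base-16 numeral 321
sum(c(8, 8, 15, 11) * 16^(3:0))      # base-16 numeral 88fb

# The same values from base R's strtoi()
strtoi("2143", base = 5)
strtoi("321", base = 16)
strtoi("88fb", base = 16)
```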
Because bases higher than 10 require some characters other than 0 through 9 to represent their digits, we will use strings to represent numbers.
We could write a function to find the value of a number represented in base-16:
from_16 <- function(str) {
values <- 0:15
names(values) <- c(0:9, letters[1:6])
digits <- str_split(str, pattern = "") %>% unlist()
powers <- rev(0:(str_length(str) - 1))
sum(values[digits] * 16^powers)
}
Applying this to the base-16 number 321, we get:
from_16("321")
## [1] 801
We also get:
from_16("88fb")
## [1] 35067
We can generalize this function to arbitrary bases. For simplicity, let’s only work with base 16 or lower:
from_base_rep <- function(n, base) {
if (base > 16) stop("use base 16 or less")
digit_list <- c(0:9, letters[1:6])[1:base]
values <- 0:(base - 1)
names(values) <- digit_list
digits <- str_split(n, pattern = "") %>% unlist()
powers <- rev(0:(str_length(n) - 1))
sum(values[digits] * base^powers)
}
The value of the base-5 number 2143 is found as:
from_base_rep("2143", 5)
## [1] 298
Going the other way—from a number to its representation in a given base—is a bit more involved, but we can get the idea if we work with an example.
Suppose that we wish to represent the number 298 (or, to be precise, the number represented in base-10 by 298) in base-5.
We begin by finding the ones place. We do this by dividing 298 by 5, getting the quotient and the remainder. The quotient is:
quot <- 298 %/% 5
quot
## [1] 59
There are 59 fives in 298. The remainder is:
rem <- 298 %% 5
rem
## [1] 3
The remainder is our immediate concern, as it tells us that 3 should go into the ones place.
Now we think about what goes into the fives place. Well, there is \(298 - 3 = 295\) left to represent. But since we took away the remainder from 298, we know that what is left must be a multiple of 5, and in fact it is:
295 / 5
## [1] 59
This is the same as the quotient we found earlier, so there are 59 fives left to account for. Dividing 59 by 5, we get a new quotient and a new remainder:
quot <- 59 %/% 5
quot
## [1] 11
rem <- 59 %% 5
rem
## [1] 4
That’s 4 fives, and 11 25s. Putting a 4 in the fives place will account for the remainder of 4, leaving 11 25s to be taken care of by the 25s place and higher.
So far, our base-5 representation is 43.
We are dealing with 11 25s. Notice that 11 is the quotient when you divide 55 (that is, 59 less the remainder of 4) by 5.
What should we put in the 25s place? Since we have 11 25s, it is tempting to put 4 of them (the highest number allowed) in the 25s place. The problem is that this does not leave a whole multiple of 125, the value of the next place: 7 25s would remain. But if we follow the pattern of the previous digits, and put in the remainder after dividing 11 by 5, then we put 1 in the 25s place and are left with 10 25s, which is the same as 2 125s. This would give us the representation 2143.
Summarizing:
- 298 divided by 5 gives a quotient of 59 and a remainder of 3. The first (right-most) digit of the base-5 representation was set to 3.
- 59 divided by 5 gives a quotient of 11 and a remainder of 4. The second digit was set to 4.
- 11 divided by 5 gives a quotient of 2 and a remainder of 1. The third digit was set to 1.
- 2 divided by 5 gives a quotient of 0 and a remainder of 2. The fourth digit was set to 2. No further division is possible.
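The repeated divisions summarized above can be traced with a short loop. This is only an illustrative sketch in base R (the names curr and remainders are made up here); it keeps dividing until the quotient reaches 0:

```r
curr <- 298
remainders <- c()

while (curr > 0) {
  # Report the quotient and remainder at this step
  cat(curr, "divided by 5: quotient", curr %/% 5,
      "remainder", curr %% 5, "\n")
  remainders <- c(remainders, curr %% 5)
  curr <- curr %/% 5
}

# The remainders were found ones-place first, so reverse before joining
paste(rev(remainders), collapse = "")
```

The loop prints one line per division, and the final expression assembles the digits into the base-5 numeral.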
The pattern seems to be: as long as your new quotient is at least 5, divide by 5 and put the remainder as the next digit. When the new quotient is less than 5, just put it as the final digit.
We can code this with a while-loop:
to_5 <- function(n) {
digits <- 0:4
numeral <- ""
curr <- n
while (curr >= 5) {
quot <- curr %/% 5
rem <- curr %% 5
numeral <- str_c(digits[rem + 1], numeral, sep = "")
curr <- quot
}
str_c(digits[curr + 1], numeral, sep = "")
}
Try it out:
to_5(298)
## [1] "2143"
We can generalize to any base. For convenience we will deal only with bases 36 and lower, so that our digits can be drawn from 0 through 9 and the letters a through z:
to_base <- function(b, n) {
if (b > 36) stop("choose a lower base")
digits <- c(0:9, letters)[1:b]
numeral <- ""
curr <- n
while (curr >= b) {
quot <- curr %/% b
rem <- curr %% b
numeral <- str_c(digits[rem + 1], numeral, sep = "")
curr <- quot
}
str_c(digits[curr + 1], numeral, sep = "")
}
Try it out:
## 35067 to hexadecimal:
to_base(16, 35067)
## [1] "88fb"
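For comparison, the same conversion can be written recursively rather than with a while-loop. The helper to_base_rec() below is hypothetical (it is not part of the text, and it takes the number first and the base second); the sketch uses base R only and checks its result with strtoi():

```r
to_base_rec <- function(n, b) {
  digits <- c(0:9, letters)      # the characters "0".."9", "a".."z"
  # Base case: a number smaller than the base is a single digit
  if (n < b) return(digits[n + 1])
  # Otherwise: convert the quotient, then append the digit for the remainder
  paste0(to_base_rec(n %/% b, b), digits[n %% b + 1])
}

to_base_rec(298, 5)
to_base_rec(35067, 16)

# Round trip: converting back should recover the original number
strtoi(to_base_rec(35067, 16), base = 16)
```

The recursion mirrors the division pattern: the remainder supplies the right-most digit and the quotient supplies everything to its left.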
11.5 Formatted Printing
Quite often when we are printing out to the console we want each line to follow some uniform format. This can be accomplished with the sprintf() function. Let’s begin with an example:
first <- "Mary"
last <- "Poppins"
sprintf(fmt = "%10s%20s", first, last)
## [1] "      Mary             Poppins"
sprintf() builds a string from the strings first and last that were passed to it. The fmt parameter is a string that encodes the format of the result. In this example, the command comes down to:
- create a string of width 10, consisting of six spaces followed by the four characters of “Mary”;
- create a string of width 20, consisting of 13 spaces followed by the seven characters of “Poppins”;
- join the two strings above, which are called fields, with nothing between them.
Here is the result, cat-ed out:
##       Mary             Poppins
The “s” in the fmt argument is called a conversion character. It tells sprintf() to expect a string. Each percent sign indicates the beginning of a new field. For each field, the desired field-width should appear between the percent-sign and the conversion character for the field.
In the output above, the names are right-justified, meaning that they appear at the end of their respective fields. If you want a field to be left-justified, insert a hyphen immediately after the percent sign, before the field-width, like so:
## Mary      Poppins
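The two justifications can be compared with a sketch like the following, which uses base R only (the names right and left are illustrative):

```r
first <- "Mary"
last <- "Poppins"

right <- sprintf("%10s%20s", first, last)      # right-justified fields
left  <- sprintf("%-10s%-20s", first, last)    # left-justified fields

cat(right, "\n")
cat(left, "\n")

# Either way the result is 10 + 20 = 30 characters wide
nchar(c(right, left))
```

In the left-justified version the padding spaces come after each name, so the trailing spaces of the last field are invisible on the console.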
Other common conversion characters are:
- d: an integer;
- f: a decimal number (by default, 6 digits after the decimal point);
- g: a decimal number where the default precision is determined by the number of significant figures in the given number.
Here is another example:
## Mary 1955 3.200000
The following example is the same as above, except that we retain only the significant figures in the 3.2:
## Mary 1955 3.2
When you are creating a field for a decimal number, you can specify both the total field-width and the precision together if you separate them with a period (.). Thus, if you want the number 234.5647 to appear right-justified in a field of width 10, showing only three decimal places, then try:
## Mary 1955 234.565
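The numeric conversion characters can be tried one at a time. The calls below are a sketch in base R and use no field-widths except in the last line, so they are not the exact calls behind the outputs above:

```r
sprintf("%d", 1955L)           # d: an integer
sprintf("%f", 3.2)             # f: six digits after the decimal point
sprintf("%g", 3.2)             # g: only the significant figures
sprintf("%10.3f", 234.5647)    # width 10, three digits after the point
```

In the last call the seven characters of 234.565 are padded on the left with three spaces to fill the field of width 10.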
sprintf() comes in handy when you want your output to appear in nicely-aligned, tabular fashion. Consider this example:
# information for three people:
firstName <- c("Donald", "Gina", "Rohini")
lastName <- c("Duck", "Gentorious", "Lancaster")
age <- c(17, 19, 20)
gpa <- c(3.7, 3.9, 3.823)
for (i in 1:3) {
sprintf(
"%-15s%-20s%-5d%-5.2f\n",
firstName[i], lastName[i],
age[i], gpa[i]
) %>%
cat()
}
## Donald         Duck                17   3.70
## Gina           Gentorious          19   3.90
## Rohini         Lancaster           20   3.82
Note the use of “\n” in the fmt argument to ensure that the output appears on separate lines.
You could take advantage of vectorization to avoid the loop:
## Donald         Duck                17   3.70
##  Gina           Gentorious          19   3.90
##  Rohini         Lancaster           20   3.82
Well, that’s not quite right: the second and third lines begin with a space. This happens because cat() separates its input with a space by default. You can prevent this, however, with the sep parameter of cat():
## Donald         Duck                17   3.70
## Gina           Gentorious          19   3.90
## Rohini         Lancaster           20   3.82
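A vectorized version along these lines produces the aligned table without a loop. It is a sketch in base R; the name rows is illustrative:

```r
# Information for three people
firstName <- c("Donald", "Gina", "Rohini")
lastName <- c("Duck", "Gentorious", "Lancaster")
age <- c(17, 19, 20)
gpa <- c(3.7, 3.9, 3.823)

# sprintf() is vectorized: one formatted string per person
rows <- sprintf("%-15s%-20s%-5d%-5.2f\n", firstName, lastName, age, gpa)

# sep = "" keeps cat() from putting a space before the later rows
cat(rows, sep = "")
```

Each element of rows is 46 characters long: the four field-widths (15, 20, 5 and 5) plus the newline.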
Glossary
- String: A sequence of characters.
- Control Character: A member of a character set that does not represent a written symbol.
- Unicode: A computing-industry standard for the consistent encoding of text in most of the world’s written languages.
Exercises
- Write a function called revStr() that reverses the characters of any string that it is given. The function should take a single parameter:
  - str: a character-vector of length 1 (a single string).
Typical examples of use should look like this:
revStr(str = "goodbye")
## [1] "eybdoog"
Hint: Let’s think about how to solve the reversal problem for a specific string, e.g.:
str <- "goodbye"
First, we could turn the string into a list whose only element is the vector of the characters of the string, as follows:
splitString <- str_split(str, pattern = "")
splitString
## [[1]]
## [1] "g" "o" "o" "d" "b" "y" "e"
This could be turned into just the desired vector with the unlist() function:
unlist(splitString)
## [1] "g" "o" "o" "d" "b" "y" "e"
Next, recall that R has a function rev() that, when given a vector, returns a vector with the elements in reverse order:
rev(unlist(splitString))
## [1] "e" "y" "b" "d" "o" "o" "g"
Finally, we would need to convert the reversed vector back into a single string. You have learned a stringr function that will accomplish this.
After you have solved the problem for the specific vector str, encapsulate your work into the function revStr().
- A string is said to be a palindrome if it is the same no matter whether it is spelled backwards or forwards. Write a function called palindromeStr() that determines whether or not a given string is a palindrome. The function should take a single parameter:
  - str: a character-vector of length 1 (a single string).
It should return TRUE if str is a palindrome, and return FALSE otherwise. Typical examples of use should look like this:
palindromeStr(str = "abba")
## [1] TRUE
palindromeStr("hello")
## [1] FALSE
Hint: Again, you should begin by solving the problem on a specific vector, and only then encapsulate your work into a function. To solve the specific problem, you might use the function revStr() from the previous problem. Another possibility is to use rev() along with the function all() that you met in Chapter 2.
- Write a function called subStrings() that returns a vector of the substrings of a given string that have at least a given number of characters. The function should take two arguments:
  - str: a character-vector of length 1 (a single string);
  - n: the minimum number of characters a substring should have in order to be included in the vector.
Validate the input: if the argument for n is less than 1 or greater than the number of characters in str, then the function should advise the user and cease execution. Typical examples of use should look like this (although it is OK if your output-vector contains the sub-strings in a different order):
subStrings("hello", 3)
## [1] "hello" "hell" "ello" "hel" "ell" "llo"
subStrings("hello", 6)
## n should be at least 1 and no more than the number
## of characters in str.
Hint: Begin by writing a function, called perhaps subStringFixed(), that when given a string and a specific number \(n\), returns all of the substrings of length exactly \(n\). It might work like this:
subStringFixed(str = "yabbadabbadoo!", n = 6)
## [1] "yabbad" "abbada" "bbadab" "badabb" "adabba" "dabbad" "abbado" "bbadoo" "badoo!"
subStringFixed(str = "yabbadabbadoo!", n = 0)
## n should be at least 1 and no more than the number
## of characters in str.
- Write a function called subPalindrome() that, for any given string and specified number \(n\), returns a character vector of all the substrings of the string having at least \(n\) characters that are also palindromes. The function should take two arguments:
  - str: a character-vector of length 1 (a single string);
  - n: the minimum number of characters a substring should have in order to be included in the vector.
Validate the input: if the argument for n is less than 1 or greater than the number of characters in str, then the function should advise the user and cease execution. Typical examples of use should look like this (although it is OK if your output-vector contains the palindromes in a different order):
## Note that palindrome substrings are repeated as many times
## as they occur in the given string:
subPalindrome("yabbadabbadoo!", 2)
## [1] "abbadabba" "bbadabb" "dabbad" "badab" "abba" "abba" "ada"
## [8] "bb" "bb" "oo"
subPalindrome("yabbadabbadoo!", 10)
## character(0)
subPalindrome("yabbadabbadoo!", 0)
## n should be at least 1 and no more than the number
## of characters in str.
- Write a function called m111Report() that performs formatted printing from the data frame m111survey in the bcscr package. Given a vector of row numbers, the function will print out the sex, feeling about weight, and GPA of the corresponding individuals. Thus each row in the printout will correspond to an individual in the study. Each row will consist of three fields:
  - The first field is 10 characters wide, and contains either “male” or “female”, followed by the appropriate number of spaces.
  - The second field is 15 characters wide, and contains either “underweight” or “about right” or “overweight”, followed by the appropriate number of spaces.
  - The third field is 5 characters wide, and contains an appropriate number of spaces followed by the grade-point average showing only the first two decimal places. Thus, if a person’s GPA is recorded as 2.714 then the field will be “ 2.71”. (Note that, with the space and the decimal point, the total number of characters is 5, as required.)
A typical example of use is as follows:
m111Report(c(2, 10, 15))
## male      about right     2.50
## female    overweight        NA
## male      underweight     3.20
Note that you will have to re-code the feelings about weight.