7 Data Frames
Can one be a good data analyst without being a half-good programmer? The short answer to that is, ‘No.’ The long answer to that is, ‘No.’
-- Frank Harrell
Up to this point we have given a great deal of attention to vectors, and we have always treated them as one-dimensional objects: a vector has a length, but not a “width.”
It is time to begin working in two dimensions. In this chapter we will study matrices, which are simply vectors that have both length and width. Matrices are immensely useful for scientific computation in R, but for the most part we will treat them as a warm-up for data frames: the two-dimensional R objects that are especially designed for the storage of data collected in the course of practical data analysis. Once you understand how to construct and manipulate data frames, you will be ready to learn how to visualize and analyze data using R.
7.1 Introduction to Matrices
In R, a matrix is actually an atomic vector—it can only hold one type of element—but with two extra attributes:
 a certain number of rows, and
 a certain number of columns.
One way to create a matrix is to take a vector and give it those two extra attributes, via the matrix() function. Here is an example:
numbers <- 1:24 # this is an ordinary atomic vector
numbersMat <- matrix(numbers, nrow = 6, ncol = 4) # make a matrix
numbersMat
## [,1] [,2] [,3] [,4]
## [1,] 1 7 13 19
## [2,] 2 8 14 20
## [3,] 3 9 15 21
## [4,] 4 10 16 22
## [5,] 5 11 17 23
## [6,] 6 12 18 24
Of course if you are making a matrix out of 24 numbers and you know that it’s going to have 6 rows, then you know it must have 4 columns. Similarly, if you know the number of columns then the number of rows is determined. Hence you could have constructed the matrix with just one of the row or column arguments, like this:
numbersMat <- matrix(numbers, nrow = 6)
Notice that the numbers went down the first column, then down the second, and so on. If you would rather fill up the matrix row-by-row, then set the byrow parameter, which is FALSE by default, to TRUE:
matrix(numbers, nrow = 6, byrow = TRUE)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [5,] 17 18 19 20
## [6,] 21 22 23 24
Sometimes we like to give names to our rows, or to our columns, or even to both:
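A minimal sketch of how such names can be attached, using the rownames() and colnames() functions together with R's built-in letters and LETTERS vectors (the particular names are chosen to match the output that follows):

```r
numbersMat <- matrix(1:24, nrow = 6)   # same matrix as above
rownames(numbersMat) <- letters[1:6]   # "a" through "f"
colnames(numbersMat) <- LETTERS[1:4]   # "A" through "D"
numbersMat
```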
## A B C D
## a 1 7 13 19
## b 2 8 14 20
## c 3 9 15 21
## d 4 10 16 22
## e 5 11 17 23
## f 6 12 18 24
Matrices don’t have to be numerical. They can be character or logical matrices as well:
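For instance, a character matrix like the one printed next could be built as follows (the vector name creatures is our own choice for illustration):

```r
creatures <- c("Dorothy", "Lion", "Scarecrow", "Oz", "Toto", "Boq")
matrix(creatures, ncol = 2)   # fills down the columns by default
```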
## [,1] [,2]
## [1,] "Dorothy" "Oz"
## [2,] "Lion" "Toto"
## [3,] "Scarecrow" "Boq"
If you have to spread out the elements of a matrix into a one-dimensional vector, you can do so:
as.vector(numbersMat)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
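The claim that a matrix is just an atomic vector with extra attributes can be checked directly. The following sketch (the object name v is illustrative) gives a plain vector a dim attribute by hand and confirms that R then treats it as a matrix:

```r
v <- 1:6
is.matrix(v)          # FALSE: no dimensions yet
dim(v) <- c(2, 3)     # assign 2 rows and 3 columns
is.matrix(v)          # TRUE
attributes(v)         # the dim attribute holds the row and column counts
length(v)             # still 6: the underlying vector is unchanged
```

This is essentially the bookkeeping that matrix() takes care of for you.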
7.1.1 Practice Exercises
Let’s work with the following vector:
dozen <- letters[1:12]

Starting with dozen, write a command that produces the following matrix:
## [,1] [,2] [,3] [,4]
## [1,] "a" "d" "g" "j"
## [2,] "b" "e" "h" "k"
## [3,] "c" "f" "i" "l"

Starting with dozen, write a command that produces the following matrix:
## [,1] [,2] [,3]
## [1,] "a" "e" "i"
## [2,] "b" "f" "j"
## [3,] "c" "g" "k"
## [4,] "d" "h" "l"

Starting with dozen, write a command that produces the following matrix:
## [,1] [,2] [,3]
## [1,] "a" "b" "c"
## [2,] "d" "e" "f"
## [3,] "g" "h" "i"
## [4,] "j" "k" "l"

Starting with dozen, write commands that produce the following matrix:
## c1 c2 c3
## r1 "a" "b" "c"
## r2 "d" "e" "f"
## r3 "g" "h" "i"
## r4 "j" "k" "l"

Suppose you make the following matrix, called smallMat:
## [,1] [,2]
## [1,] 8 3
## [2,] 5 4
What’s a one-line command to get the following vector from smallMat?
## [1] 8 5 3 4

nrow() is a function that, when given a matrix, will tell you the number of rows in that matrix. Write a one-line command to find the number of rows in a matrix called mysteryMat.

ncol() is a function that, when given a matrix, will tell you the number of columns in that matrix. Write a one-line command to find the number of columns in a matrix called mysteryMat.
7.1.2 Solutions to Practice Exercises

Here’s one way to do it:
matrix(dozen, nrow = 3)
Here’s another way:
matrix(dozen, ncol = 4)

Here’s one way to do it:
matrix(dozen, nrow = 4)

Here’s one way to do it:
matrix(dozen, nrow = 4, byrow = TRUE)

Here’s one way to do it:
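A possible solution, sketched under the assumption that the names are attached after the matrix is built (the object name namedMat is our own):

```r
dozen <- letters[1:12]
namedMat <- matrix(dozen, nrow = 4, byrow = TRUE)
rownames(namedMat) <- c("r1", "r2", "r3", "r4")
colnames(namedMat) <- c("c1", "c2", "c3")
namedMat
```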

Here’s how:
as.vector(smallMat)
The command nrow(mysteryMat) will work.

The command ncol(mysteryMat) will work.
7.2 Matrix Indexing
Matrices are incredibly useful in data analysis, but the primary reason we are talking about them now is to get you used to working in two dimensions. Let’s practice subsetting with matrices.
We use the subsetting operator [ to pick out parts of a matrix. For example, in order to get the element in the second row and third column of numbersMat, ask for:
numbersMat[2,3]
## [1] 14
The row and column numbers are called indices.
If we want the entire second row, then we could ask for:
numbersMat[2,1:4]
## A B C D
## 2 8 14 20
The result is a one-dimensional vector consisting of the elements in the second row of numbersMat. It inherits as its names the column names of numbersMat.
Actually, if you want the entire row you don’t have to specify which columns you want. Just leave the spot after the comma empty, like this:
numbersMat[2, ]
## A B C D
## 2 8 14 20
What if you want some items on the second row, but only the items in columns 1, 2 and 4? Then frame your request in terms of a vector of column indices:
numbersMat[2, c(1, 2, 4)]
## A B D
## 2 8 20
You can specify a vector of row indices along with a vector of column indices, if you like:
numbersMat[1:2, 1:3]
## A B C
## a 1 7 13
## b 2 8 14
If the matrix has row or column names then you may use them in place of indices to make a selection:
numbersMat[, c("B", "D")]
## B D
## a 7 19
## b 8 20
## c 9 21
## d 10 22
## e 11 23
## f 12 24
You can use subsetting to change the values of the elements of a matrix:
numbersMat[2,3] <- 0
numbersMat
## A B C D
## a 1 7 13 19
## b 2 8 0 20
## c 3 9 15 21
## d 4 10 16 22
## e 5 11 17 23
## f 6 12 18 24
You can assign a value to an entire row:
numbersMat[2,] <- 0
numbersMat
## A B C D
## a 1 7 13 19
## b 0 0 0 0
## c 3 9 15 21
## d 4 10 16 22
## e 5 11 17 23
## f 6 12 18 24
In the code above, the 0 was “recycled” into each of the four elements of the second row.
You can assign the elements of a vector to corresponding selected elements of a matrix:
numbersMat[2,] <- c(100, 200, 300, 400)
numbersMat
## A B C D
## a 1 7 13 19
## b 100 200 300 400
## c 3 9 15 21
## d 4 10 16 22
## e 5 11 17 23
## f 6 12 18 24
7.2.1 To Drop or Not?
Note that when we asked for a single row of numbersMat we got a regular one-dimensional vector:
numbersMat[3, ]
## A B C D
## 3 9 15 21
The same thing happens if we ask for a single column:
numbersMat[ , 2]
## a b c d e f
## 7 200 9 10 11 12
We get the second column of numbersMat, but as a regular vector. It’s not a “column” anymore. (Note that it inherits the row names from numbersMat.)
When a subset of a matrix comes from only one row or column, R takes the opportunity to “drop” the class of the subset from “matrix” to “vector.” If you would like the subset to stay a matrix, set the drop parameter, which by default is TRUE, to FALSE. Thus the second column of numbersMat, kept as a matrix with six rows and one column, is found as follows:
numbersMat[ , 2, drop = FALSE]
## B
## a 7
## b 200
## c 9
## d 10
## e 11
## f 12
In most applications people want the simpler vector structure, so they usually leave drop at its default value.
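The difference between the two settings can be seen by inspecting the results. A small sketch, using an illustrative matrix m rather than numbersMat:

```r
m <- matrix(1:6, nrow = 3)
dropped <- m[ , 2]                # default: drop = TRUE
kept <- m[ , 2, drop = FALSE]
is.matrix(dropped)                # FALSE: an ordinary vector
dim(dropped)                      # NULL: vectors have no dimensions
is.matrix(kept)                   # TRUE
dim(kept)                         # 3 rows, 1 column
```

Keeping the matrix structure tends to matter most inside functions that expect two-dimensional input no matter how many columns were selected.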
7.2.2 Practice Exercises
In these exercises we’ll work with the following matrix:
numbers <- 1:40
practiceMatrix <- matrix(numbers, nrow = 4)
rownames(practiceMatrix) <- letters[1:4]
colnames(practiceMatrix) <- LETTERS[1:10]
practiceMatrix
## A B C D E F G H I J
## a 1 5 9 13 17 21 25 29 33 37
## b 2 6 10 14 18 22 26 30 34 38
## c 3 7 11 15 19 23 27 31 35 39
## d 4 8 12 16 20 24 28 32 36 40

Write two different one-line commands to get this matrix:
## B C D E
## a 5 9 13 17
## c 7 11 15 19

Write a one-line command to get this matrix:
## A C E G I
## a 1 9 17 25 33
## b 2 10 18 26 34
## c 3 11 19 27 35
## d 4 12 20 28 36

Write a one-line command to get this vector:
## a b c d
## 1 2 3 4

Write a one-line command to get this vector:
## A B C D E F G H I J
## 2 6 10 14 18 22 26 30 34 38

Write a one-line command to get this matrix:
## A
## a 1
## b 2
## c 3
## d 4

Write a convenient one-line command to get this matrix:
## A B C D E F G H I
## a 1 5 9 13 17 21 25 29 33
## b 2 6 10 14 18 22 26 30 34
## c 3 7 11 15 19 23 27 31 35
## d 4 8 12 16 20 24 28 32 36

Write a convenient one-line command to get this matrix:
## A C D E F G H I
## a 1 9 13 17 21 25 29 33
## b 2 10 14 18 22 26 30 34
## c 3 11 15 19 23 27 31 35
## d 4 12 16 20 24 28 32 36

Write a function called myRowSums() that will find the sums of the rows of any given matrix. The function should use a for loop (see the Chapter on Flow Control). The function should take a single parameter called mat, the matrix whose rows the user wishes to sum. It should work like this:
myMatrix <- matrix(1:24, ncol = 6)
myRowSums(mat = myMatrix)
## [1] 66 72 78 84
7.2.3 Solutions to Practice Exercises

Here are two ways:
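Two possibilities are sketched below, one using numerical indices and one using the row and column names:

```r
practiceMatrix <- matrix(1:40, nrow = 4)
rownames(practiceMatrix) <- letters[1:4]
colnames(practiceMatrix) <- LETTERS[1:10]
practiceMatrix[c(1, 3), 2:5]                          # by position
practiceMatrix[c("a", "c"), c("B", "C", "D", "E")]    # by name
```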

Here’s one way:
practiceMatrix[ , seq(1, 9, by = 2)]

Here’s one way:
practiceMatrix[ , 1]

Here’s one way:
practiceMatrix[2, ]

Here’s one way:
practiceMatrix[ , 1, drop = FALSE]

Here’s one way:
practiceMatrix[ , -10]

Here’s one way:
practiceMatrix[ , -c(2, 10)]

Here is one way to write the function:
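A sketch of one possible solution (the names n, sums and i inside the function are our own choices):

```r
myRowSums <- function(mat) {
  n <- nrow(mat)
  sums <- numeric(n)              # one slot per row
  for (i in seq_len(n)) {
    sums[i] <- sum(mat[i, ])      # add up the elements of row i
  }
  sums
}

myMatrix <- matrix(1:24, ncol = 6)
myRowSums(mat = myMatrix)
```

In practice R's built-in rowSums() does the same job; the loop is written out here only because the exercise asks for one.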
7.3 Operations on Matrices
Matrices can be involved in arithmetical and logical operations.
7.3.1 Arithmetical Operations
The usual arithmetic operations apply to matrices, operating elementwise. For example, suppose that we have:
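For instance, two matrices such as the following would do. The specific entries are an assumption, chosen only so that their sum agrees with the result printed below:

```r
mat1 <- matrix(rep(1, 4), nrow = 2)   # a 2-by-2 matrix of 1's
mat2 <- matrix(rep(2, 4), nrow = 2)   # a 2-by-2 matrix of 2's
mat1
mat2
```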
To get the sum of the above two matrices, R adds their corresponding elements and forms a new matrix out of their sums, thus:
mat1 + mat2
## [,1] [,2]
## [1,] 3 3
## [2,] 3 3
R applies recycling as needed. For example, suppose we have:
mat < matrix(1:4, nrow = 2)
mat
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
In order to multiply each element of mat by 2, we need not create a 2-by-2 matrix of 2’s. We can simply multiply by 2, and R will take care of recycling the 2:
2 * mat
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
Or we could subtract 3 from each element of mat:
mat - 3
## [,1] [,2]
## [1,] -2 0
## [2,] -1 1
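Recycling also applies when the shorter operand has more than one element. In the sketch below (our own illustration), a vector of length 2 is recycled down the columns of mat, so each row ends up multiplied by a different number:

```r
mat <- matrix(1:4, nrow = 2)
mat * c(10, 100)   # row 1 is multiplied by 10, row 2 by 100
```

The recycled vector is matched against the elements of the matrix in column order, which is why the effect here runs along rows.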
7.3.2 Matrix Multiplication
This section is optional reading, but it may interest you if you know about matrix multiplication in linear algebra.
In order to accomplish matrix multiplication, we have to keep in mind that the regular multiplication operator * works elementwise on matrices, as we have already seen. For matrix multiplication R provides the special operator %*%. For example, consider the following matrices:
a <- matrix(1:6, ncol = 3)
a
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
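The second matrix, b, has a single column. A definition consistent with the printout that follows and with the product computed below (the entries 2, 1 and -1 are inferred from those results) would be:

```r
b <- matrix(c(2, 1, -1), nrow = 3)
b
```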
## [,1]
## [1,] 2
## [2,] 1
## [3,] -1
Observe that the number of columns of a is equal to the number of rows of b. Hence it is possible to form the matrix product a %*% b:
a %*% b
## [,1]
## [1,] 0
## [2,] 2
As expected, the result is a matrix having as many rows as the rows of a and as many columns as the columns of b.
It is also interesting to recall how matrix multiplication works when the second matrix has only one column. The product is obtained by multiplying each column of a by the element on the corresponding row of b, and adding the resulting matrices:
b[1,1]*a[ ,1, drop = FALSE] + b[2,1]*a[ ,2, drop = FALSE] + b[3,1]*a[ ,3, drop = FALSE]
## [,1]
## [1,] 0
## [2,] 2
7.3.3 Logical Operations
Boolean operations apply to matrices elementwise, just as they do to ordinary vectors. The result is a matrix of logical values. For example, consider the original matrix numbersMat:
numbersMat <- matrix(1:24, nrow = 6)
Suppose we wish to determine which elements of numbersMat are odd. Then we simply ask whether the remainder of an element after division by 2 is equal to 1:
numbersMat %% 2 == 1
## [,1] [,2] [,3] [,4]
## [1,] TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE
## [3,] TRUE TRUE TRUE TRUE
## [4,] FALSE FALSE FALSE FALSE
## [5,] TRUE TRUE TRUE TRUE
## [6,] FALSE FALSE FALSE FALSE
We can select elements from a matrix using a Boolean operator, too:
numbersMat[numbersMat %% 2 == 1]
## [1] 1 3 5 7 9 11 13 15 17 19 21 23
Note that the result is an ordinary, one-dimensional vector.
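Because TRUE counts as 1 and FALSE as 0 in arithmetic, a logical matrix can also be used for counting, and which() can report where the TRUE cells are. A brief sketch (the name isOdd is our own):

```r
numbersMat <- matrix(1:24, nrow = 6)
isOdd <- numbersMat %% 2 == 1
sum(isOdd)                              # number of odd elements
head(which(isOdd, arr.ind = TRUE), 3)   # row and column of the first three odd elements
```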
7.3.4 Practice Exercises
We’ll work with the following three matrices:
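The first matrix, a, can be created as follows (a sketch: the call is inferred from the printout that comes next):

```r
a <- matrix(c(7, 4, 9, 10), nrow = 2)
a
```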
## [,1] [,2]
## [1,] 7 9
## [2,] 4 10
b <- matrix(1:4, nrow = 2)
b
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
c <- matrix(letters[1:24], nrow = 6, byrow = TRUE)
c
## [,1] [,2] [,3] [,4]
## [1,] "a" "b" "c" "d"
## [2,] "e" "f" "g" "h"
## [3,] "i" "j" "k" "l"
## [4,] "m" "n" "o" "p"
## [5,] "q" "r" "s" "t"
## [6,] "u" "v" "w" "x"

Find a one-line command using a that results in:
## [,1] [,2]
## [1,] 10 12
## [2,] 7 13

Find a one-line command using a that results in:
## [,1] [,2]
## [1,] 14 18
## [2,] 8 20

Find a one-line command using a that results in:
## [,1] [,2]
## [1,] 49 81
## [2,] 16 100

Find a one-line command using a and b that results in:
## [,1] [,2]
## [1,] 6 6
## [2,] 2 6

Describe in words what the following command does:
a > 5

Write a one-line command using a that tells you which elements of a are one more than a multiple of 3.

Using c, write a one-line Boolean expression that produces the following:
## [,1] [,2] [,3] [,4]
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE TRUE
## [3,] TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE
7.3.5 Solutions to Practice Exercises

Here’s one way:
a + 3

Here’s one way:
2 * a

Here’s one way:
a^2
Here’s one way:
```r
a - b
```

It produces a logical matrix of the same dimensions as a. The new matrix will have TRUE in a cell when the corresponding cell of a is greater than 5. Otherwise, the cell will have FALSE in it.

Here’s one way:
a %% 3 == 1

Here’s one way:
c >= "h"
7.4 Introduction to Data Frames
R is known as a domain-specific programming language, meaning that although it can in principle perform any sort of computation that a human can perform (given enough pencil, paper and time), it was originally designed to perform tasks in a particular area of application. R’s area of application is data analysis and statistics, especially when performed interactively, i.e., in a setting where the analyst asks for a relatively small computation, examines the results, modifies his or her requests and asks again, and so on.^{23} Although R can be used effectively for a wide range of programming tasks, data analysis is where it really shines.
The data structures of R reflect its orientation to data analysis. We have met a data-oriented structure already: the table, which is one of many convenient ways to display the results of data analysis. For the purpose of organizing data in preparation for analysis, R provides the structure known as the data frame. A data frame facilitates the storage of related data in one location, in a form that makes the most sense to human users.
A data frame is like a matrix in that it is two-dimensional: it has rows and columns. Unlike a matrix, though, the elements of a data frame do not have to be all of the same data type. Each column of a data frame is a vector of the same length as all the others, but these vectors may be of different types: some numerical, some logical, etc.
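A small sketch makes the contrast concrete. The frame and its column names below are invented for illustration, and the data.frame() function used to build it is introduced later in this chapter:

```r
mixed <- data.frame(
  name = c("Dorothy", "Lion"),
  height = c(58, 75),
  brave = c(TRUE, FALSE)
)
sapply(mixed, class)     # each column keeps its own type
as.matrix(mixed)         # forcing a matrix coerces everything to character
```

The second call shows what would be lost if the same data had to live in a matrix: every element would have to share a single type.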
7.4.1 Viewing a Data Frame
Let’s take a close look at a data frame: the frame m111survey, which is available from the bcscr package (White 2021). First let’s attach the package itself:
library(bcscr)
In the RStudio IDE, we can get a look at the frame in a tab in the Editor pane if we use the View() function:
View(m111survey)
As with many objects provided by a package, we can get more information about it:
help("m111survey")
From the Help we see that m111survey records the results of a survey conducted in a number of sections of an elementary statistics course at Georgetown College. From the View we see that the frame is arranged in rows and columns. Each row corresponds to what in data analysis is known as a case or an individual: here, each row goes with a student who participated in the survey. The columns correspond to variables: measurements made on each individual. For a student on a given row, the values in the columns are the values recorded for that student.
When you are not working in RStudio, there are still a couple of ways to view the frame. You could print it all out to the console:
m111survey
You could also use the head() function to view a specified number of initial rows:
head(m111survey, n = 6) # see first six rows
7.4.2 The Structure of a Data Frame
Further information about the frame may be obtained with the str() function:
str(m111survey)
## 'data.frame': 71 obs. of 12 variables:
## $ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
## $ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
## $ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
## $ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
## $ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
## $ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
## $ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
## $ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
## $ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
## $ diff.ideal.act.: num 2 2 NA 3 0 NA 2 3 2 0 ...
The concept of structure extends far beyond the domain of computer programming.^{24} In general the structure of any object consists of:
 the kind of thing that the object is;
 the parts that the object is made up of;
 the relationships between these parts: the rules, if you will, for how the parts work together to make the object do what it does.
In the case of m111survey, the kind of thing it is is given by its class: it’s a data frame.
class(m111survey)
## [1] "data.frame"
Next we see the account of the parts of the object and the way in which the parts relate to one another:
## 71 obs. of 12 variables
From this we know that there are 71 individuals in the study. The data consists of 12 “parts”—the variables—which are related in the sense that they all provide information about the same set of 71 people.
After that the output of str() launches into an account of the structure of each of the parts, for example:
## $ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
We are told the kind of thing that height is: it’s a numerical vector (a vector of type double, in fact). Next we are given the beginning of a statement of its parts: the heights of the individuals. So R is actually giving us the structure of the parts, as well as of the whole m111survey.
The variable fastest refers to the fastest speed (in miles per hour) that a person has ever driven a car. Note that it is a vector of type integer. Officially this is a numerical variable, too, but R is calling attention to the fact that the fastest-speed data is being stored as integers rather than as floating-point decimals.
The variables of a data frame are typically associated with the names of the frame:
names(m111survey)
## [1] "height" "ideal_ht" "sleep" "fastest"
## [5] "weight_feel" "love_first" "extra_life" "seat"
## [9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
By means of the names we can isolate a vector in any column, identified in our code in the format frame$variable. For example, to see the first ten elements of the fastest variable, we ask for:
m111survey$fastest[1:10]
## [1] 119 110 85 100 95 100 85 160 90 90
In order to compute the mean fastest speed our subjects drove their cars, we can ask for:
mean(m111survey$fastest, na.rm = TRUE)
## [1] 105.9014
If you want to see the speeds that are at least 150 miles per hour, you could ask for:
m111survey$fastest[m111survey$fastest >= 150]
## [1] 160 190
If you worry that the form frame$variable will require an annoying amount of typing, as seems to be the case in the example above, then you can use the with() function:
with(m111survey, fastest[fastest >=150])
## [1] 160 190
It’s instructive to consider how with() works. If we were to include the names of the parameters of with() explicitly, then the call would have looked like this:
with(data = m111survey, expr = fastest[fastest >= 150])
For the data parameter we can supply a data frame or any other R object that can be used to construct an environment. In this case m111survey provides a miniature environment consisting of the names of its variables. For the expr parameter we supply an expression for R to evaluate. As R evaluates the expression, it encounters names (such as fastest). Now ordinarily R would first search whatever counts as the active environment (in this case it’s the Global Environment) for the names in the expression, but with() forces R to look first within the environment created by the data argument. In our example, R finds fastest inside m111survey and evaluates the expression on that basis. If it had not found fastest in m111survey, R would have moved on to the Global Environment and then the rest of the usual search path and (probably) would have found nothing, causing it to throw an “object not found” error message. In R, as in any other programming language, good programming depends very much on paying attention to how the language searches for the objects to which names refer.
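The search order can be seen in a small experiment. In this sketch (all names invented for illustration), the name x exists both in the Global Environment and in the data frame, while y exists only in the Global Environment:

```r
x <- 100
y <- 1
toy <- data.frame(x = c(1, 2, 3))
with(toy, x + y)   # x comes from toy, y from the Global Environment
```

The x stored in toy takes precedence over the x defined outside it, and y is found only because R continues its search after failing to find it in toy.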
7.4.3 Factors
Some of the variables in m111survey are called factors; an example is seat, which pertains to where one prefers to sit in a classroom:
str(m111survey$seat)
## Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
Seating preference is an example of a categorical variable: one whose values are not meaningfully expressed in terms of numbers. When a categorical variable has a relatively small number of possible values, it can be convenient to store its values in a vector of class factor.
The levels of a factor variable are its possible values. In the case of seat, these are: Front, Middle and Back. As a memory-saving measure, R stores the values in the factor as numbers, where 1 stands for the first level, 2 for the second level, and so on. But please bear in mind that we are dealing with a categorical variable, so the numbers don’t relate to the possible values in any natural way: they are just storage conventions.
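The integer codes behind a factor can be inspected directly. A sketch with a small made-up factor (not the seat variable itself):

```r
seats <- factor(c("1_front", "3_back", "1_front", "2_middle"))
levels(seats)        # the possible values
as.integer(seats)    # the stored codes: positions within levels(seats)
```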
It’s possible to create a factor from any type of vector, but most often this is done with a character vector. Suppose for instance, that eight people are asked for their favorite Wizard of Oz character and they answer:
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")
We can create a factor variable as follows:
factorFavs <- factor(ozFavs)
factorFavs
## [1] Glinda Toto Toto Dorothy Toto Glinda Scarecrow Dorothy
## Levels: Dorothy Glinda Scarecrow Toto
Note that the levels are given in alphabetical order: this is the default procedure when R creates a factor. It is possible to ask for a different order, though:
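One way to do this is through the levels parameter of factor(). The sketch below requests the ordering shown in the printout that follows:

```r
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")
factor(ozFavs, levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
```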
## [1] Glinda Toto Toto Dorothy Toto Glinda Scarecrow Dorothy
## Levels: Toto Scarecrow Glinda Dorothy
In many instances it is appropriate to convert a character vector to a factor, but sometimes this is not such a great idea. Consider something like your address, or your favorite inspirational quote: pretty much every person in a study will have a different address or favorite quote than others in the study. Hence there won’t be any memory-storage benefit associated with creating a factor: the vector of levels (itself a character vector) would require as much storage space as the original character vector itself! In addition, we will see that the status of a variable as class “factor” can affect how R’s statistical and graphical functions deal with it. It’s not a good idea to treat a categorical variable as a factor unless its set of possible values is considered important.
We will think more about how to deal with factor variables later on, when we begin data analysis in earnest.
7.4.4 Practice Exercises
How would you learn more about the data frame RailTrail from the mosaicData package?

Write a one-line command to see the first 10 rows of RailTrail in the Console.

Write a one-line command to get the names of all of the variables in mosaicData::RailTrail.

Regarding RailTrail: write a one-line command to get the high temperature on all the days when the precipitation was more than 0.5 inches.

Regarding RailTrail: write a one-line command to sort the average temperatures from highest to lowest.
7.4.5 Solutions to the Practice Exercises

One way is to attach the package, then ask for help:
library(mosaicData)
help(RailTrail)
Another way is to name the package in the call to help():
help("RailTrail", package = "mosaicData")
That way you don’t have to add all the items in mosaicData to your search path.

Here’s one way:
head(RailTrail, n = 10)

Here’s one way:
names(mosaicData::RailTrail)

Try this:
7.5 Creating Data Frames
There are many ways to create data frames in R. Here we will introduce just two ways.
7.5.1 Creation from Vectors
Whenever you have vectors of the same length, you can combine them into a data frame, using the data.frame() function:
n <- c("Dorothy", "Lion", "Scarecrow")
h <- c(58, 75, 69)
a <- c(12, 0.04, 18)
ozFolk <- data.frame(name = n, height = h, age = a)
ozFolk
## name height age
## 1 Dorothy 58 12.00
## 2 Lion 75 0.04
## 3 Scarecrow 69 18.00
Note that at the time of creation you can provide the variables with any names that you like. If later on you change your mind about the names, you can always revise them:
names(ozFolk)
## [1] "name" "height" "age"
names(ozFolk)[2] <- "Height" # "height" was at index 2
ozFolk
## name Height age
## 1 Dorothy 58 12.00
## 2 Lion 75 0.04
## 3 Scarecrow 69 18.00
7.5.2 Creation From Other Frames
If two frames have the same number of rows, you may combine their columns to form a new frame with the cbind() function:
ozMore <- data.frame(color = c("blue", "red", "yellow"),
                     desire = c("Kansas", "courage", "brains"))
cbind(ozFolk, ozMore)
## name Height age color desire
## 1 Dorothy 58 12.00 blue Kansas
## 2 Lion 75 0.04 red courage
## 3 Scarecrow 69 18.00 yellow brains
Similarly, if two data frames have the same number and type of columns, with the same names, then we can use the rbind() function to combine them:
ozFolk2 <- data.frame(
name = c("Toto", "Glinda"),
Height = c(12, 66), age = c(3, 246)
)
rbind(ozFolk, ozFolk2)
## name Height age
## 1 Dorothy 58 12.00
## 2 Lion 75 0.04
## 3 Scarecrow 69 18.00
## 4 Toto 12 3.00
## 5 Glinda 66 246.00
7.6 Subsetting Data Frames
Our study of subsetting matrices can be applied to the selection of parts of a data frame. As with a matrix, one or both of the dimensions of the frame can come into play.
We can create a new data frame consisting of any columns we like from the original frame:
df <- m111survey[, c("height", "ideal_ht")]
head(df)
## height ideal_ht
## 1 76.0 78
## 2 74.0 76
## 3 64.0 NA
## 4 62.0 65
## 5 72.0 72
## 6 70.8 NA
If we select just one column, then the result is a vector rather than a data frame:
df <- m111survey[, "height"]
is.vector(df)
## [1] TRUE
If for some reason you want to prevent this, set drop to FALSE:
df <- m111survey[, "height", drop = FALSE]
head(df)
## height
## 1 76.0
## 2 74.0
## 3 64.0
## 4 62.0
## 5 72.0
## 6 70.8
You may select particular rows, too:
m111survey[10:15, c("height", "ideal_ht")]
## height ideal_ht
## 10 67 67
## 11 65 69
## 12 62 62
## 13 59 62
## 14 78 75
## 15 69 72
You can even select some of the rows at random. Here is a random sample of size six:
n <- nrow(m111survey)
df <- m111survey[sample(1:n, size = 6, replace = FALSE), ]
df[c("sex", "seat")] # show just two columns
## sex seat
## 13 female 1_front
## 54 male 2_middle
## 56 male 3_back
## 28 female 1_front
## 53 female 3_back
## 46 female 2_middle
Note the function nrow() that gives the number of rows of the frame. When we sample six items without replacement from the vector 1:n, we are picking six numbers at random from the row numbers of the frame. Specifying these six numbers in the selection operator [ yields the desired random sample of rows.
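Because the rows are chosen at random, the sample changes from run to run. If a reproducible sample is wanted, one option is to set the seed of the random number generator first. A sketch with a small made-up frame (the names toy, rows1 and rows2 are our own, and the seed value 2024 is arbitrary):

```r
toy <- data.frame(id = 1:20, score = (1:20)^2)
set.seed(2024)
rows1 <- sample(1:nrow(toy), size = 6, replace = FALSE)
set.seed(2024)
rows2 <- sample(1:nrow(toy), size = 6, replace = FALSE)
identical(rows1, rows2)   # TRUE: same seed, same sample
toy[rows1, ]
```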
7.6.1 Boolean Expressions
It is especially common to select rows by the values of a logical vector. For example, to select the rows where the fastest speed ever driven is at least 150 miles per hour, try this:
df <- m111survey[m111survey$fastest >= 150, ]
df[, c("sex", "fastest")] # show just two of the variables
## sex fastest
## 8 male 160
## 32 male 190
When you are selecting rows it can be convenient to use the subset() function. The first argument to the function is the frame from which you plan to select, and the second is the Boolean expression by which to select:
df <- subset(m111survey, fastest >= 150)
df[, c("sex", "fastest")]
## sex fastest
## 8 male 160
## 32 male 190
Note that we did not need to type m111survey$fastest: the first argument to subset() provides the environment in which to search for names that appear in the Boolean expression.
The Boolean subsetting expressions can be quite complex:
df <- subset(m111survey, seat == "3_back" & height < 72 & sex == "female")
df[, c("sex", "height", "seat")]
## sex height seat
## 9 female 59 3_back
## 20 female 65 3_back
## 30 female 69 3_back
## 53 female 69 3_back
## 70 female 65 3_back
Note: subset() takes a third parameter called select that allows you to pick out any desired columns. For example:
subset(m111survey, seat == "3_back" & height < 72 & sex == "female",
select = c("sex", "height", "seat"))
## sex height seat
## 9 female 59 3_back
## 20 female 65 3_back
## 30 female 69 3_back
## 53 female 69 3_back
## 70 female 65 3_back
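For comparison, the same kind of selection can be written with the [ operator alone. The sketch below uses a small invented frame with no missing values (all names are our own) and checks that the two approaches agree:

```r
folk <- data.frame(
  name = c("Dorothy", "Lion", "Scarecrow", "Toto"),
  height = c(58, 75, 69, 12),
  age = c(12, 0.04, 18, 3)
)
viaSubset <- subset(folk, height > 50 & age > 1, select = c("name", "height"))
viaBracket <- folk[folk$height > 50 & folk$age > 1, c("name", "height")]
all.equal(viaSubset, viaBracket)   # compares the two results
```

The subset() version is shorter because the column names can be written without the folk$ prefix.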
7.6.2 Practice Exercises
We’ll use the CPS85 data frame from the mosaicData package. You should go ahead and load the package and then read about the data frame:
library(mosaicData)
?CPS85
Each row in the data frame corresponds to an employee in the survey.
Write a command that gives the number of employees in the data frame.
Select the employees who are between 40 and 50 years old.
Select the employees who are married and have fewer than 30 years of experience.
Select the nonunion employees who either live in the South or who have more than 12 years of education (or both).
Select the employees who work in the clerical, construction, management or professional sector.
Select the employees who make more than 30 dollars per hour, and keep only their wage, sex and sector of employment.
Select 10 employees at random, keeping only their wage and sex.
Select all of the employees, keeping all information about them except for their union status and whether or not they are from the South.
7.6.3 Solutions to Practice Exercises
The command is nrow(CPS85).
Try this:
subset(CPS85, age > 40 & age < 50)

Try this:
subset(CPS85, married == "Married" & exper < 30)

Try this:
subset(CPS85, union == "Not" & (south == "S" | educ > 12))

Try this:

Try this:
CPS85[CPS85$wage > 30, c("wage", "sex", "sector")]

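Try this:
n <- nrow(CPS85)
CPS85[sample(1:n, size = 10, replace = FALSE), c("wage", "sex")]

Try this (south and union are columns 6 and 9, respectively):
CPS85[ , -c(6, 9)]
The select parameter of the subset() function has a little-known feature that allows you to specify columns to omit by name, so the following is another solution:
subset(CPS85, select = -c(south, union))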
7.7 Ordering Data Frames
You can reorder as well as select. For example, the following code selects the first five rows of m111survey and then reverses them:
df <- m111survey[1:5, c("height", "ideal_ht")]
df[5:1, ]
## height ideal_ht
## 5 72 72
## 4 62 65
## 3 64 NA
## 2 74 76
## 1 76 78
If you want, you can even scramble the rows of the data frame in a random order:
n <- nrow(m111survey)
shuffle <- sample(1:n, size = n, replace = FALSE)
df <- m111survey[shuffle, ]
head(df[c("sex", "seat")]) #show just two columns
## sex seat
## 25 female 2_middle
## 51 female 2_middle
## 69 female 1_front
## 52 female 2_middle
## 64 male 3_back
## 13 female 1_front
It is quite common to order the rows of a frame according to the values of a particular variable. For example, you might want to arrange the rows by height, so that the frame begins with the shortest subject and ends with the tallest.
Accomplishing this task requires a study of R’s order() function. Consider the following vector:
vec <- c(15, 12, 23, 7)
Call order() with this vector as an argument:
order(vec)
## [1] 4 2 1 3
order() returns the indices of the elements of vec, in the following order:
 the index of the smallest element (7, at index 4 of vec);
 the index of the second-smallest element (12, at index 2 of vec);
 the index of the third-smallest element (15, at index 1 of vec);
 the index of the largest element (23, at index 3 of vec).
Can you guess the output of the following function call without looking for the answer underneath?
vec[order(vec)]
## [1] 7 12 15 23
Sure enough, the result is vec sorted: from smallest to largest element.
Now the sorting of vec could have been accomplished with R’s sort() function:
sort(vec)
## [1] 7 12 15 23
The power of order() comes with the rearrangement of rows of a data frame. In order to “sort” the frame from shortest to tallest subject, call:
df <- m111survey[order(m111survey$height), ]
head(df[, c("sex", "height")]) # to show that it worked
## sex height
## 45 female 51
## 26 female 54
## 9 female 59
## 13 female 59
## 40 female 60
## 69 female 61
If you want to order the rows from tallest to shortest instead, then use the decreasing parameter, which by default is FALSE:
df <- m111survey[order(m111survey$height, decreasing = TRUE), ]
head(df[, c("sex", "height")]) # to show that it worked
## sex height
## 8 male 79
## 14 female 78
## 1 male 76
## 58 male 76
## 34 male 75
## 54 male 75
Sometimes you want to order by two or more variables. For example, suppose you want to arrange the frame so that the folks preferring to sit in front come first, followed by the people who prefer the middle and ending with the people who prefer the back. Within these groups you would like people to be arranged from shortest to tallest. Then call:
ordering <- with(m111survey, order(seat, height))
df <- m111survey[ordering, ]
head(df[, c("seat", "height")], n = 10) # see if it worked
## seat height
## 45 1_front 51
## 26 1_front 54
## 13 1_front 59
## 69 1_front 61
## 4 1_front 62
## 12 1_front 62
## 23 1_front 63
## 38 1_front 63
## 61 1_front 63
## 57 1_front 64
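Sometimes the two variables should run in opposite directions, for instance seats in ascending order but tallest first within each seat group. For a numerical variable, one simple option is to negate it inside order(). The sketch below assumes a small hypothetical data frame, toy, in place of m111survey.

```r
# Hypothetical stand-in for m111survey
toy <- data.frame(
  seat   = factor(c("2_middle", "1_front", "3_back", "1_front", "2_middle")),
  height = c(70, 62, 68, 66, 64)
)

# Ascending by seat, then descending by height within each seat group
ordering <- with(toy, order(seat, -height))
sorted <- toy[ordering, ]
sorted
```

The negation trick applies only to numerical variables, since a factor or character vector cannot be negated.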
7.7.1 Practice Exercises

Consider the following vector:
creatures <- c("Mole", "Frog", "Rat", "Badger")
Write down what you think will be the result of the call:
order(creatures)
Then check your answer by actually running the call.

What will be the result of the following?
order(creatures, decreasing = TRUE)
Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (less experience coming before more experience).
Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (more experience coming before less experience).
7.8 New Variables from Old
Quite often you will want to transform one or more variables in a data frame. Transforming a variable means changing its values in a systematic way.
For example, you might want to measure height in feet rather than inches. Then you want the following:
heightInFeet <- with(m111survey, height/12) # 12 inches in a foot
If you plan to use this new variable in your analysis later on, it might be a good idea to add it to the data frame:
m111survey$height_ft <- heightInFeet
Another common need is to recode the values of a categorical variable. For example, you might want to divide people into two groups: those who prefer to sit in the back and those who don’t. This is a good time to use ifelse():
seat2 <- ifelse(m111survey$seat == "3_back", "Back", "Other")
m111survey$seat2 <- seat2
If you plan to recode into a variable that involves more than two values, then you might want to look into the mapvalues() function from the plyr package (Wickham 2020):
seat3 <- plyr::mapvalues(m111survey$seat,
                         from = c("1_front", "2_middle", "3_back"),
                         to = c("Front", "Middle", "Back"))
str(seat3)
## Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...
The do-it-yourself approach is to write a loop. Remember switch()?
seat <- m111survey$seat
seat3 <- character(length(seat)) # this will be the recoded variable
for ( i in 1:length(seat) ) {
  seat3[i] <- switch(as.character(seat[i]),
                     "1_front" = "Front",
                     "2_middle" = "Middle",
                     "3_back" = "Back")
}
str(seat3)
## chr [1:71] "Front" "Middle" "Middle" "Front" "Back" "Front" "Front" "Back" "Back" ...
The recoding is done but the result is a character vector and not a factor. We have to make it a factor ourselves:
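A minimal sketch of that conversion with factor() follows. To keep it self-contained, seat3 here is a short hypothetical character vector standing in for the one produced by the loop, and the order of the levels is an assumption chosen to run from front to back.

```r
# Stand-in for the character vector produced by the loop
seat3 <- c("Front", "Middle", "Middle", "Front", "Back")

# Convert to a factor, listing the levels in the desired order
seat3 <- factor(seat3, levels = c("Front", "Middle", "Back"))
str(seat3)
```

Without the levels argument, factor() would arrange the levels alphabetically ("Back", "Front", "Middle").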
This seems like a lot of work!
Another common transformation involves turning a numerical variable into a factor. For example, we might need to classify people as:
 Tall (height over 70 inches)
 Medium (65 to 70 inches)
 Short (less than 65 inches)
The cut() function will be helpful.
heightClass <- cut(m111survey$height,
                   breaks = c(-Inf, 65, 70, Inf),
                   labels = c("Short", "Medium", "Tall"),
                   right = TRUE)
str(heightClass)
## Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 2 3 1 2 ...
Setting right = TRUE indicates that the upper bound of each interval is included in the interval. Thus, a person with a height of 70 inches is classed as Medium, not Tall.
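The effect of the right parameter is easiest to see at the boundary values themselves. The sketch below uses a handful of hypothetical heights (not taken from m111survey) and classifies them both ways.

```r
# Hypothetical heights, including the two boundary values 65 and 70
heights <- c(64, 65, 66, 70, 71)
cutPoints <- c(-Inf, 65, 70, Inf)
classNames <- c("Short", "Medium", "Tall")

# right = TRUE: each interval includes its upper bound
closedRight <- cut(heights, breaks = cutPoints, labels = classNames, right = TRUE)
# right = FALSE: each interval includes its lower bound instead
closedLeft <- cut(heights, breaks = cutPoints, labels = classNames, right = FALSE)

data.frame(heights, closedRight, closedLeft)
```

With right = TRUE the heights 65 and 70 fall into Short and Medium respectively; with right = FALSE they move up to Medium and Tall.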
7.8.1 Getting Rid of Variables
We have added several variables to m111survey. In order to remove them (or any other variables we don’t want) we can assign them the value NULL.
names(m111survey)
## [1] "height" "ideal_ht" "sleep" "fastest"
## [5] "weight_feel" "love_first" "extra_life" "seat"
## [9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
## [13] "height_ft" "seat2" "seat3"
m111survey$height_ft <- NULL
m111survey$seat2 <- NULL
m111survey$seat3 <- NULL
names(m111survey) # the extra variables are gone
## [1] "height" "ideal_ht" "sleep" "fastest"
## [5] "weight_feel" "love_first" "extra_life" "seat"
## [9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
7.8.2 Practice Exercises
Remove the variables hispanic and married from the mosaicData::CPS85 data frame.
Change the units of wage in mosaicData::CPS85 from dollars per hour to dollars per day. Assume an eight-hour working day.
For CPS85, create a new variable experGrp that has the following values:
 low for experience less than 10 years;
 medium for experience of at least 10 years but less than 25 years;
 high for experience at least 25 years.
Using the experGrp variable in the previous exercise, create the following tally of the employees by experience group:
## experGrp
##    low medium   high
##    179    217    138
You’ve made some changes to CPS85, but in fact you haven’t changed the original data frame in the mosaicData package; you’ve simply made your own copy, which should now be in your Global Environment. Since the Global Environment comes before any package on your search path, if you want to get to the original CPS85 you will have to refer to it as mosaicData::CPS85. Another option, though, is to remove the modified copy from your Global Environment. Go ahead and remove it now.
7.8.3 Solutions to Practice Exercises

Here’s one way to do it:
CPS85$hispanic <- NULL
CPS85$married <- NULL

Here’s one way to do it:
CPS85$wage <- CPS85$wage * 8

Here’s one way to do it:
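A possible approach uses cut() with right = FALSE, so that each interval includes its lower bound ("at least 10", "at least 25"). The sketch below runs on a short hypothetical vector of experience values; with the real data one would presumably assign the result to CPS85$experGrp, using CPS85$exper as the input.

```r
# Hypothetical experience values (in years), including the cut points 10 and 25
exper <- c(3, 9, 10, 24, 25, 40)

# right = FALSE makes each interval include its lower bound:
# below 10, from 10 up to (not including) 25, and 25 or more
experGrp <- cut(exper,
                breaks = c(-Inf, 10, 25, Inf),
                labels = c("low", "medium", "high"),
                right = FALSE)
table(experGrp)
```

The call to table() at the end is also one way to produce a tally like the one requested in the exercise that follows.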

Here’s what to do:
rm(CPS85)
Glossary
 Matrix

An atomic vector that has two additional attributes: a number of rows and a number of columns.
 Data Frame

A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.
 Case (also called an Individual)

An individual unit under study. In a data frame in R, the rows correspond to cases.
 Variable (in Data Analysis)

In data analysis, a variable is a measurement made on the individuals in a study.
 Categorical Variable (in Data Analysis)

In data analysis, a categorical variable is a variable whose values cannot be expressed meaningfully by numbers.
Exercises

R has a function called t() that computes the transpose of a given matrix. This means that it switches around the rows and columns of the matrix, like this:
myMatrix <- matrix(1:24, nrow = 6)
myMatrix
## [,1] [,2] [,3] [,4]
## [1,] 1 7 13 19
## [2,] 2 8 14 20
## [3,] 3 9 15 21
## [4,] 4 10 16 22
## [5,] 5 11 17 23
## [6,] 6 12 18 24
t(myMatrix)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 2 3 4 5 6
## [2,] 7 8 9 10 11 12
## [3,] 13 14 15 16 17 18
## [4,] 19 20 21 22 23 24
Write your own function called transpose() that will perform the same task on any given matrix. The function should take a single parameter called mat, the matrix to be transposed. Of course you may NOT use t() in the code for your function!
Hints: Your function will have to:
 break mat down into the vector of its elements, and then
 build the new matrix from those elements, with a number of rows equal to the number of columns of mat.
For the first task, as.vector() will be useful.
For the second task, recall (see previous Practice Exercises from this Chapter) that there is a function nrow() that returns the number of rows of a given matrix. It will also be helpful to remember the function ncol() that computes the number of columns of a given matrix.

R has functions called rowSums() and colSums() that will respectively sum the rows and the columns of a matrix. Here is an example:
## [1] 40 44 48 52 56 60
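Output like that shown above is what rowSums() gives for the 6-by-4 matrix used in the previous exercise. The following sketch assumes that same matrix and calls both functions.

```r
# Same 6-by-4 matrix as in the previous exercise (an assumption)
myMatrix <- matrix(1:24, nrow = 6)

rowSums(myMatrix)  # one sum per row: six values
colSums(myMatrix)  # one sum per column: four values
```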
Your task is to write your own function called dimSum() that will sum either the rows or the columns of a given matrix. The function should have two parameters:
 mat: the matrix to be summed.
 dim: the dimension to sum along, either rows or columns. The default value should be "rows". If the user sets dim to "columns" then the function would compute the column-sums.
You may NOT use rowSums() or colSums() in the code for your function. A typical example of use should look like this:
myMatrix <- matrix(1:24, nrow = 6)
dimSum(myMatrix)
## [1] 40 44 48 52 56 60
dimSum(myMatrix, "columns")
## [1] 21 57 93 129
Hint: Recall that in previous Practice Exercises of this Chapter we made a function called myRowSums() that sums the rows of any given matrix. Modify the idea for myRowSums() to write a function called myColSums() that finds the column-sums of any given matrix. You may then use the two previously-created functions to write the required function dimSum().
Starting with m111survey in the bcscr package, write the code necessary to create a new data frame called smaller that consists precisely of the male students who believe in extraterrestrial life and who are more than 68 inches tall. The new data frame should contain all of the original variables except for sex and extra_life.
Write a function called dfRandSelect() that randomly selects (without replacement) a specified number of rows from a given data frame. The function should have two parameters:
 df: the data frame from which to select;
 n: the number of rows to select.
If n is greater than the number of rows in df, the function should return immediately with a message informing the user that the required task is not possible and informing him/her of the number of rows in df. Typical examples of use should be as follows:
dfRandSelect(bcscr::fuel, 5)
## speed efficiency
## 12 120 9.87
## 15 150 12.83
## 7 70 6.30
## 6 60 5.90
## 8 80 6.95
dfRandSelect(bcscr::fuel, 200)
## No can do! The frame has only 15 rows.
Hint: Use the function nrow(), which gives the number of rows of a matrix or data frame.

(*) Create your own data frame, named myFrame. The frame should have 100 rows, along with the following variables:
 lowerLetters: a character vector of randomly-produced 3-letter strings, like “chj,” “bbw,” and so on. The letters should all be lowercase.
 height: a numerical vector consisting of real numbers chosen randomly between the values of 60 and 75.
 sex: a factor whose possible values are “female” and “male.” Again, these values should be chosen randomly.
A call to str(myFrame) would come out like this (although your results will vary a bit since the vectors are constructed randomly):
str(myFrame)
## 'data.frame': 100 obs. of 3 variables:
## $ lowerLetters: chr "usu" "uhl" "xyj" "uyd" ...
## $ height : num 73.7 72.4 73.8 65.2 61.3 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 2 1 1 1 ...
summary() is useful when working with data frames. Here is how a call to summary(myFrame) might look:
summary(myFrame)
##  lowerLetters           height          sex
##  Length:100         Min.   :60.00   female:57
##  Class :character   1st Qu.:63.63   male  :43
##  Mode  :character   Median :68.28
##                     Mean   :67.62
##                     3rd Qu.:71.63
##                     Max.   :74.57
Hint: If you have a vector of three letters, such as
vec <- c("g", "a", "r")
then you can paste them together as follows:
paste0(vec, collapse = "")
## [1] "gar"

(*) Study the data frame fuel in the bcscr package. Note that the fuel efficiency is reported as the number of liters of fuel required to travel 100 kilometers. Look up the conversion between gallons and liters and between kilometers and miles, and use this information to create a new variable called mpg that gives the fuel efficiency as miles per gallon. While you are at it, create a new variable mph that gives the speed in miles per hour. Finally, add these new variables to the fuel data frame.
(*) Use matrices to generalize the simulation in the Appeals Court Paradox (see Section 6.5). Your goal is to write a simulation function called appealsSimPlus() that comes with all the options provided in the text, but with additional parameters so that the user can choose:
 the number of judges on the court;
 the probability for each judge to make a correct decision;
 the voting pattern (how many votes each judge gets).
A typical call to the function should look like this:
appealsSimPlus(reps = 10000, seed = 5252, probs = c(0.95, 0.90, 0.90, 0.90, 0.80), votes = c(2, 1, 1, 1, 0))
In the above call the court consists of five judges. The best one decides cases correctly 95% of the time, three are right 90% of the time and one is right 80% of the time. The voting arrangement is that the best judge gets two votes, the next three get one vote each, and the worst gets no vote. Any voting scheme, even a scheme involving fractional votes, should be allowed so long as the votes add up to the number of judges.
Here is a hint. When you write the function it may be helpful to use the fact that rbinom() can take a prob parameter that is a vector of any length. Here’s an example:
## [1] 20 49 94 15 50 88
The first and fourth entries simulate a person tossing a coin 100 times when she has only a 10% chance of heads. The second and fifth entries simulate the same, when the chance of heads is 50%. The third and sixth simulate coin-tossing when there is a 90% chance of heads.
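A call of the following shape matches that description: six simulations of 100 tosses each, with the three probabilities recycled in turn. This is only a sketch; the seed is arbitrary (an assumption made for repeatability), so the numbers it prints will differ from those shown above.

```r
set.seed(5252)  # arbitrary seed, only so that the sketch is repeatable

# Six simulations of 100 tosses each; prob is recycled as
# 0.1, 0.5, 0.9, 0.1, 0.5, 0.9 across the six entries
results <- rbinom(6, size = 100, prob = c(0.1, 0.5, 0.9))
results
```

Each entry of results counts the heads in one run of 100 tosses, so every entry lies between 0 and 100.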
If you would like to arrange the results more nicely, say in a matrix where each column gives the results for a different person, you can do so:
resultsMat <- matrix(results, ncol = 3, byrow = TRUE)
resultsMat
## [,1] [,2] [,3]
## [1,] 20 49 94
## [2,] 15 50 88
Of course judges don’t flip a coin 100 times; they decide one case at a time. Suppose you have five judges with probabilities as follows:
probCorrect <- c(0.95, 0.90, 0.90, 0.90, 0.80)
If you would like to simulate the judges deciding, say, 6 cases, try this:
results <- rbinom(5*6, size = 1, prob = rep(probCorrect, 6))
resultsMat <- matrix(results, nrow = 6, byrow = TRUE)
resultsMat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 0 1
## [2,] 0 1 1 1 1
## [3,] 1 1 1 1 1
## [4,] 1 1 1 1 1
## [5,] 1 1 1 1 1
## [6,] 1 1 1 1 0
When it comes to applying the voting pattern to compute the decision in each case, consider matrix multiplication. For example, suppose that the pattern is:
votes <- c(2, 1, 1, 1, 0)
Then make votes a one-column matrix and perform matrix multiplication:
correctVotes <- resultsMat %*% matrix(votes, nrow = 5)
correctVotes
## [,1]
## [1,] 4
## [2,] 3
## [3,] 5
## [4,] 5
## [5,] 5
## [6,] 5
Think about how to encapsulate all of this into a nice, general simulation function.