Can one be a good data analyst without being a half-good programmer? The short answer to that is, ‘No.’ The long answer to that is, ‘No.’
-- Frank Harrell
7 Data Frames
Up to this point we have given a great deal of attention to vectors, and we have always treated them as one-dimensional objects: a vector has a length, but not a “width.”
It is time to begin working in two dimensions. In this Chapter we will study matrices, which are simply vectors that have both length and width. Matrices are immensely useful for scientific computation in R, but for the most part we will treat them as a warm-up for data frames—the two-dimensional R-objects that are especially designed for the storage of data collected in the course of practical data analysis. Once you understand how to construct and manipulate data frames, you will be ready to learn how to visualize and analyze data using R.
7.1 Introduction to Matrices
In R, a matrix is actually an atomic vector—it can only hold one type of element—but with two extra attributes:
- a certain number of rows, and
- a certain number of columns.
One way to create a matrix is to take a vector and give it those two extra attributes, via the matrix() function. Here is an example:
numbers <- 1:24 # this is an ordinary atomic vector
numbersMat <- matrix(numbers, nrow = 6, ncol = 4) # make a matrix
numbersMat
[,1] [,2] [,3] [,4]
[1,] 1 7 13 19
[2,] 2 8 14 20
[3,] 3 9 15 21
[4,] 4 10 16 22
[5,] 5 11 17 23
[6,] 6 12 18 24
Of course if you are making a matrix out of 24 numbers and you know that it’s going to have 6 rows, then you know it must have 4 columns. Similarly, if you know the number of columns then the number of rows is determined. Hence you could have constructed the matrix with just one of the row or column arguments, like this:
numbersMat <- matrix(numbers, nrow = 6)
Notice that the numbers went down the first column, then down the second, and so on. If you would rather fill up the matrix row-by-row, then set the byrow parameter, which is FALSE by default, to TRUE:
matrix(numbers, nrow = 6, byrow = TRUE)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
[6,] 21 22 23 24
Sometimes we like to give names to our rows, or to our columns, or even to both:
rownames(numbersMat) <- letters[1:6]
colnames(numbersMat) <- LETTERS[1:4]
numbersMat
A B C D
a 1 7 13 19
b 2 8 14 20
c 3 9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24
Matrices don’t have to be numerical. They can be character or logical matrices as well:
creatures <- c(
  "Dorothy", "Lion", "Scarecrow",
  "Oz", "Toto", "Boq"
)
matrix(creatures, ncol = 2)
[,1] [,2]
[1,] "Dorothy" "Oz"
[2,] "Lion" "Toto"
[3,] "Scarecrow" "Boq"
If you have to spread out the elements of a matrix into a one-dimensional vector, you can do so:
as.vector(numbersMat)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
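To make the idea of "a vector with two extra attributes" concrete, here is a minimal sketch in base R. The names v, m, same and flat are chosen only for this illustration. It suggests that attaching a dim attribute to an ordinary vector gives the same kind of object that matrix() builds, and that removing the attribute gives back a plain vector.

```r
# v is an ordinary atomic vector; m starts out as a copy of it
v <- 1:24
m <- v
attributes(m)        # NULL: a plain vector carries no attributes

# Giving m a dim attribute (6 rows, 4 columns) makes it a matrix
dim(m) <- c(6, 4)
is.matrix(m)         # TRUE
dim(m)               # 6 4

# The elements fill column by column, as with matrix() by default
same <- identical(m, matrix(v, nrow = 6))
same

# Stripping the dim attribute recovers a plain vector
flat <- m
dim(flat) <- NULL
is.matrix(flat)      # FALSE
```

In everyday work matrix() is the more readable choice; the sketch is only meant to show what the function arranges behind the scenes.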
7.1.1 Practice Exercises
Let’s work with the following vector:
dozen <- letters[1:12]
- Starting with dozen, write a command that produces the following matrix:
[,1] [,2] [,3] [,4]
[1,] "a" "d" "g" "j"
[2,] "b" "e" "h" "k"
[3,] "c" "f" "i" "l"
- Starting with dozen, write a command that produces the following matrix:
[,1] [,2] [,3]
[1,] "a" "e" "i"
[2,] "b" "f" "j"
[3,] "c" "g" "k"
[4,] "d" "h" "l"
- Starting with dozen, write a command that produces the following matrix:
[,1] [,2] [,3]
[1,] "a" "b" "c"
[2,] "d" "e" "f"
[3,] "g" "h" "i"
[4,] "j" "k" "l"
- Starting with dozen, write commands that produce the following matrix:
c1 c2 c3
r1 "a" "b" "c"
r2 "d" "e" "f"
r3 "g" "h" "i"
r4 "j" "k" "l"
- Suppose you make the following matrix:
smallMat <- matrix(c(8, 5, 3, 4), nrow = 2)
smallMat
[,1] [,2]
[1,] 8 3
[2,] 5 4
What’s a one-line command to get the following vector from smallMat?
[1] 8 5 3 4
- nrow() is a function that, when given a matrix, will tell you the number of rows in that matrix. Write a one-line command to find the number of rows in a matrix called mysteryMat.
- ncol() is a function that, when given a matrix, will tell you the number of columns in that matrix. Write a one-line command to find the number of columns in a matrix called mysteryMat.
7.1.2 Solutions to Practice Exercises
- Here’s one way to do it:
matrix(dozen, nrow = 3)
Here’s another way:
matrix(dozen, ncol = 4)
- Here’s one way to do it:
matrix(dozen, nrow = 4)
- Here’s one way to do it:
matrix(dozen, nrow = 4, byrow = TRUE)
- Here’s one way to do it:
answerMatrix <- matrix(dozen, nrow = 4, byrow = TRUE)
rownames(answerMatrix) <- c("r1", "r2", "r3", "r4")
colnames(answerMatrix) <- c("c1", "c2", "c3")
answerMatrix
- Here’s how:
as.vector(smallMat)
- The command nrow(mysteryMat) will work.
- The command ncol(mysteryMat) will work.
7.2 Matrix Indexing
Matrices are incredibly useful in data analysis, but the primary reason we are talking about them now is to get you used to working in two dimensions. Let’s practice sub-setting with matrices.
We use the sub-setting operator [ to pick out parts of a matrix. For example, in order to get the element in the second row and third column of numbersMat, ask for:
numbersMat[2, 3]
[1] 14
The row and column numbers are called indices.
If we want the entire second row, then we could ask for:
numbersMat[2, 1:4]
A B C D
2 8 14 20
The result is a one-dimensional vector consisting of the elements in the second row of numbersMat. It inherits as its names the column names of numbersMat.
Actually, if you want the entire row you don’t have to specify which columns you want. Just leave the spot after the comma empty, like this:
numbersMat[2, ]
A B C D
2 8 14 20
What if you want some items on the second row, but only the items in columns 1, 2 and 4? Then frame your request in terms of a vector of column-indices:
numbersMat[2, c(1, 2, 4)]
A B D
2 8 20
You can specify a vector of row-indices along with a vector of column-indices, if you like:
numbersMat[1:2, 1:3]
A B C
a 1 7 13
b 2 8 14
If the matrix has row or column names then you may use them in place of indices to make a selection:
numbersMat[, c("B", "D")]
B D
a 7 19
b 8 20
c 9 21
d 10 22
e 11 23
f 12 24
You can use sub-setting to change the values of the elements of a matrix:
numbersMat[2, 3] <- 0
numbersMat
A B C D
a 1 7 13 19
b 2 8 0 20
c 3 9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24
You can assign a value to an entire row:
numbersMat[2, ] <- 0
numbersMat
A B C D
a 1 7 13 19
b 0 0 0 0
c 3 9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24
In the code above, the 0 was “recycled” into each of the four elements of the second row.
You can assign the elements of a vector to corresponding selected elements of a matrix:
numbersMat[2, ] <- c(100, 200, 300, 400)
numbersMat
A B C D
a 1 7 13 19
b 100 200 300 400
c 3 9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24
7.2.1 To Drop or Not?
Note that when we asked for a single row of numbersMat we got a regular one-dimensional vector:
numbersMat[3, ]
A B C D
3 9 15 21
The same thing happens if we ask for a single column:
numbersMat[ , 2]
a b c d e f
7 200 9 10 11 12
We get the second column of numbersMat, but as a regular vector. It’s not a “column” anymore. (Note that it inherits the row names from numbersMat.)
When a subset of a matrix comes from only one row or column, R takes the opportunity to “drop” the class of the subset from “matrix” to “vector.” If you would like the subset to stay a matrix, set the drop parameter, which by default is TRUE, to FALSE. Thus the second column of numbersMat, kept as a matrix with six rows and one column, is found as follows:
numbersMat[ , 2, drop = FALSE]
B
a 7
b 200
c 9
d 10
e 11
f 12
In most applications people want the simpler vector structure, so they usually leave drop at its default value.
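The difference that drop makes can be checked directly. The following is a small sketch in base R; the names m, colVec and colMat are chosen for this illustration and do not appear elsewhere in the chapter.

```r
m <- matrix(1:24, nrow = 6)

colVec <- m[, 2]                # default: drop = TRUE
colMat <- m[, 2, drop = FALSE]  # keep the two-dimensional structure

is.matrix(colVec)   # FALSE
dim(colVec)         # NULL: a plain vector has no dimensions
is.matrix(colMat)   # TRUE
dim(colMat)         # 6 1
```

One situation where drop = FALSE can matter is code that goes on to call nrow() or ncol() on the result: both return NULL for a plain vector, which may surprise a function that expects a matrix.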
7.2.2 Practice Exercises
In these exercises we’ll work with the following matrix:
numbers <- 1:40
practiceMatrix <- matrix(numbers, nrow = 4)
rownames(practiceMatrix) <- letters[1:4]
colnames(practiceMatrix) <- LETTERS[1:10]
practiceMatrix
A B C D E F G H I J
a 1 5 9 13 17 21 25 29 33 37
b 2 6 10 14 18 22 26 30 34 38
c 3 7 11 15 19 23 27 31 35 39
d 4 8 12 16 20 24 28 32 36 40
- Write two different one-line commands to get this matrix:
B C D E
a 5 9 13 17
c 7 11 15 19
- Write a one-line command to get this matrix:
A C E G I
a 1 9 17 25 33
b 2 10 18 26 34
c 3 11 19 27 35
d 4 12 20 28 36
- Write a one-line command to get this vector:
a b c d
1 2 3 4
- Write a one-line command to get this vector:
A B C D E F G H I J
2 6 10 14 18 22 26 30 34 38
- Write a one-line command to get this matrix:
A
a 1
b 2
c 3
d 4
- Write a convenient one-line command to get this matrix:
A B C D E F G H I
a 1 5 9 13 17 21 25 29 33
b 2 6 10 14 18 22 26 30 34
c 3 7 11 15 19 23 27 31 35
d 4 8 12 16 20 24 28 32 36
- Write a convenient one-line command to get this matrix:
A C D E F G H I
a 1 9 13 17 21 25 29 33
b 2 10 14 18 22 26 30 34
c 3 11 15 19 23 27 31 35
d 4 12 16 20 24 28 32 36
- Write a function called myRowSums() that will find the sums of the rows of any given matrix. The function should use a for-loop (see the Chapter on Flow Control). The function should take a single parameter called mat, the matrix whose rows the user wishes to sum. It should work like this:
myMatrix <- matrix(1:24, ncol = 6)
myRowSums(mat = myMatrix)
[1] 66 72 78 84
7.2.3 Solutions to Practice Exercises
- Here are two ways:
practiceMatrix[c(1,3), 2:5]
practiceMatrix[c("a","c"), 2:5]
- Here’s one way:
practiceMatrix[ , seq(1, 9, by = 2)]
- Here’s one way:
practiceMatrix[ , 1]
- Here’s one way:
practiceMatrix[2, ]
- Here’s one way:
practiceMatrix[ , 1, drop = FALSE]
- Here’s one way:
practiceMatrix[ , -10]
- Here’s one way:
practiceMatrix[ , -c(2, 10)]
- Here is one way to write the function:
myRowSums <- function(mat) {
  n <- nrow(mat)
  sums <- numeric(n)
  for (i in 1:n) {
    sums[i] <- sum(mat[i, ])
  }
  sums
}
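As a cross-check on a loop-based solution, the result can be compared with R's built-in rowSums() function. The sketch below is a variation rather than the solution given above: it swaps 1:n for seq_len(n), an assumption made here so that a matrix with zero rows produces an empty result instead of an error. The name emptyMatrix is hypothetical.

```r
# Loop-based row sums; seq_len(n) yields an empty sequence when n is 0,
# whereas 1:n would count down from 1 to 0
myRowSums <- function(mat) {
  n <- nrow(mat)
  sums <- numeric(n)
  for (i in seq_len(n)) {
    sums[i] <- sum(mat[i, ])
  }
  sums
}

myMatrix <- matrix(1:24, ncol = 6)
myRowSums(mat = myMatrix)   # 66 72 78 84
rowSums(myMatrix)           # the built-in function gives the same sums

# A matrix with no rows: the loop body never runs
emptyMatrix <- matrix(numeric(0), nrow = 0, ncol = 3)
myRowSums(mat = emptyMatrix)   # numeric(0)
```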
7.3 Operations on Matrices
Matrices can be involved in arithmetical and logical operations.
7.3.1 Arithmetical Operations
The usual arithmetic operations apply to matrices, operating element-wise. For example, suppose that we have:
mat1 <- matrix(rep(1, 4), nrow = 2)
mat2 <- matrix(rep(2, 4), nrow = 2)
To get the sum of the above two matrices, R adds their corresponding elements and forms a new matrix out of their sums, thus:
mat1 + mat2
[,1] [,2]
[1,] 3 3
[2,] 3 3
R applies recycling as needed. For example, suppose we have:
mat <- matrix(1:4, nrow = 2)
mat
[,1] [,2]
[1,] 1 3
[2,] 2 4
In order to multiply each element of mat by 2, we need not create a 2-by-2 matrix of 2’s. We can simply multiply by 2, and R will take care of recycling the 2:
2 * mat
[,1] [,2]
[1,] 2 6
[2,] 4 8
Or we could subtract 3 from each element of mat:
mat - 3
[,1] [,2]
[1,] -2 0
[2,] -1 1
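Recycling is not limited to single numbers. Because a matrix is stored column by column, a shorter vector is recycled down the columns. The sketch below (the name scaled is chosen for this illustration) multiplies a 2-by-2 matrix by a vector of length 2, so that each row ends up with its own multiplier.

```r
mat <- matrix(1:4, nrow = 2)

# The elements of mat, in storage order, are 1, 2, 3, 4.
# The vector c(10, 100) is recycled to 10, 100, 10, 100,
# so row 1 is multiplied by 10 and row 2 by 100.
scaled <- mat * c(10, 100)
scaled
#      [,1] [,2]
# [1,]   10   30
# [2,]  200  400
```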
7.3.2 Logical Operations
Boolean operations apply to matrices element-wise, just as they do to ordinary vectors. The result is a matrix of logical values. For example, consider the original matrix numbersMat:
numbersMat <- matrix(1:24, nrow = 6)
Suppose we wish to determine which elements of numbersMat are odd. Then we simply ask whether the remainder of an element after division by 2 is equal to 1:
numbersMat %% 2 == 1
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE TRUE
[2,] FALSE FALSE FALSE FALSE
[3,] TRUE TRUE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE
[5,] TRUE TRUE TRUE TRUE
[6,] FALSE FALSE FALSE FALSE
We can select elements from a matrix using a Boolean operator, too:
numbersMat[numbersMat %% 2 == 1]
[1] 1 3 5 7 9 11 13 15 17 19 21 23
Note that the result is an ordinary, one-dimensional vector.
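A logical matrix can also be used for counting and for assignment. In the sketch below the names isOdd and zeroed are chosen for this illustration: summing the logical matrix counts its TRUE values, and assigning through it changes only the selected elements while the result keeps its matrix shape.

```r
numbersMat <- matrix(1:24, nrow = 6)
isOdd <- numbersMat %% 2 == 1   # a 6-by-4 logical matrix

sum(isOdd)      # 12: each TRUE counts as 1

# Work on a copy so that numbersMat itself is unchanged
zeroed <- numbersMat
zeroed[isOdd] <- 0   # every odd element becomes 0
dim(zeroed)          # 6 4: still a matrix
```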
7.3.3 Practice Exercises
We’ll work with the following three matrices:
a <- matrix(c(7, 4, 9, 10), nrow = 2)
a
[,1] [,2]
[1,] 7 9
[2,] 4 10
b <- matrix(1:4, nrow = 2)
b
[,1] [,2]
[1,] 1 3
[2,] 2 4
c <- matrix(letters[1:24], nrow = 6, byrow = TRUE)
c
[,1] [,2] [,3] [,4]
[1,] "a" "b" "c" "d"
[2,] "e" "f" "g" "h"
[3,] "i" "j" "k" "l"
[4,] "m" "n" "o" "p"
[5,] "q" "r" "s" "t"
[6,] "u" "v" "w" "x"
- Find a one-line command using a that results in:
[,1] [,2]
[1,] 10 12
[2,] 7 13
- Find a one-line command using a that results in:
[,1] [,2]
[1,] 14 18
[2,] 8 20
- Find a one-line command using a that results in:
[,1] [,2]
[1,] 49 81
[2,] 16 100
- Find a one-line command using a and b that results in:
[,1] [,2]
[1,] 6 6
[2,] 2 6
- Describe in words what the following command does:
a > 5
- Write a one-line command using a that tells you which elements of a are one more than a multiple of 3.
- Using c, write a one-line boolean expression that produces the following:
[,1] [,2] [,3] [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE TRUE
[3,] TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE
[5,] TRUE TRUE TRUE TRUE
[6,] TRUE TRUE TRUE TRUE
7.3.4 Solutions to Practice Exercises
- Here’s one way:
a + 3
- Here’s one way:
2 * a
- Here’s one way:
a^2
- Here’s one way:
a - b
- It produces a logical matrix of the same dimensions as a. The new matrix will have TRUE in a cell when the corresponding cell of a is greater than 5. Otherwise, the cell will have FALSE in it.
- Here’s one way:
a %% 3 == 1
- Here’s one way:
c >= "h"
7.4 Introduction to Data Frames
R is sometimes spoken of as a domain-specific programming language, meaning that although it can in principle perform any sort of computation that a human can perform (given enough pencil, paper and time), it was originally designed to perform tasks in a particular area of application. R’s original area of application is data analysis and statistics, especially when performed interactively—i.e., in a setting where the analyst asks for a relatively small computation, examines the results, modifies his or her requests and asks again, and so on.1 Although R can be used effectively for a wide range of programming tasks, data analysis is where it really shines.
The data structures of R reflect its orientation to data analysis. We have met a data-oriented structure already—the table, which is one of many convenient ways to display the results of data analysis. For the purpose of organizing data in preparation for analysis, R provides the structure known as the data frame. A data frame facilitates the storage of related data in one location, in a form that makes the most sense to human users.
A data frame is like a matrix in that it is two-dimensional—it has rows and columns. Unlike a matrix, though, the elements of a data frame do not have to be all of the same data-type. Each column of a data frame is a vector—of the same length as all the others—but these vectors may be of different types: some numerical, some logical, etc.
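The contrast with a matrix can be seen in a small example. The sketch below builds a made-up data frame (the name pets and its values are invented purely for illustration) and then checks the class of each column. Forcing the same data into a matrix shows what is lost: everything is converted to a single type.

```r
# A tiny made-up data frame with three columns of different types
pets <- data.frame(
  name = c("Toto", "Felix", "Rex"),
  weight = c(12.5, 9, 40),
  vaccinated = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep name as a character vector
)

# Each column keeps its own type
colClasses <- sapply(pets, class)
colClasses
#        name      weight  vaccinated
# "character"   "numeric"   "logical"

# A matrix can hold only one type, so everything becomes character
petsMat <- as.matrix(pets)
is.character(petsMat)   # TRUE
```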
7.4.1 Viewing a Data Frame
Let’s take a close look at a data frame: the frame m111survey, which is available from the bcscr package (White 2024). First let’s attach the package itself:
library(bcscr)
In the RStudio IDE, we can get a look at the frame in a tab in the Editor pane if we use the View() function:
View(m111survey)
As with many objects provided by a package, we can get more information about it:
help("m111survey")
From the Help we see that m111survey
records the results of a survey conducted in a number of sections of an elementary statistics course at Georgetown College. From the View we see that the frame is arranged in rows and columns. Each row corresponds to what in data analysis is known as a case or an individual: here, each row goes with a student who participated in the survey. The columns correspond to variables: measurements made on each individual. For a student on a given row, the values in the columns are the values recorded for that student.
When you are not working in RStudio, there are still a couple of ways to view the frame. You could print it all out to the console:
m111survey
You could also use the head() function to view a specified number of initial rows:
head(m111survey, n = 6) # see first six rows
7.4.2 The Structure of a Data Frame
Further information about the frame may be obtained with the str() function:
str(m111survey)
'data.frame': 71 obs. of 12 variables:
$ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
$ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
$ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
$ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
$ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
$ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
$ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
$ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
$ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
$ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
The concept of structure extends far beyond the domain of computer programming.2 In general the structure of any object consists of:
- the kind of thing that the object is;
- the parts the object is made up of;
- the relationships between these parts: the rules, if you will, for how the parts work together to make the object do what it does.
In the case of m111survey, the kind of thing it is, is given by its class: it’s a data frame.
class(m111survey)
[1] "data.frame"
Next we see the account of the parts of the object and the way in which the parts relate to one another:
71 obs. of 12 variables
From this we know that there are 71 individuals in the study. The data consists of 12 “parts”—the variables—which are related in the sense that they all provide information about the same set of 71 people.
After that the output of str() launches into an account of the structure of each of the parts, for example:
$ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
We are told the kind of thing that height is: it’s a numerical vector (a vector of type double, in fact). Next we are given the beginning of a statement of its parts: the heights of the individuals. So R is actually giving us the structure of the parts, as well as of the whole m111survey.
The variable fastest refers to the fastest speed, in miles per hour, that a person has ever driven a car. Note that it is a vector of type integer. Officially this is a numerical variable, too, but R is calling attention to the fact that the fastest-speed data is being stored as integers rather than as floating-point decimals.
The variables of a data frame are typically associated with the names of the frame:
names(m111survey)
[1] "height" "ideal_ht" "sleep" "fastest"
[5] "weight_feel" "love_first" "extra_life" "seat"
[9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
By means of the names we can isolate a vector in any column, identified in our code in the format frame$variable. For example, to see the first ten elements of the fastest variable, we ask for:
m111survey$fastest[1:10]
[1] 119 110 85 100 95 100 85 160 90 90
In order to compute the mean fastest speed our subjects drove their cars, we can ask for:
mean(m111survey$fastest, na.rm = TRUE)
[1] 105.9014
If you want to see the speeds that are at least 150 miles per hour, you could ask for:
m111survey$fastest[m111survey$fastest >= 150]
[1] 160 190
If you worry that the form frame$variable will require an annoying amount of typing, as seems to be the case in the example above, then you can use the with() function:
with(m111survey, fastest[fastest >= 150])
[1] 160 190
It’s instructive to consider how with() works. If we were to include the names of the parameters of with() explicitly, then the call would have looked like this:
with(data = m111survey, expr = fastest[fastest >= 150])
For the data parameter we can supply a data frame or any other R-object that can be used to construct an environment. In this case m111survey provides a miniature environment consisting of the names of its variables. For the expr parameter we supply an expression for R to evaluate. As R evaluates the expression, it encounters names (such as fastest). Now ordinarily R would first search whatever counts as the active environment (in this case the Global Environment) for the names in the expression, but with() forces R to look first within the environment created by the data argument. In our example, R finds fastest inside m111survey and evaluates the expression on that basis. If it had not found fastest in m111survey, R would have moved on to the Global Environment and then the rest of the usual search path and (probably) would have found nothing, causing it to throw an “object not found” error message. In R, as in any other programming language, good programming depends very much on paying attention to how the language searches for the objects to which names refer.
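The search order described above can be tried out on a small scale. The sketch below uses a made-up data frame (survey) and two made-up variables defined outside it (fastest and threshold); none of these names come from the bcscr package. One name exists in both places and one exists only outside the frame.

```r
# A tiny stand-in for a survey data frame (values invented)
survey <- data.frame(fastest = c(119, 160, 190))

# Two variables defined outside the frame
fastest <- c(1, 2, 3)   # same name as a column of survey
threshold <- 150        # no column of survey has this name

# Inside with(), fastest is found in survey first, so the outside
# vector c(1, 2, 3) is ignored. threshold is not a column of survey,
# so R falls back to the calling environment and finds 150 there.
fromFrame <- with(survey, fastest[fastest >= threshold])
fromFrame   # 160 190
```

Having a variable outside the frame that shares a name with one of its columns is a common source of confusion, so it is worth knowing which one with() will pick.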
7.4.3 Factors
Some of the variables in m111survey are called factors; an example is seat, which pertains to where one prefers to sit in a classroom:
str(m111survey$seat)
Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
Seating preference is an example of a categorical variable: one whose values are not meaningfully expressed in terms of numbers. When a categorical variable has a relatively small number of possible values, it can be convenient to store its values in a vector of class factor.
The levels of a factor variable are its possible values. In the case of seat, these are front, middle and back (stored as "1_front", "2_middle" and "3_back"). As a memory-saving measure, R stores the values in the factor as numbers, where 1 stands for the first level, 2 for the second level, and so on. But please bear in mind that we are dealing with a categorical variable, so the numbers don’t relate to the possible values in any natural way: they are just storage conventions.
It’s possible to create a factor from any type of vector, but most often this is done with a character vector. Suppose for instance, that eight people are asked for their favorite Wizard of Oz character and they answer:
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")
We can create a factor variable as follows:
factorFavs <- factor(ozFavs)
factorFavs
[1] Glinda Toto Toto Dorothy Toto Glinda Scarecrow
[8] Dorothy
Levels: Dorothy Glinda Scarecrow Toto
Note that the levels are given in alphabetical order: this is the default procedure when R creates a factor. It is possible to ask for a different order, though:
factor(ozFavs, levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
[1] Glinda Toto Toto Dorothy Toto Glinda Scarecrow
[8] Dorothy
Levels: Toto Scarecrow Glinda Dorothy
In many instances it is appropriate to convert a character vector to a factor, but sometimes this is not such a great idea. Consider something like your address, or your favorite inspirational quote: pretty much every person in a study will have a different address or favorite quote than others in the study. Hence there won’t be any memory-storage benefit associated with creating a factor: the vector of levels—itself a character vector—would require as much storage space as the original character vector itself! In addition, we will see that the status of a variable as class “factor” can affect how R’s statistical and graphical functions deal with it. It’s not a good idea to treat a categorical variable as a factor unless its set of possible values is considered important.
We will think more about how to deal with factor variables later on, when we begin data analysis in earnest.
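In the meantime, the storage convention described above can be inspected directly. The sketch below reuses the Wizard of Oz example; the names codes and recovered are chosen for this illustration. It pulls out the levels and the underlying integer codes, then uses the codes to look the original values back up.

```r
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")
factorFavs <- factor(ozFavs)

levels(factorFavs)   # "Dorothy" "Glinda" "Scarecrow" "Toto"

# The integer codes: each value is stored as the position of its level
codes <- as.integer(factorFavs)
codes                # 2 4 4 1 4 2 3 1

# Indexing the levels by the codes recovers the original values
recovered <- levels(factorFavs)[codes]
identical(recovered, ozFavs)   # TRUE

# A tally of how often each level occurs
table(factorFavs)
```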
7.4.4 Practice Exercises
- How would you learn more about the data frame railtrail from the bcscr package?
- Write a one-line command to see the first 10 rows of railtrail in the Console.
- Write a one-line command to get the names of all of the variables in railtrail.
- Regarding railtrail: write a one-line command to get the high temperature on all the days when the precipitation was more than 0.5 inches.
- Regarding railtrail: write a one-line command to sort the average temperatures from highest to lowest.
7.4.5 Solutions to the Practice Exercises
- One way is to attach the package, then ask for help:
library(bcscr)
help(railtrail)
Another way is to name the package in the call to help():
help("railtrail", package = "bcscr")
That way you don’t have to add all the items in bcscr to your search path.
- Here’s one way:
head(bcscr::railtrail, n = 10)
- It’s names(bcscr::railtrail).
- Here’s one way:
with(bcscr::railtrail, hightemp[precip > 0.5])
- Try this:
sort(bcscr::railtrail$avgtemp, decreasing = TRUE)
7.5 Creating Data Frames
There are many ways to create data frames in R. Here we will introduce just two ways.
7.5.1 Creation from Vectors
Whenever you have vectors of the same length, you can combine them into a data frame, using the data.frame() function:
n <- c("Dorothy", "Lion", "Scarecrow")
h <- c(58, 75, 69)
a <- c(12, 0.04, 18)
ozFolk <- data.frame(name = n, height = h, age = a)
ozFolk
name height age
1 Dorothy 58 12.00
2 Lion 75 0.04
3 Scarecrow 69 18.00
Note that at the time of creation you can provide the variables with any names that you like. If later on you change your mind about the names, you can always revise them:
names(ozFolk)
[1] "name" "height" "age"
names(ozFolk)[2] <- "Height" # "height" was at index 2
ozFolk
name Height age
1 Dorothy 58 12.00
2 Lion 75 0.04
3 Scarecrow 69 18.00
7.5.2 Creation From Other Frames
If two frames have the same number of rows, you may combine their columns to form a new frame with the cbind() function:
ozMore <- data.frame(
  color = c("blue", "red", "yellow"),
  desire = c("Kansas", "courage", "brains")
)
cbind(ozFolk, ozMore)
name Height age color desire
1 Dorothy 58 12.00 blue Kansas
2 Lion 75 0.04 red courage
3 Scarecrow 69 18.00 yellow brains
Similarly if two data frames have the same columns (the same names, with compatible types) then we can use the rbind() function to combine them:
ozFolk2 <- data.frame(
  name = c("Toto", "Glinda"),
  Height = c(12, 66), age = c(3, 246)
)
rbind(ozFolk, ozFolk2)
name Height age
1 Dorothy 58 12.00
2 Lion 75 0.04
3 Scarecrow 69 18.00
4 Toto 12 3.00
5 Glinda 66 246.00
Note: cbind() and rbind() work for matrices, too.
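Here is a brief sketch of the matrix case, using two small matrices whose names (top, bottom, stacked, sideBySide) are invented for this illustration. rbind() stacks them one above the other, so they need the same number of columns; cbind() sets them side by side, so they need the same number of rows.

```r
top <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns
bottom <- matrix(7:12, nrow = 2)  # 2 rows, 3 columns

stacked <- rbind(top, bottom)     # rows of bottom go under rows of top
dim(stacked)                      # 4 3

sideBySide <- cbind(top, bottom)  # columns of bottom go to the right
dim(sideBySide)                   # 2 6
```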
7.6 Subsetting Data Frames
Our study of sub-setting matrices can be applied to the selection of parts of a data frame. As with a matrix, one or both of the dimensions of the frame can come into play.
We can create a new data frame consisting of any columns we like from the original frame:
df <- m111survey[, c("height", "ideal_ht")]
head(df)
height ideal_ht
1 76.0 78
2 74.0 76
3 64.0 NA
4 62.0 65
5 72.0 72
6 70.8 NA
If we select just one column, then the result is a vector rather than a data frame:
df <- m111survey[, "height"]
is.vector(df)
[1] TRUE
If for some reason you want to prevent this, set drop to FALSE:
df <- m111survey[, "height", drop = FALSE]
head(df)
height
1 76.0
2 74.0
3 64.0
4 62.0
5 72.0
6 70.8
You may select particular rows, too:
m111survey[10:15, c("height", "ideal_ht")]
height ideal_ht
10 67 67
11 65 69
12 62 62
13 59 62
14 78 75
15 69 72
You can even select some of the rows at random. Here is a random sample of size six:
n <- nrow(m111survey)
df <- m111survey[sample(1:n, size = 6, replace = FALSE), ]
df[, c("sex", "seat")] # show just two columns
sex seat
13 female 1_front
54 male 2_middle
56 male 3_back
28 female 1_front
53 female 3_back
46 female 2_middle
Note the function nrow() that gives the number of rows of the frame. When we sample six items without replacement from the vector 1:n, we are picking six numbers at random from the row-numbers of the frame. Specifying these six numbers in the selection operator [ yields the desired random sample of rows.
7.6.1 Boolean Expressions
It is especially common to select rows by the values of a logical vector. For example, to select the rows where the fastest speed ever driven is at least 150 miles per hour, try this:
df <- m111survey[m111survey$fastest >= 150, ]
df[, c("sex", "fastest")] # show just two of the variables
sex fastest
8 male 160
32 male 190
When you are selecting rows it can be convenient to use the subset() function. The first argument to the function is the frame from which you plan to select, and the second is the Boolean expression by which to select:
df <- subset(m111survey, fastest >= 150)
df[, c("sex", "fastest")]
sex fastest
8 male 160
32 male 190
Note that we did not need to type m111survey$fastest: the first argument to subset() provides the environment in which to search for names that appear in the Boolean expression.
The Boolean sub-setting expressions can be quite complex:
df <- subset(m111survey, seat == "3_back" & height < 72 & sex == "female")
df[, c("sex", "height", "seat")]
sex height seat
9 female 59 3_back
20 female 65 3_back
30 female 69 3_back
53 female 69 3_back
70 female 65 3_back
Note: subset() takes a third parameter called select that allows you to pick out any desired columns. For example:
subset(m111survey, seat == "3_back" & height < 72 & sex == "female",
select = c("sex", "height", "seat"))
sex height seat
9 female 59 3_back
20 female 65 3_back
30 female 69 3_back
53 female 69 3_back
70 female 65 3_back
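The two styles of row selection differ in one respect that is easy to overlook: how they treat missing values. The sketch below uses a small made-up data frame (speeds, with invented values) in which one speed is NA. Selecting with [ keeps a row of NAs wherever the condition is NA, whereas subset() leaves such rows out. Wrapping the condition in which() is one way to get the second behavior with [.

```r
# A made-up frame in which one speed is missing
speeds <- data.frame(
  id = 1:4,
  fastest = c(160, NA, 120, 190)
)

# The condition evaluates to TRUE, NA, FALSE, TRUE
byBracket <- speeds[speeds$fastest >= 150, ]
nrow(byBracket)   # 3: two matching rows plus one row filled with NA

bySubset <- subset(speeds, fastest >= 150)
nrow(bySubset)    # 2: the row with the NA condition is left out

# which() returns only the positions where the condition is TRUE
byWhich <- speeds[which(speeds$fastest >= 150), ]
nrow(byWhich)     # 2
```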
7.6.2 Practice Exercises
We’ll use the CPS85 data frame from the mosaicData package. You should go ahead and load the package and then read about the data frame:
library(mosaicData)
?CPS85
Each row in the data frame corresponds to an employee in the survey.
- Write a command that gives the number of employees in the data frame.
- Select the employees who are between 40 and 50 years old.
- Select the employees who are married and have fewer than 30 years of experience.
- Select the nonunion employees who either live in the South or who have more than 12 years of education (or both).
- Select the employees who work in the clerical, construction, management or professional sector.
- Select the employees who make more than 30 dollars per hour, and keep only their wage, sex and sector of employment.
- Select 10 employees at random, keeping only their wage and sex.
- Select all of the employees, keeping all information about them except for their union status and whether or not they are from the South.
7.6.3 Solutions to Practice Exercises
- The command is nrow(CPS85).
- Try this:
subset(CPS85, age > 40 & age < 50)
- Try this:
subset(CPS85, married == "Married" & exper < 30)
- Try this:
subset(CPS85, union == "Not" & (south == "S" | educ > 12))
- Try this:
subset(CPS85, sector %in% c("clerical", "construction",
"management", "professional"))
- Try this:
CPS85[CPS85$wage > 30, c("wage", "sex", "sector")]
- Try this:
CPS85[sample(1:nrow(CPS85), size = 10, replace = FALSE),
      c("wage", "sex")]
- Try this (south and union are columns 6 and 9, respectively):
CPS85[ , -c(6, 9)]
The select parameter of the subset() function has a little-known feature that allows you to specify columns to omit by name, so the following is another solution:
subset(CPS85, select = -c(south, union))
7.7 New Variables from Old
Quite often you will want to transform one or more variables in a data frame. Transforming a variable means changing its values in a systematic way.
For example, you might want to measure height in feet rather than inches. Then you want the following:
heightInFeet <- with(m111survey, height/12) # 12 inches in a foot
If you plan to use this new variable in your analysis later on, it might be a good idea to add it to the data frame:
m111survey$height_ft <- heightInFeet
Another common need is to recode the values of a categorical variable. For example, you might want to divide people into two groups: those who prefer to sit in the back and those who don’t. This is a good time to use ifelse():
seat2 <- ifelse(m111survey$seat == "3_back", "Back", "Other")
m111survey$seat2 <- seat2
If you plan to re-code into a variable that involves more than two values, then you might want to look into the mapvalues() function from the plyr package (Wickham 2023):
seat3 <- plyr::mapvalues(
  m111survey$seat,
  from = c("1_front", "2_middle", "3_back"),
  to = c("Front", "Middle", "Back")
)
str(seat3)
Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...
Another common transformation involves turning a numerical variable into a factor. For example, we might need to classify people as:
- Tall (height over 70 inches)
- Medium (65 - 70 inches)
- Short (less than 65 inches)
The cut() function will be helpful.
Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 2 3 1 2 ...
Setting right = TRUE indicates that the upper bound of each interval is included in the interval. Thus, a person with a height of 70 inches is classed as Medium, not Tall.
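The call that produced the output above is not shown here, so the following is only a minimal sketch of how cut() could be used for this kind of classification. The heights vector is made up for illustration (it is not taken from m111survey), and the break points are an assumption based on the three categories listed above.

```r
# hypothetical heights in inches (not from m111survey)
heights <- c(60, 65, 68, 70, 72)

# right = TRUE: each interval includes its upper bound,
# so the intervals are (-Inf, 65], (65, 70], (70, Inf)
heightClass <- cut(
  heights,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium", "Tall"),
  right = TRUE
)
as.character(heightClass)
# "Short" "Short" "Medium" "Medium" "Tall"

# right = FALSE: each interval includes its lower bound instead,
# so the intervals are [-Inf, 65), [65, 70), [70, Inf)
heightClass2 <- cut(
  heights,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium", "Tall"),
  right = FALSE
)
as.character(heightClass2)
# "Short" "Medium" "Medium" "Tall" "Tall"
```

With these breaks and right = TRUE, a height of exactly 70 lands in Medium and a height of exactly 65 lands in Short. If boundary values should go to the upper group instead, right = FALSE is the setting to try.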
7.7.1 Getting Rid of Variables
We have added several variables to m111survey. In order to remove them (or any other variables we don’t want) we can assign them the value NULL.
names(m111survey)
[1] "height" "ideal_ht" "sleep" "fastest"
[5] "weight_feel" "love_first" "extra_life" "seat"
[9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
[13] "height_ft" "seat2"
m111survey$height_ft <- NULL
m111survey$seat2 <- NULL
m111survey$seat3 <- NULL
names(m111survey) # the extra variables are gone
[1] "height" "ideal_ht" "sleep" "fastest"
[5] "weight_feel" "love_first" "extra_life" "seat"
[9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
7.7.2 Practice Exercises
- Remove the variables hispanic and married from the mosaicData::CPS85 data frame.
- Change the units of wage in mosaicData::CPS85 from dollars per hour to dollars per day. Assume an eight-hour working day.
- For CPS85, create a new variable experGrp that has the following values:
  - low for experience less than 10 years;
  - medium for experience of at least 10 years but less than 25 years;
  - high for experience at least 25 years.
- Using the experGrp variable in the previous exercise, create the following tally of the employees by experience group:
experGrp
low medium high
179 217 138
- You’ve made some changes to CPS85, but in fact you haven’t changed the original data frame in the mosaicData package; you’ve simply made your own copy, which should now be in your Global Environment. Since the Global Environment comes before any package on your search path, if you want to get to the original CPS85 you will have to refer to it as mosaicData::CPS85. Another option, though, is to remove the modified copy from your Global Environment. Go ahead and remove it now.
7.7.3 Solutions to Practice Exercises
- Here’s one way to do it:
CPS85$hispanic <- NULL
CPS85$married <- NULL
- Here’s one way to do it:
CPS85$wage <- CPS85$wage * 8
- Here’s one way to do it:
CPS85$experGrp <- cut(
  CPS85$exper,
  breaks = c(-Inf, 10, 25, Inf),
  labels = c("low", "medium", "high")
)
- Use table(CPS85$experGrp).
- Here’s what to do:
rm(CPS85)
7.8 More in Depth
7.8.1 Matrix Multiplication
This section may interest you if you know about matrix multiplication in linear algebra.
In order to accomplish matrix multiplication, we have to keep in mind that the regular multiplication operator * works element-wise on matrices, as we have already seen. For matrix multiplication R provides the special operator %*%. For example, consider the following matrices:
a <- matrix(1:6, ncol = 3)
a
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
b <- matrix(c(2, 1, -1), nrow = 3)
b
[,1]
[1,] 2
[2,] 1
[3,] -1
Observe that the number of columns of a is equal to the number of rows of b. Hence it is possible to form the matrix product a %*% b:
a %*% b
[,1]
[1,] 0
[2,] 2
As expected, the result is a matrix having as many rows as a and as many columns as b.
It is also interesting to recall how matrix multiplication works when the second matrix has only one column. The product is obtained by multiplying each column of a by the element in the corresponding row of b, and adding the resulting matrices:
b[1,1]*a[ ,1, drop = FALSE] + b[2,1]*a[ ,2, drop = FALSE] + b[3,1]*a[ ,3, drop = FALSE]
[,1]
[1,] 0
[2,] 2
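To see the two operators side by side, here is a minimal sketch. The matrices p and q are hypothetical 2 x 2 examples chosen for illustration; they do not appear elsewhere in the chapter.

```r
# p and q are made-up 2 x 2 matrices
p <- matrix(1:4, nrow = 2)            # columns are (1, 2) and (3, 4)
q <- matrix(c(0, 1, 1, 0), nrow = 2)  # swaps the two columns of p under %*%

# element-wise product: each entry of p times the matching entry of q
elementwise <- p * q

# matrix product: rows of p combined with columns of q
product <- p %*% q

elementwise
product
```

Here q was picked so that p %*% q simply swaps the two columns of p, which makes it easy to check by eye that %*% is doing something quite different from *.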
7.8.2 Ordering Data Frames
You can reorder as well as select. For example, the following code takes the first five rows of m111survey (keeping only two columns) and reverses them:
df <- m111survey[, c("height", "ideal_ht")]
dfRev <- df[5:1, ]
head(dfRev)
height ideal_ht
5 72 72
4 62 65
3 64 NA
2 74 76
1 76 78
If you want, you can even scramble the rows of the data frame in a random order:
n <- nrow(m111survey)
shuffle <- sample(1:n, size = n, replace = FALSE)
df <- m111survey[shuffle, ]
head(df[c("sex", "seat")]) # show just two columns
sex seat
25 female 2_middle
51 female 2_middle
69 female 1_front
52 female 2_middle
64 male 3_back
13 female 1_front
It is quite common to order the rows of a frame according to the values of a particular variable. For example, you might want to arrange the rows by height, so that the frame begins with the shortest subject and ends with the tallest.
Accomplishing this task requires a study of R’s order() function. Consider the following vector:
vec <- c(15, 12, 23, 7)
Call order() with this vector as an argument:
order(vec)
[1] 4 2 1 3
order() returns the indices of the elements of vec, in the following order:
- the index of the smallest element (7, at index 4 of vec);
- the index of the second-smallest element (12, at index 2 of vec);
- the index of the third-smallest element (15, at index 1 of vec);
- the index of the largest element (23, at index 3 of vec).
Can you guess the output of the following function-call without looking for the answer underneath?
vec[order(vec)]
[1] 7 12 15 23
Sure enough, the result is vec sorted: from smallest to largest element.
Now the sorting of vec could have been accomplished with R’s sort() function:
sort(vec)
[1] 7 12 15 23
The power of order() comes with the rearrangement of rows of a data frame. In order to “sort” the frame from shortest to tallest subject, call:
df <- m111survey[order(m111survey$height), ]
head(df[, c("sex", "height")]) # to show that it worked
sex height
45 female 51
26 female 54
9 female 59
13 female 59
40 female 60
69 female 61
If you want to order the rows from tallest to shortest instead, then use the decreasing parameter, which by default is FALSE:
df <- m111survey[order(m111survey$height, decreasing = TRUE), ]
head(df[, c("sex", "height")]) # to show that it worked
sex height
8 male 79
14 female 78
1 male 76
58 male 76
34 male 75
54 male 75
Sometimes you want to order by two or more variables. For example suppose you want to arrange the frame so that the folks preferring to sit in front come first, followed by the people who prefer the middle and ending with the people who prefer the back. Within these groups you would like people to be arranged from shortest to tallest. Then call:
ordering <- with(m111survey, order(seat, height))
df <- m111survey[ordering, ]
head(df[, c("seat", "height")], n = 10) # see if it worked
seat height
45 1_front 51
26 1_front 54
13 1_front 59
69 1_front 61
4 1_front 62
12 1_front 62
23 1_front 63
38 1_front 63
61 1_front 63
57 1_front 64
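Ordering by two variables can be easier to follow on a frame small enough to check by hand. The sketch below uses a hypothetical data frame called toy (made up for illustration) and also shows one common way to reverse the direction of a numeric tie-breaker, namely negating it.

```r
# toy is a made-up data frame with four rows
toy <- data.frame(
  group = c("b", "a", "b", "a"),
  score = c(3, 9, 1, 4),
  stringsAsFactors = FALSE
)

# order by group first, then by score (low to high) within each group
idx <- with(toy, order(group, score))
idx
# 4 2 3 1
toy_sorted <- toy[idx, ]

# same grouping, but score from high to low within each group:
# negating a numeric variable reverses its order
idx_desc <- with(toy, order(group, -score))
idx_desc
# 2 4 1 3
```

Negation only works for numeric variables. For other types, the decreasing parameter of order() is the tool to look into.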
7.8.3 Combining With rbind() and cbind()
If two matrices have the same number of rows, then you can bind their columns together to create a new matrix, using the cbind() function:
lowercase <- matrix(letters, nrow = 13)
uppercase <- matrix(LETTERS, nrow = 13)
both_cases <- cbind(lowercase, uppercase)
both_cases
[,1] [,2] [,3] [,4]
[1,] "a" "n" "A" "N"
[2,] "b" "o" "B" "O"
[3,] "c" "p" "C" "P"
[4,] "d" "q" "D" "Q"
[5,] "e" "r" "E" "R"
[6,] "f" "s" "F" "S"
[7,] "g" "t" "G" "T"
[8,] "h" "u" "H" "U"
[9,] "i" "v" "I" "V"
[10,] "j" "w" "J" "W"
[11,] "k" "x" "K" "X"
[12,] "l" "y" "L" "Y"
[13,] "m" "z" "M" "Z"
If two matrices have the same number of columns, then you can bind their rows together, with rbind():
lowercase2 <- matrix(letters, ncol = 13)
uppercase2 <- matrix(LETTERS, ncol = 13)
both_cases2 <- rbind(lowercase2, uppercase2)
both_cases2
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a" "c" "e" "g" "i" "k" "m" "o" "q" "s" "u" "w" "y"
[2,] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
[3,] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"
[4,] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"
rbind() and cbind() work with data frames, too. Here, we use rbind() to add a new row to a data frame:
n <- c("Dorothy", "Scarecow", "Lion")
h <- c(58, 75, 69)
a <- c(12, 0.04, 18)
oz_folk <- data.frame(name = n, height = h, age = a)
one_more_person <- data.frame(
  name = "Tin Man",
  height = 72,
  age = 24
)
all_together <- rbind(oz_folk, one_more_person)
all_together
name height age
1 Dorothy 58 12.00
2 Scarecow 75 0.04
3 Lion 69 18.00
4 Tin Man 72 24.00
We can add new columns as well:
new_properties <- data.frame(
  desire = c("Kansas", "brains", "courage", "a heart"),
  fav_color = c("crimson", "blue", "burlywood", "orange")
)
cbind(all_together, new_properties)
name height age desire fav_color
1 Dorothy 58 12.00 Kansas crimson
2 Scarecow 75 0.04 brains blue
3 Lion 69 18.00 courage burlywood
4 Tin Man 72 24.00 a heart orange
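One detail worth knowing when binding rows of data frames: rbind() lines the columns up by name, not by position. The following is a small sketch with two hypothetical one-row frames, first and second, whose columns are deliberately listed in different orders.

```r
# two made-up one-row data frames with the same column names
# but listed in a different order
first <- data.frame(name = "Dorothy", height = 58,
                    stringsAsFactors = FALSE)
second <- data.frame(height = 75, name = "Scarecrow",
                     stringsAsFactors = FALSE)

# the result takes its column order from the first frame;
# the values of second are matched to columns by name
combined <- rbind(first, second)
names(combined)
# "name" "height"
combined$name
# "Dorothy" "Scarecrow"
```

The column names do have to agree between the frames. If one frame calls the variable height and the other calls it something else, rbind() stops with an error rather than guessing.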
7.8.4 Practice Exercises
- Consider the following vector:
creatures <- c("Mole", "Frog", "Rat", "Badger")
Write down what you think will be the result of the call:
order(creatures)
Then check your answer by actually running:
creatures <- c("Mole", "Frog", "Rat", "Badger")
order(creatures)
- What will be the result of the following?
order(creatures, decreasing = TRUE)
- Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (less experience coming before more experience).
- Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (more experience coming before less experience).
- Review the all_walk() function from Section @ref(nested-loops). Write a function called all_walk_df() that, instead of returning the total number of flowers picked, returns a data frame that records the sequence of flowers picked by each person. You may omit the option for a report along the way. Recall that the colors of the flowers in the field were:
flower_colors <- c("blue", "red", "pink", "crimson", "orange")
A typical example of use would be:
all_walk_df(
people = c("Dorothy", "Scarecrow"),
favs = c("crimson", "blue"),
numbers = c(2, 1)
)
name flower
1 Dorothy orange
2 Dorothy crimson
3 Dorothy crimson
4 Scarecrow red
5 Scarecrow blue
7.8.5 Solutions to Practice Exercises
- Here’s what you get:
order(creatures)
[1] 4 2 1 3
- Here’s what you get:
order(creatures, decreasing = TRUE)
[1] 3 1 2 4
- Here is one way:
CPS85[order(CPS85$wage, CPS85$exper), ]
- Here is one way:
CPS85[order(CPS85$wage, CPS85$exper,
            decreasing = c(FALSE, TRUE)), ]
- Try this:
## helper-function to make df for one person:
walk_meadow_df <- function(person, color, wanted) {
  picking <- TRUE
  ## the following will be extended to hold the flowers picked:
  flowers_picked <- character()
  desired_count <- 0
  while (picking) {
    picked <- sample(flower_colors, size = 1)
    flowers_picked <- c(flowers_picked, picked)
    if (picked == color) desired_count <- desired_count + 1
    if (desired_count == wanted) picking <- FALSE
  }
  ## return the data frame:
  data.frame(
    name = rep(person, times = length(flowers_picked)),
    flower = flowers_picked
  )
}

all_walk_df <- function(people, favs, numbers) {
  ## start with a data frame with 0 rows
  ## and columns named correctly:
  df <- data.frame(
    name = character(),
    flower = character()
  )
  for (i in 1:length(people)) {
    person <- people[i]
    fav <- favs[i]
    number <- numbers[i]
    person_df <- walk_meadow_df(
      person = person,
      color = fav,
      wanted = number
    )
    ## extend df:
    df <- rbind(df, person_df)
  }
  ## return the complete data frame:
  df
}
The Main Ideas of This Chapter
- Matrices are atomic vectors, with two additional attributes: number of rows, and number of columns.
- Since matrices are vectors, you can subset them with the [-operator. You just have to account for rows and columns with a separating comma (e.g., myMatrix[3, 5]).
- If you subset a matrix to get just one row or one column, then the result is “dropped” to an ordinary vector, unless you set drop to FALSE.
- Arithmetic operations work pairwise on matrices, just like they do on vectors.
- Like matrices, data frames are two dimensional, but their columns do not have to be all the same type of atomic vector.
- You can access a column in a data frame with the $-operator (e.g., m111survey$fastest).
- Subsetting data frames can be done with the [-operator, just like matrices.
- You can also subset data frames with the subset() function.
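As a quick reminder of the point about dropping, here is a minimal sketch on a small made-up matrix m (not one used earlier in the chapter):

```r
# m is a hypothetical 2 x 3 matrix
m <- matrix(1:6, nrow = 2)

col_vec <- m[, 2]                # dropped to an ordinary vector
col_mat <- m[, 2, drop = FALSE]  # stays a matrix with one column

is.matrix(col_vec)
# FALSE
dim(col_mat)
# 2 1
```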
Glossary
- Matrix: An atomic vector that has two additional attributes: a number of rows and a number of columns.
- Data Frame: A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.
- Case (also called an Individual): An individual unit under study. In a data frame in R, the rows correspond to cases.
- Variable (in Data Analysis): In data analysis, a variable is a measurement made on the individuals in a study.
- Categorical Variable (in Data Analysis): In data analysis, a categorical variable is a variable whose values cannot be expressed meaningfully by numbers.
Exercises
Exercise 1
R has a function called t() that computes the transpose of a given matrix. This means that it switches around the rows and columns of the matrix, like this:
myMatrix <- matrix(1:24, nrow = 6)
myMatrix
[,1] [,2] [,3] [,4]
[1,] 1 7 13 19
[2,] 2 8 14 20
[3,] 3 9 15 21
[4,] 4 10 16 22
[5,] 5 11 17 23
[6,] 6 12 18 24
t(myMatrix)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[2,] 7 8 9 10 11 12
[3,] 13 14 15 16 17 18
[4,] 19 20 21 22 23 24
Write your own function called transpose() that will perform the same task on any given matrix. The function should take a single parameter called mat, the matrix to be transposed. Of course you may NOT use t() in the code for your function!
Hint: Let’s solve the problem in a general way, on an example.
First, we set up an example, naming it mat because that’s the required name of the parameter in the function we are supposed to write:
mat <- matrix(1:12, nrow = 2)
Here is mat:
mat
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 3 5 7 9 11
[2,] 2 4 6 8 10 12
Next, we break mat down into just the vector of its elements:
elements <- as.vector(mat)
Let’s take a look at the elements:
elements
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Recall that our target is this matrix:
t(mat)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
So we want to put the elements back into a matrix that has six rows and two columns. We need to do this in a general way:
matrix(elements, nrow = ncol(mat))
[,1] [,2]
[1,] 1 7
[2,] 2 8
[3,] 3 9
[4,] 4 10
[5,] 5 11
[6,] 6 12
This got the right number of rows and columns, but the elements need to be filled in across rows, not down columns, so instead let’s try:
matrix(elements, nrow = ncol(mat), byrow = TRUE)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
That worked!
So after we set up the example mat, the “work” we need to do is:
elements <- as.vector(mat)
matrix(elements, nrow = ncol(mat), byrow = TRUE)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
You take it from here: encapsulate this work into the required function, and test it on some examples.
Exercise 2
R has functions called rowSums() and colSums() that will respectively sum the rows and the columns of a matrix. Here is an example:
myMatrix <- matrix(1:24, nrow = 6)
rowSums(myMatrix)
[1] 40 44 48 52 56 60
Your task is to write your own function called dimSum() that will sum either the rows or the columns of a given matrix. The function should have two parameters:
- mat: the matrix to be summed.
- dim: the dimension to sum along, either rows or columns. The default value should be "rows". If the user sets dim to "columns" then the function would compute the column-sums.
You may NOT use rowSums() or colSums() in the code for your function. A typical example of use should look like this:
myMatrix <- matrix(1:24, nrow = 6)
dimSum(myMatrix)
[1] 40 44 48 52 56 60
dimSum(myMatrix, "columns")
[1] 21 57 93 129
Hint: Recall that in the practice exercises (Section 7.2.2) we made a function called myRowSums() that sums the rows of any given matrix. Modify the idea for myRowSums() to write a function called myColSums() that finds the column-sums of any given matrix. You may then use the two previously-created functions to write the required function dimSum().
Exercise 3
Starting with m111survey in the bcscr package, write the code necessary to create a new data frame called smaller that consists precisely of the male students who believe in extraterrestrial life and who are more than 68 inches tall. The new data frame should contain all of the original variables except for sex and extra_life.
Exercise 4
Write a function called dfRandSelect() that randomly selects (without replacement) a specified number of rows from a given data frame. The function should have two parameters:
- df: the data frame from which to select;
- n: the number of rows to select.
If n is greater than the number of rows in df, the function should return immediately with a message informing the user that the required task is not possible and informing him/her of the number of rows in df. Typical examples of use should be as follows:
dfRandSelect(bcscr::fuel, 5)
speed efficiency
12 120 9.87
15 150 12.83
7 70 6.30
6 60 5.90
8 80 6.95
dfRandSelect(bcscr::fuel, 200)
No can do! The frame has only 15 rows.
Hint: Use the function nrow(), which gives the number of rows of a matrix or data frame.
Exercise 5*
Create your own data frame, named myFrame. The frame should have 100 rows, along with the following variables:
- lowerLetters: a character vector of randomly-produced 3-letter strings, like “chj”, “bbw”, and so on. The letters should all be lowercase.
- height: a numerical vector consisting of real numbers chosen randomly between the values of 60 and 75.
- sex: a factor whose possible values are “female” and “male”. Again, these values should be chosen randomly.
A call to str(myFrame) would come out like this (although your results will vary a bit since the vectors are constructed randomly):
str(myFrame)
'data.frame': 100 obs. of 3 variables:
$ lowerLetters: chr "usu" "uhl" "xyj" "uyd" ...
$ height : num 73.7 72.4 73.8 65.2 61.3 ...
$ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 2 1 1 1 ...
summary() is useful when working with data frames. Here is how a call to summary(myFrame) might look:
summary(myFrame)
lowerLetters height sex
Length:100 Min. :60.00 female:57
Class :character 1st Qu.:63.63 male :43
Mode :character Median :68.28
Mean :67.62
3rd Qu.:71.63
Max. :74.57
Hint: If you have a vector of three letters, such as
vec <- c("g", "a", "r")
then you can paste them together as follows:
paste0(vec, collapse = "")
[1] "gar"
Exercise 6*
Study the data frame fuel in the bcscr package. Note that the fuel efficiency is reported as the number of liters of fuel required to travel 100 kilometers. Look up the conversion between gallons and liters and between kilometers and miles, and use this information to create a new variable called mpg that gives the fuel efficiency as miles per gallon. While you are at it, create a new variable mph that gives the speed in miles per hour. Finally, add these new variables to the fuel data frame.
Exercise 7*
Use matrices to generalize the simulation in the Appeals Court Paradox (see Section 6.6). Your goal is to write a simulation function called appealsSimPlus() that comes with all the options provided in the text, but with additional parameters so that the user can choose:
- the number of judges on the court;
- the probability for each judge to make a correct decision;
- the voting pattern (how many votes each judge gets).
A typical call to the function should look like this:
appealsSimPlus(
reps = 10000,
seed = 5252,
probs = c(0.95, 0.90, 0.90, 0.90, 0.80),
votes = c(2, 1, 1, 1, 0)
)
In the above call the court consists of five judges. The best one decides cases correctly 95% of the time, three are right 90% of the time and one is right 80% of the time. The voting arrangement is that the best judge gets two votes, the next three get one vote each, and the worst gets no vote. Any voting scheme—even a scheme involving fractional votes—should be allowed so long as the votes add up to the number of judges.
Here is a hint. When you write the function it may be helpful to use the fact that rbinom() can take a prob parameter that is a vector of any length. Here’s an example:
results <- rbinom(6, size = 100, prob = c(0.10, 0.50, 0.90))
results
[1] 20 49 94 15 50 88
The first and fourth entries simulate a person tossing a coin 100 times when she has only a 10% chance of heads on each toss. The second and fifth entries simulate the same, when the chance of heads is 50%. The third and sixth simulate coin-tossing when there is a 90% chance of heads.
If you would like to arrange the results more nicely—say in a matrix where each column gives the results for a different person—you can do so:
resultsMat <- matrix(results, ncol = 3, byrow = TRUE)
resultsMat
[,1] [,2] [,3]
[1,] 20 49 94
[2,] 15 50 88
Of course judges don’t flip a coin 100 times, they decide one case at a time. Suppose you have five judges with probabilities as follows:
probCorrect <- c(0.95, 0.90, 0.90, 0.90, 0.80)
If you would like to simulate the judges deciding, say, 6 cases, try this:
results <- rbinom(5*6, size = 1, prob = rep(probCorrect, 6))
resultsMat <- matrix(results, nrow = 6, byrow = TRUE)
resultsMat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 0 1
[2,] 0 1 1 1 1
[3,] 1 1 1 1 1
[4,] 1 1 1 1 1
[5,] 1 1 1 1 1
[6,] 1 1 1 1 0
When it comes to applying the voting pattern to compute the decision in each case, consider matrix multiplication. For example, suppose that the pattern is:
votes <- c(2, 1, 1, 1, 0)
Then make votes a one-column matrix and perform matrix multiplication:
correctVotes <- resultsMat %*% matrix(votes, nrow = 5)
correctVotes
[,1]
[1,] 4
[2,] 3
[3,] 5
[4,] 5
[5,] 5
[6,] 5
Think about how to encapsulate all of this into a nice, general simulation function.
Domain-specific languages (DSLs for short) stand in contrast to general-purpose programming languages that were designed to solve a wide variety of problems. Examples of important general-purpose languages include C and C++, Java, Python and Ruby. Although R is by now one of the most widely-used DSLs in the world, there are a number of other important ones, including Matlab and Octave for scientific computing, Emacs Lisp for the renowned Emacs editor, and SQL for querying databases. JavaScript is an interesting case: it started out as a DSL for web browsers, but has since expanded to power many web applications and is now being used to develop desktop applications as well.
As an example outside of programming, consider what happens when you read a piece of literature “for structure.” You begin by asking: “What kind of literature is this? Is it drama, a novel, or something else?” The answer lets you know what to expect as you read: if it’s a novel, you know to suspend disbelief, whereas if it’s a journalistic piece then you know to examine critically whatever it presents as fact. Next, you might outline the piece. When you make an outline, you are breaking the piece up into parts, and indicating how the parts relate to each other to advance the plot and/or message of the piece. Note that in the process of “reading for structure” you are following the pattern of the definition of structure offered above.