7  Data Frames

Can one be a good data analyst without being a half-good programmer? The short answer to that is, ‘No.’ The long answer to that is, ‘No.’

—Frank Harrell

Up to this point we have given a great deal of attention to vectors, and we have always treated them as one-dimensional objects: a vector has a length, but not a “width.”

It is time to begin working in two dimensions. In this Chapter we will study matrices, which are simply vectors that have both length and width. Matrices are immensely useful for scientific computation in R, but for the most part we will treat them as a warm-up for data frames—the two-dimensional R-objects that are especially designed for the storage of data collected in the course of practical data analysis. Once you understand how to construct and manipulate data frames, you will be ready to learn how to visualize and analyze data using R.

7.1 Introduction to Matrices

In R, a matrix is actually an atomic vector—it can only hold one type of element—but with two extra attributes:

  • a certain number of rows, and
  • a certain number of columns.

One way to create a matrix is to take a vector and give it those two extra attributes, via the matrix() function. Here is an example:

numbers <- 1:24  # this is an ordinary atomic vector
numbersMat <- matrix(numbers, nrow = 6, ncol = 4)  # make a matrix
numbersMat
     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    2    8   14   20
[3,]    3    9   15   21
[4,]    4   10   16   22
[5,]    5   11   17   23
[6,]    6   12   18   24

Of course if you are making a matrix out of 24 numbers and you know that it’s going to have 6 rows, then you know it must have 4 columns. Similarly, if you know the number of columns then the number of rows is determined. Hence you could have constructed the matrix with just one of the row or column arguments, like this:

numbersMat <- matrix(numbers, nrow = 6)

Notice that the numbers went down the first column, then down the second, and so on. If you would rather fill up the matrix row-by-row, then set the byrow parameter, which is FALSE by default, to TRUE:

matrix(numbers, nrow = 6, byrow = TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20
[6,]   21   22   23   24

Sometimes we like to give names to our rows, or to our columns, or even to both:

rownames(numbersMat) <- letters[1:6]
colnames(numbersMat) <- LETTERS[1:4]
numbersMat
  A  B  C  D
a 1  7 13 19
b 2  8 14 20
c 3  9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24

Matrices don’t have to be numerical. They can be character or logical matrices as well:

creatures <- c(
  "Dorothy", "Lion", "Scarecrow", 
  "Oz", "Toto", "Boq"
)
matrix(creatures, ncol = 2)
     [,1]        [,2]  
[1,] "Dorothy"   "Oz"  
[2,] "Lion"      "Toto"
[3,] "Scarecrow" "Boq" 

If you have to spread out the elements of a matrix into a one-dimensional vector, you can do so:

as.vector(numbersMat)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
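
To see that a matrix really is a vector carrying extra dimension information, here is a minimal sketch. The names v and m are ours, chosen only for illustration; dim() and attributes() are base R functions. We attach the dimensions by hand instead of calling matrix():

```r
v <- 1:6            # an ordinary atomic vector: no dimensions yet
dim(v)              # NULL

m <- v              # make a copy, then attach dimensions by hand
dim(m) <- c(2, 3)   # 2 rows, 3 columns
m                   # now prints as a 2-by-3 matrix, filled column by column

attributes(m)       # the only attribute is $dim
is.matrix(m)        # TRUE
length(m)           # 6: it still has the length of the underlying vector
```

This suggests that matrix() is mostly a convenient way to set the dimensions (and the fill order) of a vector.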

7.1.1 Practice Exercises

Let’s work with the following vector:

dozen <- letters[1:12]
  1. Starting with dozen write a command that produces the following matrix:
     [,1] [,2] [,3] [,4]
[1,] "a"  "d"  "g"  "j" 
[2,] "b"  "e"  "h"  "k" 
[3,] "c"  "f"  "i"  "l" 
  2. Starting with dozen write a command that produces the following matrix:
     [,1] [,2] [,3]
[1,] "a"  "e"  "i"
[2,] "b"  "f"  "j"
[3,] "c"  "g"  "k"
[4,] "d"  "h"  "l"
  3. Starting with dozen write a command that produces the following matrix:
     [,1] [,2] [,3]
[1,] "a"  "b"  "c"
[2,] "d"  "e"  "f"
[3,] "g"  "h"  "i"
[4,] "j"  "k"  "l"
  4. Starting with dozen, write commands that produce the following matrix:
   c1  c2  c3
r1 "a" "b" "c"
r2 "d" "e" "f"
r3 "g" "h" "i"
r4 "j" "k" "l"
  5. Suppose you make the following matrix:
smallMat <- matrix(c(8, 5, 3, 4), nrow = 2)
smallMat
     [,1] [,2]
[1,]    8    3
[2,]    5    4

What’s a one-line command to get the following vector from smallMat?

[1] 8 5 3 4
  6. nrow() is a function that, when given a matrix, will tell you the number of rows in that matrix. Write a one-line command to find the number of rows in a matrix called mysteryMat.

  7. ncol() is a function that, when given a matrix, will tell you the number of columns in that matrix. Write a one-line command to find the number of columns in a matrix called mysteryMat.

7.1.2 Solutions to Practice Exercises

  1. Here’s one way to do it:
matrix(dozen, nrow = 3)

Here’s another way:

matrix(dozen, ncol = 4)
  2. Here’s one way to do it:
matrix(dozen, nrow = 4)
  3. Here’s one way to do it:
matrix(dozen, nrow = 4, byrow = TRUE)
  4. Here’s one way to do it:
answerMatrix <- matrix(dozen, nrow = 4, byrow = TRUE)
rownames(answerMatrix) <- c("r1", "r2", "r3", "r4")
colnames(answerMatrix) <- c("c1", "c2", "c3")
answerMatrix
  5. Here’s how:
as.vector(smallMat)
  6. The command nrow(mysteryMat) will work.

  7. The command ncol(mysteryMat) will work.

7.2 Matrix Indexing

Matrices are incredibly useful in data analysis, but the primary reason we are talking about them now is to get you used to working in two dimensions. Let’s practice sub-setting with matrices.

We use the sub-setting operator [ to pick out parts of a matrix. For example, in order to get the element in the second row and third column of numbersMat, ask for:

numbersMat[2,3]
[1] 14

The row and column numbers are called indices.

If we want the entire second row, then we could ask for:

numbersMat[2,1:4]
 A  B  C  D 
 2  8 14 20 

The result is a one-dimensional vector consisting of the elements in the second row of numbersMat. It inherits as its names the column names of numbersMat.

Actually, if you want the entire row you don’t have to specify which columns you want. Just leave the spot after the comma empty, like this:

numbersMat[2, ]
 A  B  C  D 
 2  8 14 20 

What if you want some items on the second row, but only the items in columns 1, 2 and 4? Then frame your request in terms of a vector of column-indices:

numbersMat[2, c(1, 2, 4)]
 A  B  D 
 2  8 20 

You can specify a vector of row-indices along with a vector of column-indices, if you like:

numbersMat[1:2, 1:3]
  A B  C
a 1 7 13
b 2 8 14

If the matrix has row or column names then you may use them in place of indices to make a selection:

numbersMat[, c("B", "D")]
   B  D
a  7 19
b  8 20
c  9 21
d 10 22
e 11 23
f 12 24

You can use sub-setting to change the values of the elements of a matrix:

numbersMat[2,3] <- 0
numbersMat
  A  B  C  D
a 1  7 13 19
b 2  8  0 20
c 3  9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24

You can assign a value to an entire row:

numbersMat[2,] <- 0
numbersMat
  A  B  C  D
a 1  7 13 19
b 0  0  0  0
c 3  9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24

In the code above, the 0 was “recycled” into each of the four elements of the second row.

You can assign the elements of a vector to corresponding selected elements of a matrix:

numbersMat[2,] <- c(100, 200, 300, 400)
numbersMat
    A   B   C   D
a   1   7  13  19
b 100 200 300 400
c   3   9  15  21
d   4  10  16  22
e   5  11  17  23
f   6  12  18  24

7.2.1 To Drop or Not?

Note that when we ask for a single row of numbersMat we get a regular one-dimensional vector:

numbersMat[3, ]
 A  B  C  D 
 3  9 15 21 

The same thing happens if we ask for a single column:

numbersMat[ , 2]
  a   b   c   d   e   f 
  7 200   9  10  11  12 

We get the second column of numbersMat, but as a regular vector. It’s not a “column” anymore. (Note that it inherits the row names from numbersMat.)

When a subset of a matrix comes from only one row or column, R takes the opportunity to “drop” the class of the subset from “matrix” to “vector.” If you would like the subset to stay a matrix, set the drop parameter, which by default is TRUE, to FALSE. Thus the second column of numbersMat, kept as a matrix with six rows and one column, is found as follows:

numbersMat[ , 2, drop = FALSE]
    B
a   7
b 200
c   9
d  10
e  11
f  12

In most applications people want the simpler vector structure, so they usually leave drop at its default value.
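
As a quick check on what drop actually changes, the following sketch compares the two results. The names m, col_vec and col_mat are ours; the matrix is simply a fresh copy with the same shape as numbersMat.

```r
m <- matrix(1:24, nrow = 6)       # same shape as numbersMat, no dimnames

col_vec <- m[, 2]                 # drop = TRUE by default
col_mat <- m[, 2, drop = FALSE]   # keep the matrix structure

is.matrix(col_vec)   # FALSE
dim(col_vec)         # NULL: a plain vector has no dim attribute
is.matrix(col_mat)   # TRUE
dim(col_mat)         # 6 1: six rows, one column
```

Keeping the matrix structure can matter when later code calls nrow() or ncol() on the result, since both return NULL for a plain vector.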

7.2.2 Practice Exercises

In these exercises we’ll work with the following matrix:

numbers <- 1:40
practiceMatrix <- matrix(numbers, nrow = 4)
rownames(practiceMatrix) <- letters[1:4]
colnames(practiceMatrix) <- LETTERS[1:10]
practiceMatrix
  A B  C  D  E  F  G  H  I  J
a 1 5  9 13 17 21 25 29 33 37
b 2 6 10 14 18 22 26 30 34 38
c 3 7 11 15 19 23 27 31 35 39
d 4 8 12 16 20 24 28 32 36 40
  1. Write two different one-line commands to get this matrix:
  B  C  D  E
a 5  9 13 17
c 7 11 15 19
  2. Write a one-line command to get this matrix:
  A  C  E  G  I
a 1  9 17 25 33
b 2 10 18 26 34
c 3 11 19 27 35
d 4 12 20 28 36
  3. Write a one-line command to get this vector:
a b c d
1 2 3 4
  4. Write a one-line command to get this vector:
 A  B  C  D  E  F  G  H  I  J
 2  6 10 14 18 22 26 30 34 38
  5. Write a one-line command to get this matrix:
  A
a 1
b 2
c 3
d 4
  6. Write a convenient one-line command to get this matrix:
  A B  C  D  E  F  G  H  I
a 1 5  9 13 17 21 25 29 33
b 2 6 10 14 18 22 26 30 34
c 3 7 11 15 19 23 27 31 35
d 4 8 12 16 20 24 28 32 36
  7. Write a convenient one-line command to get this matrix:
  A  C  D  E  F  G  H  I
a 1  9 13 17 21 25 29 33
b 2 10 14 18 22 26 30 34
c 3 11 15 19 23 27 31 35
d 4 12 16 20 24 28 32 36
  8. Write a function called myRowSums() that will find the sums of the rows of any given matrix. The function should use a for-loop (see the Chapter on Flow Control). The function should take a single parameter called mat, the matrix whose rows the user wishes to sum. It should work like this:
myMatrix <- matrix(1:24, ncol = 6)
myRowSums(mat = myMatrix)
[1] 66 72 78 84

7.2.3 Solutions to Practice Exercises

  1. Here are two ways:
practiceMatrix[c(1,3), 2:5]
practiceMatrix[c("a","c"), 2:5]
  2. Here’s one way:
practiceMatrix[ , seq(1, 9, by = 2)]
  3. Here’s one way:
practiceMatrix[ , 1]
  4. Here’s one way:
practiceMatrix[2, ]
  5. Here’s one way:
practiceMatrix[ , 1, drop = FALSE]
  6. Here’s one way:
practiceMatrix[ , -10]
  7. Here’s one way:
practiceMatrix[ , -c(2, 10)]
  8. Here is one way to write the function:
myRowSums <- function(mat) {
  n <- nrow(mat)
  sums <- numeric(n)        # one slot per row, all starting at 0
  for (i in seq_len(n)) {   # like 1:n, but safe when the matrix has no rows
    sums[i] <- sum(mat[i, ])
  }
  sums
}

7.3 Operations on Matrices

Matrices can be involved in arithmetical and logical operations.

7.3.1 Arithmetical Operations

The usual arithmetic operations apply to matrices, operating element-wise. For example, suppose that we have:

mat1 <- matrix(rep(1, 4), nrow = 2)
mat2 <- matrix(rep(2, 4), nrow = 2)

To get the sum of the above two matrices, R adds their corresponding elements and forms a new matrix out of their sums, thus:

mat1 + mat2
     [,1] [,2]
[1,]    3    3
[2,]    3    3

R applies recycling as needed. For example, suppose we have:

mat <- matrix(1:4, nrow = 2)
mat
     [,1] [,2]
[1,]    1    3
[2,]    2    4

In order to multiply each element of mat by 2, we need not create a 2-by-2 matrix of 2’s. We can simply multiply by 2, and R will take care of recycling the 2:

2 * mat
     [,1] [,2]
[1,]    2    6
[2,]    4    8

Or we could subtract 3 from each element of mat:

mat - 3
     [,1] [,2]
[1,]   -2    0
[2,]   -1    1
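
Recycling is not limited to single numbers. As a small sketch (the object scaled is our own name), a shorter vector is recycled in the same column-by-column order that matrix() uses to fill a matrix, so a vector of length 2 lines up with the two rows of mat:

```r
mat <- matrix(1:4, nrow = 2)

# c(10, 100) is recycled down each column:
# row 1 is multiplied by 10, row 2 by 100
scaled <- mat * c(10, 100)
scaled
#      [,1] [,2]
# [1,]   10   30
# [2,]  200  400
```

If you intended to scale the columns rather than the rows, this silent behavior would give a wrong answer without any warning, so it is worth checking the result on a small case first.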

7.3.2 Logical Operations

Boolean operations apply to matrices element-wise, just as they do to ordinary vectors. The result is a matrix of logical values. For example, consider the original matrix numbersMat:

numbersMat <- matrix(1:24, nrow = 6)

Suppose we wish to determine which elements of numbersMat are odd. Then we simply ask whether the remainder of an element after division by 2 is equal to 1:

numbersMat %% 2 == 1
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE  TRUE  TRUE  TRUE
[2,] FALSE FALSE FALSE FALSE
[3,]  TRUE  TRUE  TRUE  TRUE
[4,] FALSE FALSE FALSE FALSE
[5,]  TRUE  TRUE  TRUE  TRUE
[6,] FALSE FALSE FALSE FALSE

We can select elements from a matrix using a Boolean expression, too:

numbersMat[numbersMat %% 2 == 1]
 [1]  1  3  5  7  9 11 13 15 17 19 21 23

Note that the result is an ordinary, one-dimensional vector.
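
A logical matrix is useful for more than selection. The sketch below (with our own names m, isOdd and m2) counts the odd elements and then replaces them, relying on the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic:

```r
m <- matrix(1:24, nrow = 6)   # a fresh copy of the original numbersMat
isOdd <- m %% 2 == 1          # logical matrix, same shape as m

sum(isOdd)     # 12: the number of odd elements
mean(isOdd)    # 0.5: the proportion of odd elements

m2 <- m            # work on a copy so that m is left alone
m2[isOdd] <- 0L    # assign 0 to every odd element
m2[1:2, ]
#      [,1] [,2] [,3] [,4]
# [1,]    0    0    0    0
# [2,]    2    8   14   20
```

Note that assignment through a logical matrix keeps the matrix shape, even though selection with the same logical matrix returns a plain vector.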

7.3.3 Practice Exercises

We’ll work with the following three matrices:

a <- matrix(c(7, 4, 9, 10), nrow = 2)
a
     [,1] [,2]
[1,]    7    9
[2,]    4   10
b <- matrix(1:4, nrow = 2)
b
     [,1] [,2]
[1,]    1    3
[2,]    2    4
c <- matrix(letters[1:24], nrow = 6, byrow = TRUE)
c
     [,1] [,2] [,3] [,4]
[1,] "a"  "b"  "c"  "d" 
[2,] "e"  "f"  "g"  "h" 
[3,] "i"  "j"  "k"  "l" 
[4,] "m"  "n"  "o"  "p" 
[5,] "q"  "r"  "s"  "t" 
[6,] "u"  "v"  "w"  "x" 
  1. Find a one-line command using a that results in:
     [,1] [,2]
[1,]   10   12
[2,]    7   13
  2. Find a one-line command using a that results in:
     [,1] [,2]
[1,]   14   18
[2,]    8   20
  3. Find a one-line command using a that results in:
     [,1] [,2]
[1,]   49   81
[2,]   16  100
  4. Find a one-line command using a and b that results in:
     [,1] [,2]
[1,]    6    6
[2,]    2    6
  5. Describe in words what the following command does:
a > 5
  6. Write a one-line command using a that tells you which elements of a are one more than a multiple of 3.

  7. Using c, write a one-line boolean expression that produces the following:

      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE  TRUE
[3,]  TRUE  TRUE  TRUE  TRUE
[4,]  TRUE  TRUE  TRUE  TRUE
[5,]  TRUE  TRUE  TRUE  TRUE
[6,]  TRUE  TRUE  TRUE  TRUE

7.3.4 Solutions to Practice Exercises

  1. Here’s one way:
a + 3
  2. Here’s one way:
2 * a
  3. Here’s one way:
a^2
  4. Here’s one way:
a - b
  5. It produces a logical matrix of the same dimensions as a. The new matrix will have TRUE in a cell when the corresponding cell of a is greater than 5. Otherwise, the cell will have FALSE in it.

  6. Here’s one way:

a %% 3 == 1
  7. Here’s one way:
c >= "h"

7.4 Introduction to Data Frames

R is sometimes spoken of as a domain-specific programming language, meaning that although it can in principle perform any sort of computation that a human can perform (given enough pencil, paper and time), it was originally designed to perform tasks in a particular area of application. R’s original area of application is data analysis and statistics, especially when performed interactively—i.e., in a setting where the analyst asks for a relatively small computation, examines the results, modifies his or her requests and asks again, and so on.1 Although R can be used effectively for a wide range of programming tasks, data analysis is where it really shines.

The data structures of R reflect its orientation to data analysis. We have met a data-oriented structure already—the table, which is one of many convenient ways to display the results of data analysis. For the purpose of organizing data in preparation for analysis, R provides the structure known as the data frame. A data frame facilitates the storage of related data in one location, in a form that makes the most sense to human users.

A data frame is like a matrix in that it is two-dimensional—it has rows and columns. Unlike a matrix, though, the elements of a data frame do not have to be all of the same data-type. Each column of a data frame is a vector—of the same length as all the others—but these vectors may be of different types: some numerical, some logical, etc.
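
The difference shows up as soon as we try to store text and numbers side by side. In the sketch below, all object names are ours, and stringsAsFactors = FALSE is written out so that the result does not depend on the R version. The matrix coerces everything to character, while the data frame lets each column keep its own type:

```r
name   <- c("Dorothy", "Lion")
height <- c(58, 75)

asMatrix <- cbind(name, height)   # a matrix: every element becomes character
asFrame  <- data.frame(name, height, stringsAsFactors = FALSE)

typeof(asMatrix)        # "character"
asMatrix[, "height"]    # "58" "75": strings, not numbers
sapply(asFrame, class)  # name is "character", height is "numeric"
mean(asFrame$height)    # 66.5: arithmetic still works on the numeric column
```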

7.4.1 Viewing a Data Frame

Let’s take a close look at a data frame: the frame m111survey, which is available from the bcscr package (White 2024).

Description

 Results of a survey of MAT 111 students at Georgetown College.
 
         • height.  How tall are you, in inches?
 
         • ideal_ht.  How tall would you LIKE to be, in inches?
 
         • sleep.  How much sleep did you get last night?
 
         • fastest.  What is the highest speed at which you have ever
           driven a car?
 
         • weight_feel.  How do you feel about your weight?
 
         • love_first.  Do you believe in love at first sight?
 
         • extra_life.  Do you believe in extraterrestrial life?
 
         • seat.  When you have a choice, where do you prefer to sit in
           a classroom?
 
         • GPA.  What is your college GPA?
 
         • enough_Sleep.  Do you think you get enough sleep?
 
         • sex.  What sex are you?
 
         • diff.  Your ideal height minus your actual height.

Data Table 7.1

To learn about the data frame in an R session, we would first attach the package itself:

library(bcscr)

In the RStudio IDE, we can get a look at the frame in a tab in the Editor pane if we use the View() function:

View(m111survey)

As with many objects provided by a package, we can get more information about it:

help("m111survey")

From the Help we see that m111survey records the results of a survey conducted in a number of sections of an elementary statistics course at Georgetown College. From the View we see that the frame is arranged in rows and columns. Each row corresponds to what in data analysis is known as a case or an individual: here, each row goes with a student who participated in the survey. The columns correspond to variables: measurements made on each individual. For a student on a given row, the values in the columns are the values recorded for that student.

When you are not working in RStudio, there are still a couple of ways to view the frame. You could print it all out to the console:

m111survey

You could also use the head() function to view a specified number of initial rows:

head(m111survey, n = 6)  # see first six rows

7.4.2 The Structure of a Data Frame

Further information about the frame may be obtained with the str() function:

str(m111survey)
'data.frame':   71 obs. of  12 variables:
 $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
 $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
 $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
 $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
 $ weight_feel    : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
 $ love_first     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ extra_life     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
 $ seat           : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
 $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
 $ enough_Sleep   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
 $ sex            : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
 $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...

The concept of structure extends far beyond the domain of computer programming.2 In general the structure of any object consists of:

  • the kind of thing that the object is;
  • the parts the object is made up of;
  • the relationships between these parts—the rules, if you will, for how the parts work together to make the object do what it does.

In the case of m111survey, the kind of thing the object is corresponds to its class: it’s a data frame.

class(m111survey)
[1] "data.frame"

Next we see the account of the parts of the object and the way in which the parts relate to one another:

71 obs. of  12 variables

From this we know that there are 71 individuals in the study. The data consists of 12 “parts”—the variables—which are related in the sense that they all provide information about the same set of 71 people.

After that the output of str() launches into an account of the structure of each of the parts, for example:

$ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...

We are told the kind of thing that height is: it’s a numerical vector (a vector of type double, in fact). Next we are given the beginning of a statement of its parts: the heights of the individuals. So R is actually giving us the structure of the parts, as well as of the whole m111survey.

The variable fastest refers to the fastest speed—in miles per hour—that a person has ever driven a car. Note that it is a vector of type integer. Officially this is a numerical variable, too, but R is calling attention to the fact that the fastest-speed data is being stored as integers rather than as floating-point decimals.

The variables of a data frame are typically associated with the names of the frame:

names(m111survey)
 [1] "height"          "ideal_ht"        "sleep"           "fastest"        
 [5] "weight_feel"     "love_first"      "extra_life"      "seat"           
 [9] "GPA"             "enough_Sleep"    "sex"             "diff.ideal.act."

By means of the names we can isolate a vector in any column, identified in our code in the format frame$variable. For example, to see the first ten elements of the fastest variable, we ask for:

m111survey$fastest[1:10]
 [1] 119 110  85 100  95 100  85 160  90  90

In order to compute the mean fastest speed our subjects drove their cars, we can ask for:

mean(m111survey$fastest, na.rm = TRUE)
[1] 105.9014

If you want to see the speeds that are at least 150 miles per hour, you could ask for:

m111survey$fastest[m111survey$fastest >= 150]
[1] 160 190

If you worry that the form frame$variable will require an annoying amount of typing, as seems to be the case in the example above, then you can use the with() function:

with(m111survey, fastest[fastest >= 150])
[1] 160 190

It’s instructive to consider how with() works. If we were to include the names of the parameters of with() explicitly, then the call would have looked like this:

with(data = m111survey, expr = fastest[fastest >= 150])

For the data parameter we can supply a data frame or any other R-object that can be used to construct an environment. In this case m111survey provides a miniature environment consisting of the names of its variables. For the expr parameter we supply an expression for R to evaluate. As R evaluates the expression, it encounters names (such as fastest). Now ordinarily R would first search whatever counts as the active environment (in this case it’s the Global Environment) for the names in the expression, but with() forces R to look first within the environment created by the data argument. In our example, R finds fastest inside m111survey and evaluates the expression on that basis. If it had not found fastest in m111survey, R would have moved on to the Global Environment and then the rest of the usual search path and (probably) would have found nothing, causing it to throw an “object not found” error message. In R, as in any other programming language, good programming depends very much on paying attention to how the language searches for the objects to which names refer.
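
The search order can be seen in a small experiment. This is only a sketch: the frame speeds is a toy stand-in for m111survey, and the global objects fastest and cutoff are invented for the demonstration.

```r
speeds  <- data.frame(fastest = c(119, 160, 85, 190))  # toy stand-in frame
fastest <- c(1, 2, 3)   # a global object that happens to share the name
cutoff  <- 150          # exists only outside the frame

# The column inside the frame is found first, so the global fastest is ignored
with(speeds, fastest)                      # 119 160  85 190

# cutoff is not a column, so R falls back to the calling environment for it
with(speeds, fastest[fastest >= cutoff])   # 160 190
```

A name clash like the one between the column fastest and the global fastest is easy to create by accident, which is one reason to keep an eye on where R will look first.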

7.4.3 Factors

Some of the variables in m111survey are called factors; an example is seat, which pertains to where one prefers to sit in a classroom:

str(m111survey$seat)
 Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...

Seating preference is an example of a categorical variable: one whose values are not meaningfully expressed in terms of numbers. When a categorical variable has a relatively small number of possible values, it can be convenient to store its values in a vector of class factor.

The levels of a factor variable are its possible values. In the case of seat, these are front, middle and back (stored with the labels "1_front", "2_middle" and "3_back"). As a memory-saving measure, R stores the values in the factor as numbers, where 1 stands for the first level, 2 for the second level, and so on. But please bear in mind that we are dealing with a categorical variable, so the numbers don’t relate to the possible values in any natural way: they are just storage conventions.

It’s possible to create a factor from any type of vector, but most often this is done with a character vector. Suppose for instance, that eight people are asked for their favorite Wizard of Oz character and they answer:

ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

We can create a factor variable as follows:

factorFavs <- factor(ozFavs)
factorFavs
[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Dorothy Glinda Scarecrow Toto

Note that the levels are given in alphabetical order: this is the default procedure when R creates a factor. It is possible to ask for a different order, though:

factor(ozFavs, levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Toto Scarecrow Glinda Dorothy

In many instances it is appropriate to convert a character vector to a factor, but sometimes this is not such a great idea. Consider something like your address, or your favorite inspirational quote: pretty much every person in a study will have a different address or favorite quote than others in the study. Hence there won’t be any memory-storage benefit associated with creating a factor: the vector of levels—itself a character vector—would require as much storage space as the original character vector itself! In addition, we will see that the status of a variable as class “factor” can affect how R’s statistical and graphical functions deal with it. It’s not a good idea to treat a categorical variable as a factor unless its set of possible values is considered important.
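
To see the storage convention directly, we can pull a factor apart with a few base R functions. The sketch below rebuilds the factor from the Oz example so that it runs on its own:

```r
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")
factorFavs <- factor(ozFavs)

levels(factorFavs)        # "Dorothy" "Glinda" "Scarecrow" "Toto"
as.integer(factorFavs)    # 2 4 4 1 4 2 3 1: positions within levels()
as.character(factorFavs)  # recovers the original strings
table(factorFavs)         # counts for each level
```

The integer codes depend entirely on the order of the levels, so they carry no meaning of their own; asking for a different order of levels would change every code without changing the data.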

We will think more about how to deal with factor variables later on, when we begin data analysis in earnest.

7.4.4 Practice Exercises

Description

 This data table is modified slightly from mosaicData::RailTrail, (see
 http://cran.r-project.org/web/packages/mosaicData/mosaicData.pdf).
 Description below is drawn from the mosaicData help file.

Format

 A data frame with 90 observations on the following variables.
 
         • hightemp daily high temperature (in degrees Fahrenheit)
 
         • lowtemp daily low temperature (in degrees Fahrenheit)
 
         • avgtemp average of daily low and daily high temperature (in
           degrees Fahrenheit)
 
         • season spring, summer or fall
 
         • cloudcover measure of cloud cover (in oktas)
 
         • precip measure of precipitation (in inches)
 
         • volume estimated number of trail users that day (number of
           breaks recorded)
 
         • weekday logical indicator of whether the day was a
           non-holiday weekday
 
         • dayType one of "weekday" or "weekend"

Details

 The Pioneer Valley Planning Commission (PVPC) collected data north of
 Chestnut Street in Florence, MA for ninety days from April 5, 2005 to
 November 15, 2005. Data collectors set up a laser sensor, with breaks
 in the laser beam recording when a rail-trail user passed the data
 collection station.
 
 There is a potential for error when two users trigger the infrared beam
 at exactly the same time since the counter would only log one of the
 crossings.  The collectors left the motion detector out during the
 winter, but because the counter drops data when the temperature falls
 below 14 degrees Fahrenheit, there is no data for the cold winter
 months.

Data Table 7.2
  1. In an R session, how would you learn more about the data frame railtrail from the bcscr package?

  2. Write a one-line command to see the first 10 rows of railtrail in the Console.

  3. Write a one-line command to get the names of all of the variables in railtrail.

  4. Regarding railtrail: write a one-line command to get the high temperature on all the days when the precipitation was more than 0.5 inches.

  5. Regarding railtrail: write a one-line command to sort the average temperatures from highest to lowest.

7.4.5 Solutions to the Practice Exercises

  1. One way is to attach the package, then ask for help:
library(bcscr)
help(railtrail)

Another way is to name the package in the call to help(), via its package parameter:

help("railtrail", package = "bcscr")

That way you don’t have to add all the items in bcscr to your search path.

  2. Here’s one way:
head(bcscr::railtrail, n = 10)
  3. It’s names(bcscr::railtrail).

  4. Here’s one way:

with(bcscr::railtrail, hightemp[precip > 0.5])
  5. Try this:
sort(bcscr::railtrail$avgtemp, decreasing = TRUE)

7.5 Creating Data Frames

There are many ways to create data frames in R. Here we will introduce just two ways.

7.5.1 Creation from Vectors

Whenever you have vectors of the same length, you can combine them into a data frame, using the data.frame() function:

n <- c("Dorothy", "Lion", "Scarecrow")
h <- c(58, 75, 69)
a <- c(12, 0.04, 18)
ozFolk <- data.frame(name = n, height = h, age = a)
ozFolk
       name height   age
1   Dorothy     58 12.00
2      Lion     75  0.04
3 Scarecrow     69 18.00

Note that at the time of creation you can provide the variables with any names that you like. If later on you change your mind about the names, you can always revise them:

names(ozFolk)  
[1] "name"   "height" "age"   
names(ozFolk)[2] <- "Height"  # "height" was at index 2
ozFolk
       name Height   age
1   Dorothy     58 12.00
2      Lion     75  0.04
3 Scarecrow     69 18.00

7.5.2 Creation From Other Frames

If two frames have the same number of rows, you may combine their columns to form a new frame with the cbind() function:

ozMore <- data.frame(
  color = c("blue", "red", "yellow"),
  desire = c("Kansas", "courage", "brains")
)
cbind(ozFolk, ozMore)
       name Height   age  color  desire
1   Dorothy     58 12.00   blue  Kansas
2      Lion     75  0.04    red courage
3 Scarecrow     69 18.00 yellow  brains

Similarly, if two data frames have the same columns (matching names and compatible types), then we can use the rbind() function to stack their rows:

ozFolk2 <- data.frame(
  name = c("Toto", "Glinda"),
  Height = c(12, 66), age = c(3, 246)
)
rbind(ozFolk, ozFolk2)
       name Height    age
1   Dorothy     58  12.00
2      Lion     75   0.04
3 Scarecrow     69  18.00
4      Toto     12   3.00
5    Glinda     66 246.00

Note: cbind() and rbind() work for matrices, too.
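
Here is a brief sketch of the matrix case, using small matrices whose names (top, bottom, stacked, sideBy) are ours:

```r
top    <- matrix(1:4, nrow = 2)   # 2 x 2
bottom <- matrix(5:8, nrow = 2)   # 2 x 2

stacked <- rbind(top, bottom)   # rows of bottom go underneath top
sideBy  <- cbind(top, bottom)   # columns of bottom go to the right of top

dim(stacked)   # 4 2
dim(sideBy)    # 2 4
sideBy[1, ]    # 1 3 5 7
```

For rbind() the matrices need the same number of columns, and for cbind() the same number of rows.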

7.6 Subsetting Data Frames

Our study of sub-setting matrices can be applied to the selection of parts of a data frame. As with a matrix, one or both of the dimensions of the frame can come into play.

We can create a new data frame consisting of any columns we like from the original frame:

df <- m111survey[, c("height", "ideal_ht")]
head(df)
  height ideal_ht
1   76.0       78
2   74.0       76
3   64.0       NA
4   62.0       65
5   72.0       72
6   70.8       NA

If we select just one column, then the result is a vector rather than a data frame:

df <- m111survey[, "height"]
is.vector(df)
[1] TRUE

If for some reason you want to prevent this, set drop to FALSE:

df <- m111survey[, "height", drop = FALSE]
head(df)
  height
1   76.0
2   74.0
3   64.0
4   62.0
5   72.0
6   70.8

You may select particular rows, too:

m111survey[10:15, c("height", "ideal_ht")]
   height ideal_ht
10     67       67
11     65       69
12     62       62
13     59       62
14     78       75
15     69       72

You can even select some of the rows at random. Here is a random sample of size six:

n <- nrow(m111survey)
df <- m111survey[sample(1:n, size = 6, replace = FALSE), ]
df[c("sex", "seat")]  # show just two columns
      sex     seat
13 female  1_front
54   male 2_middle
56   male   3_back
28 female  1_front
53 female   3_back
46 female 2_middle

Note the function nrow() that gives the number of rows of the frame. When we sample six items without replacement from the vector 1:n, we are picking six numbers at random from the row-numbers of the frame. Specifying these six numbers in the selection operator [ yields the desired random sample of rows.
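
Because the rows are chosen at random, running the code again will usually give a different sample. If you need the same sample every time, one common approach is to set the seed of the random number generator first. The sketch below uses the built-in mtcars frame in place of m111survey so that it runs without any extra packages; the seed value 2024 is arbitrary.

```r
set.seed(2024)                 # arbitrary seed, chosen only for illustration
n <- nrow(mtcars)              # mtcars comes with R and has 32 rows
rows <- sample(1:n, size = 6, replace = FALSE)
sampleA <- mtcars[rows, ]

set.seed(2024)                 # same seed again, so the same rows come back
sampleB <- mtcars[sample(1:n, size = 6, replace = FALSE), ]

identical(sampleA, sampleB)    # TRUE
```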

7.6.1 Boolean Expressions

It is especially common to select rows by the values of a logical vector. For example, to select the rows where the fastest speed ever driven is at least 150 miles per hour, try this:

df <- m111survey[m111survey$fastest >= 150, ]
df[, c("sex", "fastest")]  # show just two of the variables
    sex fastest
8  male     160
32 male     190

When you are selecting rows it can be convenient to use the subset() function. The first argument to the function is the frame from which you plan to select, and the second is the Boolean expression by which to select:

df <- subset(m111survey, fastest >= 150)
df[, c("sex", "fastest")] 
    sex fastest
8  male     160
32 male     190

Note that we did not need to type m111survey$fastest: the first argument to subset() provides the environment in which to search for names that appear in the Boolean expression.
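
One practical difference between the two approaches shows up when the variable contains missing values. With the [ operator, an NA in the logical vector produces a row filled with NAs, whereas subset() keeps only the rows where the condition is TRUE. The sketch below illustrates this on a made-up frame (toy and its values are invented for the example):

```r
# made-up frame with one missing speed, for illustration only
toy <- data.frame(
  sex = c("male", "female", "male"),
  fastest = c(160, NA, 120)
)

by_bracket <- toy[toy$fastest >= 150, ]   # the NA yields an all-NA row
by_subset <- subset(toy, fastest >= 150)  # the NA row is left out

nrow(by_bracket)  # 2
nrow(by_subset)   # 1

# which() discards NAs, so this also gives just one row:
toy[which(toy$fastest >= 150), ]
```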

The Boolean sub-setting expressions can be quite complex:

df <- subset(m111survey, seat == "3_back" & height < 72 & sex == "female")
df[, c("sex", "height", "seat")]
      sex height   seat
9  female     59 3_back
20 female     65 3_back
30 female     69 3_back
53 female     69 3_back
70 female     65 3_back

Note: subset() takes a third parameter called select that allows you to pick out any desired columns. For example:

subset(m111survey, seat == "3_back" & height < 72 & sex == "female",
       select = c("sex", "height", "seat"))
      sex height   seat
9  female     59 3_back
20 female     65 3_back
30 female     69 3_back
53 female     69 3_back
70 female     65 3_back

7.6.2 Practice Exercises

We’ll use the CPS85 data frame from the mosaicData package. You should go ahead and load the package and then read about the data frame:

library(mosaicData)
?CPS85

Description

 The Current Population Survey (CPS) is used to supplement census
 information between census years. These data consist of a random sample
 of persons from the CPS85, with information on wages and other
 characteristics of the workers, including sex, number of years of
 education, years of work experience, occupational status, region of
 residence and union membership.

Format

 A data frame with 534 observations on the following variables.
 
 'wage' wage (US dollars per hour)
 
 'educ' number of years of education
 
 'race' a factor with levels 'NW' (nonwhite) or 'W' (white)
 
 'sex' a factor with levels 'F' 'M'
 
 'hispanic' a factor with levels 'Hisp' 'NH'
 
 'south' a factor with levels 'NS' 'S'
 
 'married' a factor with levels 'Married' 'Single'
 
 'exper' number of years of work experience (inferred from 'age' and
           'educ')
 
 'union' a factor with levels 'Not' 'Union'
 
 'age' age in years
 
 'sector' a factor with levels 'clerical' 'const' 'manag' 'manuf'
           'other' 'prof' 'sales' 'service'

Data Table 7.3

Each row in the data frame corresponds to an employee in the survey.

  1. Write a command that gives the number of employees in the data frame.

  2. Select the employees who are between 40 and 50 years old.

  3. Select the employees who are married and have fewer than 30 years of experience.

  4. Select the nonunion employees who either live in the South or who have more than 12 years of education (or both).

  5. Select the employees who work in the clerical, construction, management or professional sector.

  6. Select the employees who make more than 30 dollars per hour, and keep only their wage, sex and sector of employment.

  7. Select 10 employees at random, keeping only their wage and sex.

  8. Select all of the employees, keeping all information about them except for their union status and whether or not they are from the South.

7.6.3 Solutions to Practice Exercises

  1. The command is nrow(CPS85).

  2. Try this:

subset(CPS85, age > 40 & age < 50)
  3. Try this:
subset(CPS85, married == "Married" & exper < 30)
  4. Try this:
subset(CPS85, union == "Not" & (south == "S" | educ > 12))
  5. Try this (the level names are the ones listed in the help file for CPS85):
subset(CPS85, sector %in% c("clerical", "const",
                            "manag", "prof"))
  6. Try this:
CPS85[CPS85$wage > 30, c("wage", "sex", "sector")]
  7. Try this:
CPS85[sample(1:nrow(CPS85), size = 10, replace = FALSE),
      c("wage", "sex")]
  8. Try this (south and union are columns 6 and 9, respectively):
CPS85[ , -c(6, 9)]

The select parameter of the subset() function has a little-known feature that allows you to specify columns to omit by name, so the following is another solution:

subset(CPS85, select = -c(south, union))

7.7 New Variables from Old

Quite often you will want to transform one or more variables in a data frame. Transforming a variable means changing its values in a systematic way.

For example, you might want to measure height in feet rather than inches. Then you want the following:

heightInFeet <- with(m111survey, height/12)  # 12 inches in a foot

If you plan to use this new variable in your analysis later on, it might be a good idea to add it to the data frame:

m111survey$height_ft <- heightInFeet

Another common need is to recode the values of a categorical variable. For example, you might want to divide people into two groups: those who prefer to sit in the back and those who don’t. This is a good time to use ifelse():

seat2 <- ifelse(m111survey$seat == "3_back", "Back", "Other")
m111survey$seat2 <- seat2

If you plan to re-code into a variable that involves more than two values, then you might want to look into the mapvalues() function from the plyr package (Wickham 2023):

seat3 <- plyr::mapvalues(
  m111survey$seat,
  from = c("1_front", "2_middle", "3_back"),
  to = c("Front", "Middle", "Back")
)
str(seat3)
 Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...
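
If you would rather not depend on an extra package, base R can do the same kind of recoding when the variable is a factor. The following is a sketch of one possible alternative, not the approach taken above; the vector seat is a made-up stand-in for m111survey$seat:

```r
# made-up stand-in for m111survey$seat
seat <- factor(c("1_front", "2_middle", "2_middle", "3_back"))

# factor() with levels and labels maps each old value to a new one
seat3 <- factor(
  seat,
  levels = c("1_front", "2_middle", "3_back"),
  labels = c("Front", "Middle", "Back")
)
as.character(seat3)  # "Front" "Middle" "Middle" "Back"
```

The levels and labels vectors are paired up by position, so they play the same role as the from and to arguments of mapvalues().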

Another common transformation involves turning a numerical variable into a factor. For example, we might need to classify people as:

  • Tall (height over 70 inches)
  • Medium (over 65 inches, up to and including 70 inches)
  • Short (65 inches or less)

The cut() function will be helpful.

Listing 7.1: An illustration of the cut() function
heightClass <- cut(
  m111survey$height,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium","Tall"),
  right = TRUE
)
str(heightClass)
 Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 2 3 1 2 ...

Setting right = TRUE indicates that the upper bound of each interval is included in the interval. Thus, a person with a height of 70 inches is classed as Medium, not Tall. Likewise, a person with a height of exactly 65 inches is classed as Short, not Medium.
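
To see exactly where the boundary values land, it can help to try cut() on a few heights chosen at the break points. This sketch uses invented heights and compares right = TRUE with right = FALSE:

```r
heights <- c(64, 65, 70, 71)  # invented values at and around the breaks
brks <- c(-Inf, 65, 70, Inf)
labs <- c("Short", "Medium", "Tall")

upper_closed <- cut(heights, breaks = brks, labels = labs, right = TRUE)
lower_closed <- cut(heights, breaks = brks, labels = labs, right = FALSE)

as.character(upper_closed)  # "Short" "Short"  "Medium" "Tall"
as.character(lower_closed)  # "Short" "Medium" "Tall"   "Tall"
```

With right = TRUE each interval includes its upper bound; with right = FALSE each interval includes its lower bound instead.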

7.7.1 Getting Rid of Variables

We have added several variables to m111survey. In order to remove them (or any other variables we don’t want) we can assign them the value NULL.

names(m111survey)
 [1] "height"          "ideal_ht"        "sleep"           "fastest"        
 [5] "weight_feel"     "love_first"      "extra_life"      "seat"           
 [9] "GPA"             "enough_Sleep"    "sex"             "diff.ideal.act."
[13] "height_ft"       "seat2"          
m111survey$height_ft <- NULL
m111survey$seat2 <- NULL
m111survey$seat3 <- NULL
names(m111survey)  # the extra variables are gone
 [1] "height"          "ideal_ht"        "sleep"           "fastest"        
 [5] "weight_feel"     "love_first"      "extra_life"      "seat"           
 [9] "GPA"             "enough_Sleep"    "sex"             "diff.ideal.act."
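
The same idea can be tried on a small made-up frame. The sketch also suggests why the line involving seat3 above is harmless: assigning NULL to a column that does not exist leaves the frame unchanged. It then shows one way to drop several columns by name in a single step (toy, toy2 and unwanted are invented for the example):

```r
toy <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(TRUE, FALSE, TRUE))

toy$b <- NULL        # remove column b
toy$no_such <- NULL  # no such column: nothing happens
names(toy)           # "a" "c"

# drop several columns at once, keeping those not named in `unwanted`
toy2 <- data.frame(a = 1:3, b = 4:6, c = 7:9)
unwanted <- c("b", "c")
toy2 <- toy2[, !(names(toy2) %in% unwanted), drop = FALSE]
names(toy2)          # "a"
```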

7.7.2 Practice Exercises

  1. Remove the variables hispanic and married from the mosaicData::CPS85 data frame.

  2. Change the units of wage in mosaicData::CPS85 from dollars per hour to dollars per day. Assume an eight-hour working day.

  3. For CPS85, create a new variable experGrp that has the following values:

  • low for experience less than 10 years;
  • medium for experience of at least 10 years but less than 25 years;
  • high for experience of at least 25 years.
  4. Using the experGrp variable in the previous exercise, create the following tally of the experience groups of the employees:
experGrp
   low medium   high 
   179    217    138 
  5. You’ve made some changes to CPS85, but in fact you haven’t changed the original data frame in the mosaicData package: you’ve simply made your own copy, which should now be in your Global Environment. Since the Global Environment comes before any package on your search path, if you want to get to the original CPS85 you will have to refer to it as mosaicData::CPS85. Another option, though, is to remove the modified copy from your Global Environment. Go ahead and remove it now.

7.7.3 Solutions to Practice Exercises

  1. Here’s one way to do it:
CPS85$hispanic <- NULL
CPS85$married <- NULL
  2. Here’s one way to do it:
CPS85$wage <- CPS85$wage * 8
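  3. Here’s one way to do it. Setting right = FALSE makes each interval include its lower bound rather than its upper bound, so exactly 10 years of experience counts as medium and exactly 25 years counts as high:
CPS85$experGrp <- cut(
  CPS85$exper,
  breaks = c(-Inf, 10, 25, Inf),
  labels = c("low", "medium", "high"),
  right = FALSE
)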
  4. Use table(CPS85$experGrp).

  5. Here’s what to do:

rm(CPS85)

7.8 More in Depth

7.8.1 Matrix Multiplication

This section may interest you if you know about matrix multiplication in linear algebra.

In order to accomplish matrix multiplication, we have to keep in mind that the regular multiplication operator * works element-wise on matrices, as we have already seen. For matrix multiplication R provides the special operator %*%. For example, consider the following matrices:

a <- matrix(1:6, ncol = 3)
a
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
b <- matrix(c(2, 1, -1), nrow = 3)
b
     [,1]
[1,]    2
[2,]    1
[3,]   -1

Observe that the number of columns of a is equal to the number of rows of b. Hence it is possible to form the matrix product a %*% b:

a %*% b
     [,1]
[1,]    0
[2,]    2

As expected, the result is a matrix having as many rows as a and as many columns as b.

It is also interesting to recall how matrix multiplication works when the second matrix has only one column. The product is obtained by multiplying each column of a by the element on the corresponding row of b, and adding the resulting matrices:

b[1,1]*a[ ,1, drop = FALSE] + b[2,1]*a[ ,2, drop = FALSE] + b[3,1]*a[ ,3, drop = FALSE]
     [,1]
[1,]    0
[2,]    2
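
Since * and %*% are easy to confuse, here is a small side-by-side comparison on a made-up square matrix (m is invented for the example):

```r
m <- matrix(1:4, nrow = 2)  # columns are (1, 2) and (3, 4)

m * m     # element-wise: each entry is squared
m %*% m   # matrix product: rows of m combined with columns of m

# the (1, 1) entry of the matrix product is row 1 times column 1:
sum(m[1, ] * m[, 1])  # 1*1 + 3*2 = 7
```

The two results agree only in the top-left corner by coincidence; in general they are different matrices.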

7.8.2 Ordering Data Frames

You can reorder as well as select. For example, the following code takes the first five rows of m111survey (keeping just two columns) and reverses their order:

df <- m111survey[, c("height", "ideal_ht")]
dfRev <- df[5:1, ]
head(dfRev)
  height ideal_ht
5     72       72
4     62       65
3     64       NA
2     74       76
1     76       78

If you want, you can even scramble the rows of the data frame in a random order:

n <- nrow(m111survey)
shuffle <- sample(1:n, size = n, replace = FALSE)
df <- m111survey[shuffle, ]
head(df[c("sex", "seat")])  # show just two columns
      sex     seat
25 female 2_middle
51 female 2_middle
69 female  1_front
52 female 2_middle
64   male   3_back
13 female  1_front

It is quite common to order the rows of a frame according to the values of a particular variable. For example, you might want to arrange the rows by height, so that the frame begins with the shortest subject and ends with the tallest.

Accomplishing this task requires a study of R’s order() function. Consider the following vector:

vec <- c(15, 12, 23, 7)

Call order() with this vector as an argument:

order(vec)
[1] 4 2 1 3

order() returns the indices of the elements of vec, in the following order:

  • the index of the smallest element (7, at index 4 of vec);
  • the index of the second-smallest element (12, at index 2 of vec);
  • the index of the third-smallest element (15, at index 1 of vec);
  • the index of the largest element (23, at index 3 of vec).

Can you guess the output of the following function-call without looking for the answer underneath?

vec[order(vec)]
[1]  7 12 15 23

Sure enough, the result is vec sorted: from smallest to largest element.

Now the sorting of vec could have been accomplished with R’s sort() function:

sort(vec)
[1]  7 12 15 23

The power of order() comes with the rearrangement of rows of a data frame. In order to “sort” the frame from shortest to tallest subject, call:

df <- m111survey[order(m111survey$height), ]
head(df[, c("sex", "height")])  # to show that it worked
      sex height
45 female     51
26 female     54
9  female     59
13 female     59
40 female     60
69 female     61

If you want to order the rows from tallest to shortest instead, then use the decreasing parameter, which by default is FALSE:

df <- m111survey[order(m111survey$height, decreasing = TRUE), ]
head(df[, c("sex", "height")])  # to show that it worked
      sex height
8    male     79
14 female     78
1    male     76
58   male     76
34   male     75
54   male     75

Sometimes you want to order by two or more variables. For example, suppose you want to arrange the frame so that the folks preferring to sit in front come first, followed by the people who prefer the middle and ending with the people who prefer the back. Within these groups you would like people to be arranged from shortest to tallest. Then call:

ordering <- with(m111survey, order(seat, height))
df <- m111survey[ordering, ]
head(df[, c("seat", "height")], n = 10)  # see if it worked
      seat height
45 1_front     51
26 1_front     54
13 1_front     59
69 1_front     61
4  1_front     62
12 1_front     62
23 1_front     63
38 1_front     63
61 1_front     63
57 1_front     64
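
When you want one variable in increasing order and another in decreasing order, one common trick for a numerical variable is to order by its negative. The sketch below uses a made-up frame (toy is invented for the example) and arranges people by seat, and from tallest to shortest within each seat:

```r
toy <- data.frame(
  seat = c("2_middle", "1_front", "1_front", "2_middle"),
  height = c(70, 62, 68, 65)
)

# seat ascending; within each seat, height descending
ordering <- with(toy, order(seat, -height))
toy[ordering, ]

ordering  # 3 2 1 4
```

The negation only makes sense for numbers, so this trick does not carry over directly to character or factor variables.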

7.8.3 Combining With rbind() and cbind()

If two matrices have the same number of rows, then you can bind their columns together to create a new matrix, using the cbind() function:

lowercase <- matrix(letters, nrow = 13)
uppercase <- matrix(LETTERS, nrow = 13)
both_cases <- cbind(lowercase, uppercase)
both_cases
      [,1] [,2] [,3] [,4]
 [1,] "a"  "n"  "A"  "N" 
 [2,] "b"  "o"  "B"  "O" 
 [3,] "c"  "p"  "C"  "P" 
 [4,] "d"  "q"  "D"  "Q" 
 [5,] "e"  "r"  "E"  "R" 
 [6,] "f"  "s"  "F"  "S" 
 [7,] "g"  "t"  "G"  "T" 
 [8,] "h"  "u"  "H"  "U" 
 [9,] "i"  "v"  "I"  "V" 
[10,] "j"  "w"  "J"  "W" 
[11,] "k"  "x"  "K"  "X" 
[12,] "l"  "y"  "L"  "Y" 
[13,] "m"  "z"  "M"  "Z" 

If two matrices have the same number of columns, then you can bind their rows together, with rbind():

lowercase2 <- matrix(letters, ncol = 13)
uppercase2 <- matrix(LETTERS, ncol = 13)
both_cases2 <- rbind(lowercase2, uppercase2)
both_cases2
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "c"  "e"  "g"  "i"  "k"  "m"  "o"  "q"  "s"   "u"   "w"   "y"  
[2,] "b"  "d"  "f"  "h"  "j"  "l"  "n"  "p"  "r"  "t"   "v"   "x"   "z"  
[3,] "A"  "C"  "E"  "G"  "I"  "K"  "M"  "O"  "Q"  "S"   "U"   "W"   "Y"  
[4,] "B"  "D"  "F"  "H"  "J"  "L"  "N"  "P"  "R"  "T"   "V"   "X"   "Z"  

rbind() and cbind() work with data frames, too. Here, we use rbind() to add a new row to a data frame:

n <- c("Dorothy", "Scarecrow", "Lion")
h <- c(58, 75, 69)
a <- c(12, 0.04, 18)
oz_folk <- data.frame(name = n, height = h, age = a)
one_more_person <- data.frame(
  name = "Tin Man",
  height = 72,
  age = 24
)
all_together <- rbind(oz_folk, one_more_person)
all_together
       name height   age
1   Dorothy     58 12.00
2 Scarecrow     75  0.04
3      Lion     69 18.00
4   Tin Man     72 24.00

We can add new columns as well:

new_properties <- data.frame(
  desire = c("Kansas", "brains", "courage", "a heart"),
  fav_color = c("crimson", "blue", "burlywood", "orange")
)
cbind(all_together, new_properties)
       name height   age  desire fav_color
1   Dorothy     58 12.00  Kansas   crimson
2 Scarecrow     75  0.04  brains      blue
3      Lion     69 18.00 courage burlywood
4   Tin Man     72 24.00 a heart    orange
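
When you bind the rows of data frames, the columns are matched by name rather than by position, so the two frames need the same column names but not necessarily in the same order. Here is a small sketch with invented frames (first, second and combined are names made up for the example):

```r
first <- data.frame(name = "Dorothy", height = 58)
second <- data.frame(height = 72, name = "Tin Man")  # columns in other order

combined <- rbind(first, second)
combined
names(combined)  # "name" "height": the order of the first frame is kept
combined$height  # 58 72
```

cbind(), by contrast, simply places the columns side by side, so it is up to you to make sure that the rows of the two frames line up.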

7.8.4 Practice Exercises

  1. Consider the following vector:
creatures <- c("Mole", "Frog", "Rat", "Badger")

Write down what you think will be the result of the call:

order(creatures)

Then check your answer by actually running:

creatures <- c("Mole", "Frog", "Rat", "Badger")
order(creatures)
  2. What will be the result of the following?
order(creatures, decreasing = TRUE)
  3. Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (less experience coming before more experience).

  4. Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (more experience coming before less experience).

  5. Review the all_walk() function from the section on nested loops. Write a function called all_walk_df() that, instead of returning the total number of flowers picked, returns a data frame that records the sequence of flowers picked by each person. You may omit the option for a report along the way. Recall that the colors of the flowers in the field were:

flower_colors <- c("blue", "red", "pink", "crimson", "orange")

A typical example of use would be:

all_walk_df(
  people = c("Dorothy", "Scarecrow"),
  favs = c("crimson", "blue"),
  numbers = c(2, 1)
)
       name  flower
1   Dorothy  orange
2   Dorothy crimson
3   Dorothy crimson
4 Scarecrow     red
5 Scarecrow    blue

7.8.5 Solutions to Practice Exercises

  1. Here’s what you get:
order(creatures)
[1] 4 2 1 3
  2. Here’s what you get:
order(creatures, decreasing = TRUE)
[1] 3 1 2 4
  3. Here is one way:
CPS85[order(CPS85$wage, CPS85$exper), ]
  4. Here is one way:
CPS85[order(CPS85$wage, CPS85$exper,
            decreasing = c(FALSE, TRUE)), ]
  5. Try this:
## helper-function to make df for one person:
walk_meadow_df <- function(person, color, wanted) {
  picking <- TRUE
  ## the following will be extended to hold the flowers picked:
  flowers_picked <- character()
  desired_count <- 0
  while (picking) {
    picked <- sample(flower_colors, size = 1)
    flowers_picked <- c(flowers_picked, picked)
    if (picked == color) desired_count <- desired_count + 1
    if (desired_count == wanted) picking <- FALSE
  }
  ## return the data frame:
  data.frame(
    name = rep(person, times = length(flowers_picked)),
    flower = flowers_picked
  )
}

all_walk_df <- function(people, favs, numbers) {
  ## start with a data frame with 0 rows
  ## and columns named correctly:
  df <- data.frame(
    name = character(),
    flower = character()
  )
  for (i in seq_along(people)) {  # also safe when people is empty
    person <- people[i]
    fav <- favs[i]
    number <- numbers[i]
    person_df <- walk_meadow_df(
      person = person,
      color = fav,
      wanted = number
    )
    ## extend df:
    df <- rbind(df, person_df)
  }
  ## return the complete data frame:
  df
}
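
Growing a data frame with rbind() inside a loop is easy to read, but the frame is copied on every pass. For larger jobs, a common alternative is to collect the pieces in a list and bind them once at the end. The sketch below shows the pattern with a made-up helper, make_piece(), standing in for walk_meadow_df(); it is one possible design, not the solution given above:

```r
# hypothetical helper standing in for walk_meadow_df():
# returns a small data frame for one person
make_piece <- function(person, times) {
  data.frame(
    name = rep(person, times = times),
    step = seq_len(times)
  )
}

people <- c("Dorothy", "Scarecrow")
counts <- c(3, 2)

# build one data frame per person, then bind them all at once
pieces <- lapply(seq_along(people), function(i) {
  make_piece(people[i], counts[i])
})
result <- do.call(rbind, pieces)
nrow(result)  # 5
```

Here do.call() passes every element of pieces to rbind() as a separate argument, so the binding happens in a single call.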

The Main Ideas of This Chapter

  • Matrices are atomic vectors, with two additional attributes: number of rows, and number of columns.
  • Since matrices are vectors, you can subset them with the [-operator. You just have to account for rows and columns with a separating comma (e.g., myMatrix[3, 5]).
  • If you subset a matrix to get just one row or one column, then the result is “dropped” to an ordinary vector, unless you set drop to FALSE.
  • Arithmetic operations work element-wise on matrices, just like they do on vectors.
  • Like matrices, data frames are two dimensional, but their columns do not have to be all the same type of atomic vector.
  • You can access a column in a data frame with the $-operator (e.g., m111survey$fastest).
  • Subsetting data frames can be done with the [-operator, just like matrices.
  • You can also subset data frames with the subset() function.

Glossary

Matrix

An atomic vector that has two additional attributes: a number of rows and a number of columns.

Data Frame

A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.

Case (also called an Individual)

An individual unit under study. In a data frame in R, the rows correspond to cases.

Variable (in Data Analysis)

In data analysis, a variable is a measurement made on the individuals in a study.

Categorical Variable (in Data Analysis)

In data analysis, a categorical variable is a variable whose values cannot be expressed meaningfully by numbers.

Exercises

Exercise 1

  1. R has a function called t() that computes the transpose of a given matrix. This means that it switches around the rows and columns of the matrix, like this:
myMatrix <- matrix(1:24, nrow = 6)
myMatrix
     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    2    8   14   20
[3,]    3    9   15   21
[4,]    4   10   16   22
[5,]    5   11   17   23
[6,]    6   12   18   24
t(myMatrix)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24

Write your own function called transpose() that will perform the same task on any given matrix. The function should take a single parameter called mat, the matrix to be transposed. Of course you may NOT use t() in the code for your function!

Hint: Let’s solve the problem in a general way, on an example.

First, we set up an example, naming it mat because that’s the required name of the parameter in the function we are supposed to write:

mat <- matrix(1:12, nrow = 2)

Here is mat:

mat
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    5    7    9   11
[2,]    2    4    6    8   10   12

Next, we break mat down into just the vector of its elements:

elements <- as.vector(mat)

Let’s take a look at the elements:

elements
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Recall that our target is this matrix:

t(mat)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
[6,]   11   12

So we want to put the elements back into a matrix that has six rows and two columns. We need to do this in a general way:

matrix(elements, nrow = ncol(mat))
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    4   10
[5,]    5   11
[6,]    6   12

This got the right number of rows and columns, but the elements need to be filled in across rows, not down columns, so instead let’s try:

matrix(elements, nrow = ncol(mat), byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
[6,]   11   12

That worked!

So after we set up the example mat, the “work” we need to do is:

elements <- as.vector(mat)
matrix(elements, nrow = ncol(mat), byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
[6,]   11   12

You take it from here: encapsulate this work into the required function, and test it on some examples.

Exercise 2

R has functions called rowSums() and colSums() that will respectively sum the rows and the columns of a matrix. Here is an example:

myMatrix <- matrix(1:24, nrow = 6)
rowSums(myMatrix)
[1] 40 44 48 52 56 60

Your task is to write your own function called dimSum() that will sum either the rows or the columns of a given matrix. The function should have two parameters:

  • mat: the matrix to be summed.
  • dim: the dimension to sum along, either rows or columns. The default value should be "rows". If the user sets dim to "columns" then the function would compute the column-sums.

You may NOT use rowSums() or colSums() in the code for your function. A typical example of use should look like this:

myMatrix <- matrix(1:24, nrow = 6)
dimSum(myMatrix)
[1] 40 44 48 52 56 60
dimSum(myMatrix, "columns")
[1]  21  57  93 129

Hint: Recall that in the practice exercises (Section 7.2.2) we made a function called myRowSums() that sums the rows of any given matrix. Modify the idea for myRowSums() to write a function called myColSums() that finds the column-sums of any given matrix. You may then use the two previously-created functions to write the required function dimSum().

Exercise 3

Starting with m111survey in the bcscr package, write the code necessary to create a new data frame called smaller that consists precisely of the male students who believe in extraterrestrial life and who are more than 68 inches tall. The new data frame should contain all of the original variables except for sex and extra_life.

Exercise 4

Write a function called dfRandSelect() that randomly selects (without replacement) a specified number of rows from a given data frame. The function should have two parameters:

  • df: the data frame from which to select;
  • n: the number of rows to select.

If n is greater than the number of rows in df, the function should return immediately with a message telling the user that the required task is not possible and reporting the number of rows in df. Typical examples of use should be as follows:

dfRandSelect(bcscr::fuel, 5)
   speed efficiency
12   120       9.87
15   150      12.83
7     70       6.30
6     60       5.90
8     80       6.95
dfRandSelect(bcscr::fuel, 200)
No can do!  The frame has only 15 rows.

Hint: Use the function nrow(), which gives the number of rows of a matrix or data frame.

Exercise 5*

Create your own data frame, named myFrame. The frame should have 100 rows, along with the following variables:

  • lowerLetters: a character vector of randomly-produced 3-letter strings, like “chj”, “bbw”, and so on. The letters should all be lowercase.
  • height: a numerical vector consisting of real numbers chosen randomly between the values of 60 and 75.
  • sex: a factor whose possible values are “female” and “male”. Again, these values should be chosen randomly.

A call to str(myFrame) would come out like this (although your results will vary a bit since the vectors are constructed randomly):

str(myFrame)
'data.frame':   100 obs. of  3 variables:
 $ lowerLetters: chr  "usu" "uhl" "xyj" "uyd" ...
 $ height      : num  73.7 72.4 73.8 65.2 61.3 ...
 $ sex         : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 2 1 1 1 ...

summary() is useful when working with data frames. Here is how a call to summary(myFrame) might look:

summary(myFrame)
 lowerLetters           height          sex    
 Length:100         Min.   :60.00   female:57  
 Class :character   1st Qu.:63.63   male  :43  
 Mode  :character   Median :68.28              
                    Mean   :67.62              
                    3rd Qu.:71.63              
                    Max.   :74.57              

Hint: If you have a vector of three letters, such as

vec <- c("g", "a", "r")

then you can paste them together as follows:

paste0(vec, collapse = "")
[1] "gar"

Exercise 6*

Study the data frame fuel in the bcscr package. Note that the fuel efficiency is reported as the number of liters of fuel required to travel 100 kilometers. Look up the conversion between gallons and liters and between kilometers and miles, and use this information to create a new variable called mpg that gives the fuel efficiency as miles per gallon. While you are at it, create a new variable mph that gives the speed in miles per hour. Finally, add these new variables to the fuel data frame.

Exercise 7*

Use matrices to generalize the simulation in the Appeals Court Paradox (see Section 6.6). Your goal is to write a simulation function called appealsSimPlus() that comes with all the options provided in the text, but with additional parameters so that the user can choose:

  • the number of judges on the court;
  • the probability for each judge to make a correct decision;
  • the voting pattern (how many votes each judge gets).

A typical call to the function should look like this:

appealsSimPlus(
  reps = 10000,
  seed = 5252, 
  probs = c(0.95, 0.90, 0.90, 0.90, 0.80),
  votes = c(2, 1, 1, 1, 0)
)

In the above call the court consists of five judges. The best one decides cases correctly 95% of the time, three are right 90% of the time and one is right 80% of the time. The voting arrangement is that the best judge gets two votes, the next three get one vote each, and the worst gets no vote. Any voting scheme—even a scheme involving fractional votes—should be allowed so long as the votes add up to the number of judges.

Here is a hint. When you write the function it may be helpful to use the fact that rbinom() can take a prob parameter that is a vector of any length. Here’s an example:

results <- rbinom(6, size = 100, prob = c(0.10, 0.50, 0.90))
results
[1] 20 49 94 15 50 88

The first and fourth entries simulate a person tossing a coin 100 times when she has only a 10% chance of heads. The second and fifth entries simulate the same, when the chance of heads is 50%. The third and sixth simulate coin-tossing when there is a 90% chance of heads.

If you would like to arrange the results more nicely—say in a matrix where each column gives the results for a different person—you can do so:

resultsMat <- matrix(results, ncol = 3, byrow = TRUE)
resultsMat
     [,1] [,2] [,3]
[1,]   20   49   94
[2,]   15   50   88

Of course judges don’t flip a coin 100 times, they decide one case at a time. Suppose you have five judges with probabilities as follows:

probCorrect <- c(0.95, 0.90, 0.90, 0.90, 0.80)

If you would like to simulate the judges deciding, say, 6 cases, try this:

results <- rbinom(5*6, size = 1, prob = rep(probCorrect, 6))
resultsMat <- matrix(results, nrow = 6, byrow = TRUE)
resultsMat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    0    1
[2,]    0    1    1    1    1
[3,]    1    1    1    1    1
[4,]    1    1    1    1    1
[5,]    1    1    1    1    1
[6,]    1    1    1    1    0

When it comes to applying the voting pattern to compute the decision in each case, consider matrix multiplication. For example, suppose that the pattern is:

votes <- c(2, 1, 1, 1, 0)

Then make votes a one-column matrix and perform matrix multiplication:

correctVotes <- resultsMat %*% matrix(votes, nrow = 5)
correctVotes
     [,1]
[1,]    4
[2,]    3
[3,]    5
[4,]    5
[5,]    5
[6,]    5

Think about how to encapsulate all of this into a nice, general simulation function.


  1. Domain-specific languages (DSLs for short) stand in contrast to general-purpose programming languages that were designed to solve a wide variety of problems. Examples of important general-purpose languages include C and C++, Java, Python and Ruby. Although R is by now one of the most widely-used DSLs in the world, there are a number of other important ones, including Matlab and Octave for scientific computing, Emacs Lisp for the renowned Emacs editor, and SQL for querying databases. JavaScript is an interesting case: it started out as a DSL for web browsers, but has since expanded to power many web applications and is now being used to develop desktop applications as well.↩︎

  2. As an example outside of programming, consider what happens when you read a piece of literature “for structure.” You begin by asking: “What kind of literature is this? Is it drama, a novel, or something else?” The answer lets you know what to expect as you read: if it’s a novel, you know to suspend disbelief, whereas if it’s a journalistic piece then you know to examine critically whatever it presents as fact. Next, you might outline the piece. When you make an outline, you are breaking the piece up into parts, and indicating how the parts relate to each other to advance the plot and/or message of the piece. Note that in the process of “reading for structure” you are following the pattern of the definition of structure offered above.↩︎