7 Data Frames

Can one be a good data analyst without being a half-good programmer? The short answer to that is, ‘No.’ The long answer to that is, ‘No.’

—Frank Harrell

Up to this point we have given a great deal of attention to vectors, and we have always treated them as one-dimensional objects: a vector has a length, but not a “width.”

It is time to begin working in two dimensions. In this Chapter we will study matrices, which are simply vectors that have both length and width. Matrices are immensely useful for scientific computation in R, but for the most part we will treat them as a warm-up for data frames—the two-dimensional R-objects that are especially designed for the storage of data collected in the course of practical data analysis. Once you understand how to construct and manipulate data frames, you will be ready to learn how to visualize and analyze data using R.

7.1 Introduction to Matrices

In R, a matrix is actually an atomic vector—it can only hold one type of element—but with two extra attributes:

a certain number of rows, and
a certain number of columns.

One way to create a matrix is to take a vector and give it those two extra attributes, via the matrix() function. Here is an example:

# this is an ordinary atomic vector:
numbers <- 1:24
# make a matrix out of it:
numbersMat <- matrix(numbers, nrow = 6, ncol = 4)
# print it out:
numbersMat

     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    2    8   14   20
[3,]    3    9   15   21
[4,]    4   10   16   22
[5,]    5   11   17   23
[6,]    6   12   18   24

Of course if you are making a matrix out of 24 numbers and you know that it’s going to have 6 rows, then you know it must have 4 columns. Similarly, if you know the number of columns then the number of rows is determined. Hence you could have constructed the matrix with just one of the row or column arguments, like this:

numbersMat <- matrix(numbers, nrow = 6)

Notice that the numbers went down the first column, then down the second, and so on. If you would rather fill up the matrix row-by-row, then set the byrow parameter, which is FALSE by default, to TRUE. Try this:
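As a minimal sketch of the row-by-row fill (the name numbersByRow is only an illustrative choice, not something defined elsewhere in the chapter):

```r
numbers <- 1:24
# byrow = TRUE fills across each row before moving down to the next one:
numbersByRow <- matrix(numbers, nrow = 6, byrow = TRUE)
numbersByRow
```

The first row should now hold 1 through 4, the second row 5 through 8, and so on.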

Sometimes we like to give names to our rows, or to our columns, or even to both:

rownames(numbersMat) <- letters[1:6]
colnames(numbersMat) <- LETTERS[1:4]
numbersMat

  A  B  C  D
a 1  7 13 19
b 2  8 14 20
c 3  9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24

Matrices don’t have to be numerical. They can be character or logical matrices as well. Try this:
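Here is a small sketch of both kinds of matrix. The names charMat and logicalMat, and the particular values, are made up for illustration:

```r
# a character matrix built from the first six lowercase letters:
charMat <- matrix(letters[1:6], nrow = 2)
charMat
# a logical matrix; the six values fill the columns in order:
logicalMat <- matrix(c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE), nrow = 2)
logicalMat
# typeof() reports the single element type that each matrix holds:
typeof(charMat)
typeof(logicalMat)
```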

If you have to spread out the elements of a matrix into a one-dimensional vector, you can do so, like this:
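One way to do this, assuming as.vector() is the intended tool, is sketched below on a fresh copy of the 6-by-4 matrix:

```r
numbersMat <- matrix(1:24, nrow = 6)
# as.vector() drops the dimensions, reading down the columns in order:
flattened <- as.vector(numbersMat)
flattened
```

The elements come back in the same column-by-column order in which matrix() placed them.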

7.1.1 Practice Exercises

Tip 7.1: Creating Matrices From a Vector

Problem
Solutions

Let’s work with the following vector:

dozen <- letters[1:12]

Starting with dozen write a command that produces the following matrix:

     [,1] [,2] [,3] [,4]
[1,] "a"  "d"  "g"  "j" 
[2,] "b"  "e"  "h"  "k" 
[3,] "c"  "f"  "i"  "l"

Starting with dozen write a command that produces the following matrix:

     [,1] [,2] [,3]
[1,] "a"  "e"  "i" 
[2,] "b"  "f"  "j" 
[3,] "c"  "g"  "k" 
[4,] "d"  "h"  "l"

Starting with dozen write a command that produces the following matrix:

     [,1] [,2] [,3]
[1,] "a"  "b"  "c" 
[2,] "d"  "e"  "f" 
[3,] "g"  "h"  "i" 
[4,] "j"  "k"  "l"

Starting with dozen, write a sequence of commands that produce the following matrix:

   c1  c2  c3 
r1 "a" "b" "c"
r2 "d" "e" "f"
r3 "g" "h" "i"
r4 "j" "k" "l"

To produce:

     [,1] [,2] [,3] [,4]
[1,] "a"  "d"  "g"  "j" 
[2,] "b"  "e"  "h"  "k" 
[3,] "c"  "f"  "i"  "l"

Write:

matrix(dozen, nrow = 3)

Or:

matrix(dozen, ncol = 4)

To produce:

     [,1] [,2] [,3]
[1,] "a"  "e"  "i" 
[2,] "b"  "f"  "j" 
[3,] "c"  "g"  "k" 
[4,] "d"  "h"  "l"

Write:

matrix(dozen, nrow = 4)

Or:

matrix(dozen, ncol = 3)

To produce:

     [,1] [,2] [,3]
[1,] "a"  "b"  "c" 
[2,] "d"  "e"  "f" 
[3,] "g"  "h"  "i" 
[4,] "j"  "k"  "l"

Write:

matrix(dozen, nrow = 4, byrow = TRUE)

Or:

matrix(dozen, ncol = 3, byrow = TRUE)

To produce:

   c1  c2  c3 
r1 "a" "b" "c"
r2 "d" "e" "f"
r3 "g" "h" "i"
r4 "j" "k" "l"

Write:

answerMatrix <- matrix(dozen, nrow = 4, byrow = TRUE)
rownames(answerMatrix) <- c("r1", "r2", "r3", "r4")
colnames(answerMatrix) <- c("c1", "c2", "c3")
answerMatrix

Of course you could also create the matrix by specifying the desired number of columns!

Tip 7.2: From Matrix to Vector

Problem
Solution

Suppose you make the following matrix:

smallMat <- matrix(c(8, 5, 3, 4), nrow = 2)
smallMat

     [,1] [,2]
[1,]    8    3
[2,]    5    4

What’s a one-line command to get the following vector from smallMat?

[1] 8 5 3 4

Write your command here:

Try this:
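One command that does the job, assuming as.vector() is the tool the exercise has in mind:

```r
smallMat <- matrix(c(8, 5, 3, 4), nrow = 2)
# spread the matrix out into a plain vector, column by column:
as.vector(smallMat)
```

Calling c(smallMat) should give the same result, since c() also discards the dimensions.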

Tip 7.3: Finding the Number of Rows

Problem
Solution

The function installed.packages() returns a matrix of information about the packages that are installed in the R system on which one is working. Each row corresponds to a single package. You can see a printout of the matrix by calling the function here (but be forewarned, it’s a large matrix):

nrow() is a function that, when given a matrix, will tell you the number of rows in that matrix. Write a one-line command to find the number of installed packages:
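A possible one-liner is sketched here. The number it reports depends on the R installation, so no particular value should be expected:

```r
# each row of the returned matrix describes one installed package:
nrow(installed.packages())
```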

Tip 7.4: Finding the Number of Columns

Problem
Solution

ncol() is a function that, when given a matrix, will tell you the number of columns in that matrix. Write a one-line command to find the number of columns in the matrix returned by installed.packages().
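A possible one-liner, following the same pattern as for rows:

```r
# each column of the returned matrix holds one kind of information
# about the packages (the package name, its version, and so on):
ncol(installed.packages())
```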

7.2 Matrix Indexing

Matrices are incredibly useful in data analysis, but the primary reason we are talking about them now is to get you used to working in two dimensions. Let’s practice sub-setting with matrices.

We use the sub-setting operator [ to pick out parts of a matrix. For example, in order to get the element in the second row and third column of numbersMat, ask for:

numbersMat[2,3]

[1] 14

The row and column numbers are called indices.

If we want the entire second row, then we could ask for:

numbersMat[2,1:4]

 A  B  C  D 
 2  8 14 20

The result is a one-dimensional vector consisting of the elements in the second row of numbersMat. It inherits as its names the column names of numbersMat.

Actually, if you want the entire row you don’t have to specify which columns you want. Just leave the spot after the comma empty, like this:

numbersMat[2, ]

 A  B  C  D 
 2  8 14 20

What if you want some items on the second row, but only the items in columns 1, 2 and 4? Then frame your request in terms of a vector of column-indices:

numbersMat[2, c(1, 2, 4)]

 A  B  D 
 2  8 20

You can specify a vector of row-indices along with a vector of column-indices, if you like:

numbersMat[1:2, 1:3]

  A B  C
a 1 7 13
b 2 8 14

If the matrix has row or column names then you may use them in place of indices to make a selection:

numbersMat[, c("B", "D")]

You can use sub-setting to change the values of the elements of a matrix:

numbersMat[2,3] <- 0
numbersMat

  A  B  C  D
a 1  7 13 19
b 2  8  0 20
c 3  9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24

You can assign a value to an entire row:

numbersMat[2,] <- 0
numbersMat

  A  B  C  D
a 1  7 13 19
b 0  0  0  0
c 3  9 15 21
d 4 10 16 22
e 5 11 17 23
f 6 12 18 24

In the code above, the 0 was “recycled” into each of the four elements of the second row.

You can assign the elements of a vector to corresponding selected elements of a matrix:

numbersMat[2,] <- c(100, 200, 300, 400)
numbersMat

    A   B   C   D
a   1   7  13  19
b 100 200 300 400
c   3   9  15  21
d   4  10  16  22
e   5  11  17  23
f   6  12  18  24

7.2.1 To Drop or Not?

Note that when we ask for a single row of numbersMat we get a regular one-dimensional vector:

numbersMat[3, ]

 A  B  C  D 
 3  9 15 21

The same thing happens if we ask for a single column:

numbersMat[ , 2]

  a   b   c   d   e   f 
  7 200   9  10  11  12

We get the second column of numbersMat, but as a regular vector. It’s not a “column” anymore. (Note that it inherits the row names from numbersMat.)

When a subset of a matrix comes from only one row or column, R takes the opportunity to “drop” the class of the subset from “matrix” to “vector.” If you would like the subset to stay a matrix, set the drop parameter, which by default is TRUE, to FALSE. Thus the second column of numbersMat, kept as a matrix with six rows and one column, is found as follows:

numbersMat[ , 2, drop = FALSE]
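To see the difference concretely, a quick check with dim() can help. This sketch rebuilds a fresh 6-by-4 matrix so that it runs on its own:

```r
numbersMat <- matrix(1:24, nrow = 6)
# default behaviour: the single column collapses to a plain vector,
# and a plain vector has no dimensions, so dim() gives NULL:
dim(numbersMat[ , 2])
# with drop = FALSE the result is still a matrix, 6 rows by 1 column:
dim(numbersMat[ , 2, drop = FALSE])
```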

In most applications people want the simpler vector structure, so they usually leave drop at its default value.

7.2.2 Practice Exercises

In these exercises we will work with the practiceMatrix, a matrix with ten rows and four columns that prints out as:

   A  B  C  D
a  1 11 21 31
b  2 12 22 32
c  3 13 23 33
d  4 14 24 34
e  5 15 25 35
f  6 16 26 36
g  7 17 27 37
h  8 18 28 38
i  9 19 29 39
j 10 20 30 40
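The chapter does not show how practiceMatrix was built. One way to reconstruct a matrix that prints like the one above, offered as an assumption rather than as the original definition, is:

```r
# 1 through 40, filled down the columns, ten rows:
practiceMatrix <- matrix(1:40, nrow = 10)
rownames(practiceMatrix) <- letters[1:10]
colnames(practiceMatrix) <- LETTERS[1:4]
practiceMatrix
```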

Tip 7.5: Indexing

Problem
Solution

Write two different one-line commands that both produce this matrix:

   B  C
a 11 21
e 15 25

practiceMatrix[c(1,5), 2:3]
practiceMatrix[c("a","e"), LETTERS[2:3]]

Tip 7.6: Getting All the Odd Rows

Problem
Solution

Write a one-line command to get this matrix:

  A  B  C  D
a 1 11 21 31
c 3 13 23 33
e 5 15 25 35
g 7 17 27 37
i 9 19 29 39

practiceMatrix[seq(1, 9, by = 2), ]

Tip 7.7: Getting a Column

Problem
Solution

Write a one-line command to get this vector:

 a  b  c  d  e  f  g  h  i  j 
 1  2  3  4  5  6  7  8  9 10

practiceMatrix[ , 1]

Tip 7.8: Getting a Row

Problem
Solution

Write a one-line command to get this vector:

 A  B  C  D 
 2 12 22 32

practiceMatrix[2, ]

Tip 7.9: Getting a Column

Problem
Solution

Write a one-line command to get this matrix:

   A
a  1
b  2
c  3
d  4
e  5
f  6
g  7
h  8
i  9
j 10

practiceMatrix[ , 1, drop = FALSE]

Tip 7.10: Getting All Except One Row

Problem
Solution

Write a one-line command to get this matrix:

  A  B  C  D
a 1 11 21 31
b 2 12 22 32
c 3 13 23 33
d 4 14 24 34
e 5 15 25 35
f 6 16 26 36
g 7 17 27 37
h 8 18 28 38
i 9 19 29 39

practiceMatrix[-10, ]

Tip 7.11: Getting All Except Two Rows

Problem
Solution

Write a convenient one-line command to get this matrix:

  A  B  C  D
a 1 11 21 31
c 3 13 23 33
d 4 14 24 34
e 5 15 25 35
f 6 16 26 36
g 7 17 27 37
h 8 18 28 38
i 9 19 29 39

practiceMatrix[-c(2, 10), ]

Tip 7.12: A Function to Sum Rows

Problem
Solution

Write a function called myRowSums() that will find the sums of the rows of any given matrix. The function should use a for-loop (see the Chapter on Flow Control). The function should take a single parameter called mat, the matrix whose rows the user wishes to sum. It should work like this:

myRowSums(mat = practiceMatrix)

 [1]  64  68  72  76  80  84  88  92  96 100

Here is one way to write the function:

myRowSums <- function(mat) {
  ## find out how many rows there are:
  n <- nrow(mat)
  ## set up a results vector for the sums:
  sums <- numeric(n)
  ## loop to find and store the row sums:
  for (i in 1:n) {
    sums[i] <- sum(mat[i, ])
  }
  ## return the sums:
  sums
}

Let’s test it:

myRowSums(mat = practiceMatrix)

 [1]  64  68  72  76  80  84  88  92  96 100

7.3 Operations on Matrices

Matrices can be involved in arithmetical and logical operations.

7.3.1 Arithmetical Operations

The usual arithmetic operations apply to matrices, operating element-wise. For example, suppose that we have:

mat1 <- matrix(rep(1, 4), nrow = 2)
mat2 <- matrix(rep(2, 4), nrow = 2)

To get the sum of the above two matrices, R adds their corresponding elements and forms a new matrix out of their sums, thus:
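A minimal sketch of that element-wise sum (the name matSum is an illustrative choice):

```r
mat1 <- matrix(rep(1, 4), nrow = 2)
mat2 <- matrix(rep(2, 4), nrow = 2)
# corresponding elements are added: 1 + 2 in every cell
matSum <- mat1 + mat2
matSum
```

Every cell of the result should be 3, and the result keeps the 2-by-2 shape of its inputs.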

R applies recycling as needed. For example, suppose we have:

mat <- matrix(1:4, nrow = 2)
mat

     [,1] [,2]
[1,]    1    3
[2,]    2    4

In order to multiply each element of mat by 2, we need not create a 2-by-2 matrix of 2’s. We can simply multiply by 2, and R will take care of recycling the 2:

2 * mat

     [,1] [,2]
[1,]    2    6
[2,]    4    8

Or we could subtract 3 from each element of mat:

mat - 3

     [,1] [,2]
[1,]   -2    0
[2,]   -1    1

7.3.2 Logical Operations

Boolean operations apply to matrices element-wise, just as they do to ordinary vectors. The result is a matrix of logical values. For example, consider the original matrix numbersMat:

numbersMat <- matrix(1:24, nrow = 6)

Suppose we wish to determine which elements of numbersMat are odd. Then we simply ask whether the remainder of an element after division by 2 is equal to 1:
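As a sketch, the test can be written with the remainder operator %% (the name isOdd is an illustrative choice):

```r
numbersMat <- matrix(1:24, nrow = 6)
# TRUE wherever the remainder after division by 2 is 1:
isOdd <- numbersMat %% 2 == 1
isOdd
```

The result has the same six-by-four shape as numbersMat, with TRUE in the positions of the odd numbers.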

We can select elements from a matrix using a Boolean operator, too:
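For instance, a logical matrix placed inside the sub-setting operator picks out the elements where it is TRUE. A minimal sketch (the name odds is an illustrative choice):

```r
numbersMat <- matrix(1:24, nrow = 6)
# keep only the elements whose remainder after division by 2 is 1:
odds <- numbersMat[numbersMat %% 2 == 1]
odds
```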

Note that the result is an ordinary, one-dimensional vector.

7.3.3 Practice Exercises

We’ll work with the following three matrices:

a <- matrix(c(7, 4, 9, 10), nrow = 2)
a

     [,1] [,2]
[1,]    7    9
[2,]    4   10

b <- matrix(1:4, nrow = 2)
b

     [,1] [,2]
[1,]    1    3
[2,]    2    4

c <- matrix(letters[1:24], nrow = 6, byrow = TRUE)
c

     [,1] [,2] [,3] [,4]
[1,] "a"  "b"  "c"  "d" 
[2,] "e"  "f"  "g"  "h" 
[3,] "i"  "j"  "k"  "l" 
[4,] "m"  "n"  "o"  "p" 
[5,] "q"  "r"  "s"  "t" 
[6,] "u"  "v"  "w"  "x"

Tip 7.13: Addition

Problem
Solution

Find a one-line command using a that results in:

     [,1] [,2]
[1,]   10   12
[2,]    7   13

a + 3

     [,1] [,2]
[1,]   10   12
[2,]    7   13

Tip 7.14: Multiplication

Problem
Solution

Find a one-line command using a that results in:

     [,1] [,2]
[1,]   14   18
[2,]    8   20

2 * a

     [,1] [,2]
[1,]   14   18
[2,]    8   20

Tip 7.15: Raising to a Power

Problem
Solution

Find a one-line command using a that results in:

     [,1] [,2]
[1,]   49   81
[2,]   16  100

a^2

     [,1] [,2]
[1,]   49   81
[2,]   16  100

Tip 7.16: Subtraction

Problem
Solution

Find a one-line command using a and b that results in:

     [,1] [,2]
[1,]    6    6
[2,]    2    6

a - b

     [,1] [,2]
[1,]    6    6
[2,]    2    6

Tip 7.17: A Boolean Expression

Problem
Solution

Describe in words what the following command does:

a > 5

It produces a logical matrix of the same dimensions as a. The new matrix will have TRUE in a cell when the corresponding cell of a is greater than 5. Otherwise, the cell will have FALSE in it.

a > 5

      [,1] [,2]
[1,]  TRUE TRUE
[2,] FALSE TRUE

Tip 7.18: When is the Remainder 1?

Problem
Solution

Write a one-line command using a that tells you which elements of a are one more than a multiple of 3.

Here’s one way:

a %% 3 == 1

     [,1]  [,2]
[1,] TRUE FALSE
[2,] TRUE  TRUE

Tip 7.19: Alphabetical Order

Problem
Solution

Using c, write a one-line Boolean expression that produces the following:

      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE  TRUE
[3,]  TRUE  TRUE  TRUE  TRUE
[4,]  TRUE  TRUE  TRUE  TRUE
[5,]  TRUE  TRUE  TRUE  TRUE
[6,]  TRUE  TRUE  TRUE  TRUE

Here’s one way:

c >= "h"

      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE  TRUE
[3,]  TRUE  TRUE  TRUE  TRUE
[4,]  TRUE  TRUE  TRUE  TRUE
[5,]  TRUE  TRUE  TRUE  TRUE
[6,]  TRUE  TRUE  TRUE  TRUE

7.4 Introduction to Data Frames

R is sometimes spoken of as a domain-specific programming language, meaning that although it can in principle perform any sort of computation that a human can perform (given enough pencil, paper and time), it was originally designed to perform tasks in a particular area of application. R’s original area of application is data analysis and statistics, especially when performed interactively—i.e., in a setting where the analyst asks for a relatively small computation, examines the results, modifies his or her requests and asks again, and so on.¹ Although R can be used effectively for a wide range of programming tasks, data analysis is where it really shines.

The data structures of R reflect its orientation to data analysis. We have met a data-oriented structure already—the table, which is one of many convenient ways to display the results of data analysis. For the purpose of organizing data in preparation for analysis, R provides the structure known as the data frame. A data frame facilitates the storage of related data in one location, in a form that makes the most sense to human users.

A data frame is like a matrix in that it is two-dimensional—it has rows and columns. Unlike a matrix, though, the elements of a data frame do not have to be all of the same data-type. Each column of a data frame is a vector—of the same length as all the others—but these vectors may be of different types: some numerical, some logical, etc.

7.4.1 Viewing a Data Frame

Let’s take a close look at a data frame: the frame m111survey, which is available from the bcscr package (White 2025).

m111survey Info

Description

 Results of a survey of MAT 111 students at Georgetown College.
 
         • height.  How tall are you, in inches?
 
         • ideal_ht.  How tall would you LIKE to be, in inches?
 
         • sleep.  How much sleep did you get last night?
 
         • fastest.  What is the highest speed at which you have ever
           driven a car?
 
         • weight_feel.  How do you feel about your weight?
 
         • love_first.  Do you believe in love at first sight?
 
         • extra_life.  Do you believe in extraterrestrial life?
 
         • seat.  When you have a choice, where do you prefer to sit in
           a classroom?
 
         • GPA.  What is your college GPA?
 
         • enough_Sleep.  Do you think you get enough sleep?
 
         • sex.  What sex are you?
 
         • diff.  Your ideal height minus your actual height.

View

Data Table 7.1

To learn about the data frame in an R session, we would first attach the package itself:

library(bcscr)

In the R Studio IDE, we can get a look at the frame in a tab in the Editor pane if we use the View() function:

View(m111survey)

As with many objects provided by a package, we can get more information about it. Try this:

From the Help we see that m111survey records the results of a survey conducted in a number of sections of an elementary statistics course at Georgetown College. From the View we see that the frame is arranged in rows and columns. Each row corresponds to what in data analysis is known as a case or an individual: here, each row goes with a student who participated in the survey. The columns correspond to variables: measurements made on each individual. For a student on a given row, the values in the columns are the values recorded for that student.

When you are not working in R Studio, there are still a couple of ways to view the frame. You could print it all out to the console. (You can try this, but the output is going to be quite long and messy!)

You could also use the head() function to view a specified number of initial rows:
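A sketch of the idea, using a tiny made-up frame called miniSurvey as a stand-in (its column names imitate m111survey, but the values are invented). With bcscr attached, the same call should work with m111survey in place of miniSurvey:

```r
# hypothetical stand-in frame; the values are invented for illustration
miniSurvey <- data.frame(
  height  = c(70, 65, 68, 72, 61),
  fastest = c(105, 90, 160, 120, 150)
)
# show only the first three rows:
head(miniSurvey, n = 3)
```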

7.4.2 The Structure of a Data Frame

Further information about the frame may be obtained with the str() function. Try this:
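As a sketch of what str() reports, here it is applied to a small made-up frame (miniSurvey is a hypothetical stand-in with invented values; calling str(m111survey) with bcscr attached gives the output discussed in the rest of this section):

```r
# hypothetical stand-in frame; the values are invented for illustration
miniSurvey <- data.frame(
  height  = c(70, 65, 68),
  fastest = c(105L, 90L, 160L)
)
str(miniSurvey)
```

The first line of output should name the class and give the counts of observations and variables, followed by one line per column.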

The concept of structure extends far beyond the domain of computer programming.² In general the structure of any object consists of:

the kind of thing that the object is;
the parts the object is made up of;
the relationships between these parts—the rules, if you will, for how the parts work together to make the object do what it does.

In the case of m111survey the kind of thing is given by its class: it’s a data frame. By the way, the function class() will tell you the class of any given object:
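A short sketch of class() at work, on a made-up stand-in frame (miniSurvey and its values are hypothetical):

```r
# hypothetical stand-in frame; the values are invented for illustration
miniSurvey <- data.frame(
  height  = c(70, 65, 68),
  fastest = c(105L, 90L, 160L)
)
class(miniSurvey)          # "data.frame"
class(miniSurvey$height)   # "numeric"
class(miniSurvey$fastest)  # "integer"
```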

Next we see the account of the parts of the object and the way in which the parts relate to one another:

71 obs. of  12 variables

From this we know that there are 71 individuals in the study. The data consists of 12 “parts”—the variables—which are related in the sense that they all provide information about the same set of 71 people.

After that the output of str() launches into an account of the structure of each of the parts, for example:

$ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...

We are told the kind of thing that height is: it’s a numerical vector (a vector of type double, in fact). Next we are given the beginning of a statement of its parts: the heights of the individuals. So R is actually giving us the structure of the parts, as well as of the whole m111survey.

The variable fastest refers to the fastest speed—in miles per hour—that a person has ever driven a car. Note that it is a vector of type integer. Officially this is a numerical variable, too, but R is calling attention to the fact that the fastest-speed data is being stored as integers rather than as floating-point decimals.

The variables of a data frame are typically associated with the names of the frame. Try this:

By means of the names we can isolate a vector in any column, identified in our code in the format frame$variable. For example, to see the first ten elements of the fastest variable, we ask for:

In order to compute the mean fastest speed our subjects drove their cars, we can ask for:

If you want to see the speeds that are at least 150 miles per hour, you could ask for:

If you worry that the form frame$variable will require an annoying amount of typing (as seems to be the case in the example above), then you can use the with() function:
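The operations described in this subsection can be sketched together on a small made-up frame (miniSurvey is a hypothetical stand-in for m111survey, with invented values):

```r
# hypothetical stand-in frame; the values are invented for illustration
miniSurvey <- data.frame(
  height  = c(70, 65, 68, 72, 61),
  fastest = c(105, 90, 160, 120, 150)
)
names(miniSurvey)          # the names of the variables
miniSurvey$fastest[1:3]    # first three elements of one variable
mean(miniSurvey$fastest)   # mean of that variable
# speeds of at least 150, written in two equivalent ways:
miniSurvey$fastest[miniSurvey$fastest >= 150]
with(miniSurvey, fastest[fastest >= 150])
```

The last two lines should return the same vector. The with() version simply spares us from repeating the name of the frame.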

How with() Works

It’s instructive to consider how with() works. If we were to include the names of the parameters of with() explicitly, then the call would have looked like this:

with(data = m111survey, expr = fastest[fastest >= 150])

For the data parameter we can supply a data frame or any other R-object that can be used to construct an environment. In this case m111survey provides a miniature environment consisting of the names of its variables. For the expr parameter we supply an expression for R to evaluate. As R evaluates the expression, it encounters names (such as fastest). Now ordinarily R would first search whatever counts as the active environment (in this case it’s the Global Environment) for the names in the expression, but with() forces R to look first within the environment created by the data argument. In our example, R finds fastest inside m111survey and evaluates the expression on that basis. If it had not found fastest in m111survey, R would have moved on to the Global Environment and then the rest of the usual search path and (probably) would have found nothing, causing it to throw an “object not found” error message. In R, as in any other programming language, good programming depends very much on paying attention to how the language searches for the objects to which names refer.
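The search order can be seen in a small experiment. Everything here is made up for illustration: miniSurvey is a hypothetical stand-in frame, and the global objects fastest and cutoff exist only to show where R looks:

```r
# hypothetical stand-in frame; the values are invented for illustration
miniSurvey <- data.frame(fastest = c(105, 90, 160, 120, 150))
# two global objects: one shares its name with a column, one does not
fastest <- c(1, 2, 3)
cutoff  <- 150
# 'fastest' is found inside the frame first, so the global copy is ignored;
# 'cutoff' is not a column, so R falls back to the enclosing environment:
result <- with(miniSurvey, fastest[fastest >= cutoff])
result
```

The result should contain 160 and 150, values that come from the frame and not from the global vector named fastest.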

7.4.3 Factors

Some of the variables in m111survey are called factors; an example is seat, which pertains to where one prefers to sit in a classroom:

Seating preference is an example of a categorical variable: one whose values are not meaningfully expressed in terms of numbers. When a categorical variable has a relatively small number of possible values, it can be convenient to store its values in a vector of class factor.

The levels of a factor variable are its possible values. In the case of seat, these are: Front, Middle and Back. As a memory-saving measure, R stores the values in the factor as numbers, where 1 stands for the first level, 2 for the second level, and so on. But please bear in mind that we are dealing with a categorical variable, so the numbers don’t relate to the possible values in any natural way: they are just storage conventions.
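The numeric storage can be inspected directly. In this sketch seatPrefs is a made-up factor whose levels imitate those of seat:

```r
# hypothetical factor; the level order is set explicitly
seatPrefs <- factor(c("Front", "Back", "Middle", "Back"),
                    levels = c("Front", "Middle", "Back"))
levels(seatPrefs)       # "Front" "Middle" "Back"
# the underlying integer codes: the position of each value among the levels
as.integer(seatPrefs)   # 1 3 2 3
```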

It’s possible to create a factor from any type of vector, but most often this is done with a character vector. Suppose for instance, that eight people are asked for their favorite Wizard of Oz character and they answer:

ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

We can create a factor variable as follows:

factorFavs <- factor(ozFavs)
factorFavs

[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Dorothy Glinda Scarecrow Toto

Note that the levels are given in alphabetical order: this is the default procedure when R creates a factor. It is possible to ask for a different order, though:

factor(ozFavs, levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))

[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Toto Scarecrow Glinda Dorothy

In many instances it is appropriate to convert a character vector to a factor, but sometimes this is not such a great idea. Consider something like your address, or your favorite inspirational quote: pretty much every person in a study will have a different address or favorite quote than others in the study. Hence there won’t be any memory-storage benefit associated with creating a factor: the vector of levels—itself a character vector—would require as much storage space as the original character vector itself! In addition, we will see that the status of a variable as class “factor” can affect how R’s statistical and graphical functions deal with it. It’s not a good idea to treat a categorical variable as a factor unless its set of possible values is considered important.

We will think more about how to deal with factor variables later on, when we begin data analysis in earnest.

7.4.4 Practice Exercises

In these exercises we’ll work with a new data frame: railtrail.

railtrail Info

Description

 This data table is modified slightly from mosaicData::RailTrail, (see
 http://cran.r-project.org/web/packages/mosaicData/mosaicData.pdf).
 Description below is drawn from the mosaicData help file.

Format

 A data frame with 90 observations on the following variables.
 
         • hightemp daily high temperature (in degrees Fahrenheit)
 
         • lowtemp daily low temperature (in degrees Fahrenheit)
 
         • avgtemp average of daily low and daily high temperature (in
           degrees Fahrenheit)
 
         • season spring, summer or fall
 
         • cloudcover measure of cloud cover (in oktas)
 
         • precip measure of precipitation (in inches)
 
         • volume estimated number of trail users that day (number of
           breaks recorded)
 
         • weekday logical indicator of whether the day was a
           non-holiday weekday
 
         • dayType one of "weekday" or "weekend"

Details

 The Pioneer Valley Planning Commission (PVPC) collected data north of
 Chestnut Street in Florence, MA for ninety days from April 5, 2005 to
 November 15, 2005. Data collectors set up a laser sensor, with breaks
 in the laser beam recording when a rail-trail user passed the data
 collection station.
 
 There is a potential for error when two users trigger the infrared beam
 at exactly the same time since the counter would only log one of the
 crossings.  The collectors left the motion detector out during the
 winter, but because the counter drops data when the temperature falls
 below 14 degrees Fahrenheit, there is no data for the cold winter
 months.

View

Data Table 7.2

Tip 7.20: Learning About a Data Frame

Problem
Solution

In an R session, how would you learn more about the data frame railtrail from the bcscr package?

help("railtrail")

A shortcut for help() is the question-mark:

?railtrail

Tip 7.21: The First Few Rows

Problem
Solution

Write a one-line command to see the first 10 rows of railtrail in the Console.

head(railtrail, n = 10)

   hightemp lowtemp avgtemp cloudcover precip volume weekday dayType season
1        83      50    66.5        7.6   0.00    501    TRUE weekday summer
2        73      49    61.0        6.3   0.29    419    TRUE weekday summer
3        74      52    63.0        7.5   0.32    397    TRUE weekday spring
4        95      61    78.0        2.6   0.00    385   FALSE weekend summer
5        44      52    48.0       10.0   0.14    200    TRUE weekday spring
6        69      54    61.5        6.6   0.02    375    TRUE weekday spring
7        66      39    52.5        2.4   0.00    417    TRUE weekday spring
8        66      38    52.0        0.0   0.00    629   FALSE weekend spring
9        80      55    67.5        3.8   0.00    533   FALSE weekend summer
10       79      45    62.0        4.1   0.00    547    TRUE weekday summer

Tip 7.22: Names of the Variables

Problem
Solution

Write a one-line command to get the names of all of the variables in railtrail.

names(railtrail)

[1] "hightemp"   "lowtemp"    "avgtemp"    "cloudcover" "precip"    
[6] "volume"     "weekday"    "dayType"    "season"

Tip 7.23: High Temperatures

Problem
Solution

Regarding railtrail: write a one-line command to get the high temperature on all the days when the precipitation was more than 0.5 inches.

with(railtrail, hightemp[precip > 0.5])

[1] 97 81 64 84

Tip 7.24: Sorting Temperatures

Problem
Solution

Regarding railtrail: write a one-line command to sort the average temperatures from highest to lowest.

You can try this:

sort(railtrail$avgtemp, decreasing = TRUE)

This will also work:

with(railtrail, sort(avgtemp, decreasing = TRUE))

 [1] 84.0 80.5 79.0 79.0 78.0 78.0 77.5 77.0 76.5 74.0 73.0 73.0 71.5 71.0 68.5
[16] 68.0 67.5 66.5 66.5 66.5 66.5 66.0 64.5 64.5 64.5 63.0 63.0 62.5 62.0 62.0
[31] 61.5 61.0 61.0 61.0 60.0 60.0 60.0 59.5 59.5 59.0 58.5 56.5 56.5 56.5 55.5
[46] 55.0 55.0 55.0 55.0 54.5 54.5 53.5 53.0 52.5 52.5 52.5 52.5 52.0 52.0 51.5
[61] 50.5 50.0 50.0 49.5 49.0 49.0 49.0 48.5 48.0 48.0 47.5 47.5 47.0 47.0 47.0
[76] 47.0 46.5 46.5 46.0 46.0 45.0 45.0 45.0 43.0 42.5 42.5 39.5 38.0 35.0 33.0

7.5 Creating Data Frames

There are many ways to create data frames in R. Here we will introduce just two ways.

7.5.1 Creation from Vectors

Whenever you have vectors of the same length, you can combine them into a data frame, using the data.frame() function:

n <- c("Dorothy", "Lion", "Scarecrow")
h <- c(58, 75, 69)
a <- c(12, 0.04, 18)
ozFolk <- data.frame(name = n, height = h, age = a)
ozFolk

       name height   age
1   Dorothy     58 12.00
2      Lion     75  0.04
3 Scarecrow     69 18.00

Note that at the time of creation you can provide the variables with any names that you like. If later on you change your mind about the names, you can always revise them:

names(ozFolk)

[1] "name"   "height" "age"

names(ozFolk)[2] <- "Height"  # "height" was at index 2
ozFolk

       name Height   age
1   Dorothy     58 12.00
2      Lion     75  0.04
3 Scarecrow     69 18.00

7.5.2 Creation From Other Frames

If two frames have the same number of rows, you may combine their columns to form a new frame with the cbind() function:

ozMore <- data.frame(
  color = c("blue", "red", "yellow"),
  desire = c("Kansas", "courage", "brains")
)
cbind(ozFolk, ozMore)

       name Height   age  color  desire
1   Dorothy     58 12.00   blue  Kansas
2      Lion     75  0.04    red courage
3 Scarecrow     69 18.00 yellow  brains

Similarly, if two data frames have the same column names and compatible column types, then we can use the rbind() function to stack their rows:

ozFolk2 <- data.frame(
  name = c("Toto", "Glinda"),
  Height = c(12, 66), age = c(3, 246)
)
rbind(ozFolk, ozFolk2)

       name Height    age
1   Dorothy     58  12.00
2      Lion     75   0.04
3 Scarecrow     69  18.00
4      Toto     12   3.00
5    Glinda     66 246.00

Note: cbind() and rbind() work for matrices, too.
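A brief sketch with a small matrix (the names m, wider and taller are illustrative):

```r
m <- matrix(1:4, nrow = 2)
# attach a vector as a new third column:
wider <- cbind(m, c(5, 6))
wider
# attach a vector as a new third row:
taller <- rbind(m, c(7, 8))
taller
```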

7.6 Subsetting Data Frames

Our study of sub-setting matrices can be applied to the selection of parts of a data frame. As with a matrix, one or both of the dimensions of the frame can come into play.

We can create a new data frame consisting of any columns we like from the original frame. Try this:
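A sketch of column selection on a small made-up frame (miniSurvey is a hypothetical stand-in for m111survey, with invented values):

```r
# hypothetical stand-in frame; the values are invented for illustration
miniSurvey <- data.frame(
  height  = c(70, 65, 68),
  fastest = c(105, 90, 160),
  sex     = c("male", "female", "male")
)
# keep two of the three columns; the result is itself a data frame:
heightSex <- miniSurvey[, c("height", "sex")]
heightSex
```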

If we select just one column, then the result is a vector rather than a data frame:

df <- m111survey[, "height"]
is.vector(df)

[1] TRUE

If for some reason you want to prevent this, set drop to FALSE:

df <- m111survey[, "height", drop = FALSE]
head(df)

You may select particular rows, too. For example:

You can even select some of the rows at random. The code below selects six rows at random; try it a few times.
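A sketch of the technique on a made-up eight-row frame (miniSurvey and the name chosen are hypothetical; with bcscr attached, m111survey could take the place of miniSurvey):

```r
# hypothetical stand-in frame; the values are invented for illustration
miniSurvey <- data.frame(
  height  = c(70, 65, 68, 72, 61, 66, 74, 63),
  fastest = c(105, 90, 160, 120, 85, 150, 130, 95)
)
n <- nrow(miniSurvey)
# pick six distinct row numbers at random, then select those rows:
chosen <- sample(1:n, size = 6, replace = FALSE)
miniSurvey[chosen, ]
```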

Listing 7.1: An example of selecting random rows from a data frame.

Note the function nrow(), which gives the number of rows of the frame. When we sample six items without replacement from the vector 1:n, where n is that number of rows, we are picking six numbers at random from the row-numbers of the frame. Specifying these six numbers in the selection operator [ yields the desired random sample of rows.

7.6.1 Boolean Expressions

It is especially common to select rows by the values of a logical vector. For example, to select the rows where the fastest speed ever driven is at least 150 miles per hour, try this:
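
A minimal sketch of the same technique, assuming R's built-in mtcars frame and its horsepower column hp in place of m111survey and fastest:

```r
# a logical vector with one entry per row: TRUE where horsepower is at least 150
is_powerful <- mtcars$hp >= 150

# keep only the rows where the logical vector is TRUE:
powerful_cars <- mtcars[is_powerful, ]
powerful_cars[, c("hp", "mpg")]
```

The logical vector must have one entry for each row of the frame, which is why it is built from a column of that same frame.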

When you are selecting rows it can be convenient to use the subset() function. The first argument to the function is the frame from which you plan to select, and the second is the Boolean expression by which to select:

Note that we did not need to type m111survey$fastest: the first argument to subset() provides the environment in which to search for names that appear in the Boolean expression.
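
As a sketch of the difference, here are the two styles side by side, assuming R's built-in mtcars frame (the names `with_subset` and `with_brackets` are hypothetical):

```r
# with subset(), column names can be used directly in the condition:
with_subset <- subset(mtcars, hp >= 150)

# with the [ operator, the column must be reached through the frame:
with_brackets <- mtcars[mtcars$hp >= 150, ]

# check that both approaches select the same rows:
all.equal(with_subset, with_brackets)
```

One difference to keep in mind: subset() leaves out rows where the condition evaluates to NA, whereas the [ operator returns a row of NA values for each of them.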

The Boolean sub-setting expressions can be quite complex. For example, consider this:

Note: subset() takes a third parameter called select that allows you to pick out any desired columns. For example:
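
A minimal sketch that combines a compound Boolean expression with the select parameter, assuming R's built-in mtcars frame (the name `light_or_thrifty` is hypothetical):

```r
# rows: four-cylinder cars that are either light (wt is in 1000s of pounds)
#       or very fuel-efficient
# columns: only mpg, cyl and wt (unquoted names are allowed in select)
light_or_thrifty <- subset(
  mtcars,
  cyl == 4 & (wt < 2 | mpg > 30),
  select = c(mpg, cyl, wt)
)
light_or_thrifty
```

The parentheses matter here: without them, & would be applied before |, and the condition would mean something different.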

7.6.2 Practice Exercises

We’ll use the CPS85 data frame from the mosaicData package. You should go ahead and attach the package and then read about the data frame:

library(mosaicData)
?CPS85

CPS85 Info

Description

 The Current Population Survey (CPS) is used to supplement census
 information between census years. These data consist of a random sample
 of persons from the CPS85, with information on wages and other
 characteristics of the workers, including sex, number of years of
 education, years of work experience, occupational status, region of
 residence and union membership.

Format

 A data frame with 534 observations on the following variables.
 
 'wage' wage (US dollars per hour)
 
 'educ' number of years of education
 
 'race' a factor with levels 'NW' (nonwhite) or 'W' (white)
 
 'sex' a factor with levels 'F' 'M'
 
 'hispanic' a factor with levels 'Hisp' 'NH'
 
 'south' a factor with levels 'NS' 'S'
 
 'married' a factor with levels 'Married' 'Single'
 
 'exper' number of years of work experience (inferred from 'age' and
           'educ')
 
 'union' a factor with levels 'Not' 'Union'
 
 'age' age in years
 
 'sector' a factor with levels 'clerical' 'const' 'manag' 'manuf'
           'other' 'prof' 'sales' 'service'

View

Data Table 7.3

Each row in the data frame corresponds to an employee in the survey.

Tip 7.25: Counting the Rows

Problem
Solution

Write a command that gives the number of employees in the data frame.

nrow(CPS85)

[1] 534

Tip 7.26: Selecting Rows According to Age

Problem
Solution

Select the employees who are between 40 and 50 years old.

subset(CPS85, age > 40 & age < 50)

Or, without subset():

CPS85[CPS85$age > 40 & CPS85$age < 50, ]

Tip 7.27: More Selection of Rows

Problem
Solution

Select the employees who are married and have fewer than 30 years of experience.

subset(CPS85, married == "Married" & exper < 30)

Tip 7.28: Selecting Rows Again

Problem
Solution

Select the nonunion employees who either live in the South or who have more than 12 years of education (or both).

subset(CPS85, union == "Not" & (south == "S" | educ > 12))

Tip 7.29: Several Sectors

Problem
Solution

Select the employees who work in the clerical, construction, management or professional sector.

You can try this:

subset(
  CPS85,
  # the levels of sector are abbreviated: see the Format description above
  sector %in% c("clerical", "const", "manag", "prof")
)

Tip 7.30: Selecting Rows and Columns

Problem
Solution

Select the employees who make more than 30 dollars per hour, and keep only their wage, sex and sector of employment:

You can try this:

subset(
  CPS85, 
  wage > 30,
  select = c("wage", "sex", "sector")
  )

Or, without subset():

CPS85[CPS85$wage > 30, c("wage", "sex", "sector")]

Tip 7.31: Selecting Rows at Random

Problem
Solution

Select 10 employees at random, keeping only their wage and sex.

Here is one way:

random_rows <- sample(1:nrow(CPS85), size = 10, replace = FALSE)
CPS85[random_rows, c("wage", "sex")]

    wage sex
450 3.75   M
483 6.40   F
486 5.13   F
181 6.75   M
262 3.00   F
88  9.37   M
93  4.00   M
128 5.50   M
131 9.50   F
417 9.50   M

Tip 7.32: Selecting All Except Certain Columns

Problem
Solution

Select all of the employees, keeping all information about them except for their union status and whether or not they are from the South.

Try this (south and union are columns 6 and 9, respectively):

CPS85[ , -c(6, 9)]

The select parameter of the subset() function allows you to specify columns to omit by name, so the following is another solution:

subset(CPS85, select = -c(south, union))

7.7 New Variables from Old

Quite often you will want to transform one or more variables in a data frame. Transforming a variable means changing its values in a systematic way.

For example, you might want to measure height in feet rather than inches. Then you want the following:

heightInFeet <- with(m111survey, height/12)  # 12 inches in a foot

If you plan to use this new variable in your analysis later on, it might be a good idea to add it to the data frame:

m111survey$height_ft <- heightInFeet

Another common need is to recode the values of a categorical variable. For example, you might want to divide people into two groups: those who prefer to sit in the back and those who don’t. This is a good time to use ifelse():

seat2 <- ifelse(m111survey$seat == "3_back", "Back", "Other")
m111survey$seat2 <- seat2

If you plan to re-code into a variable that involves more than two values, then you might want to look into the mapvalues() function from the plyr package (Wickham 2023):

seat3 <- plyr::mapvalues(
  m111survey$seat,
  from = c("1_front", "2_middle", "3_back"),
  to = c("Front", "Middle", "Back")
)
str(seat3)

 Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...

Another common transformation involves turning a numerical variable into a factor. For example, we might need to classify people as:

Tall (height over 70 inches)
Medium (65 - 70 inches)
Short (less than 65 inches)

The cut() function will be helpful.

Listing 7.2: An illustration of the cut() function

heightClass <- cut(
  m111survey$height,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium","Tall"),
  right = TRUE
)
str(heightClass)

 Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 2 3 1 2 ...

Setting right = TRUE (the default) indicates that the upper bound of each interval is included in the interval. Thus, a person with a height of 70 inches is classed as Medium, not Tall, and a person with a height of exactly 65 inches is classed as Short.
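
To see the effect of the right parameter at the boundaries, here is a small sketch with made-up heights (the vector `heights` and the names `closed_right` and `closed_left` are hypothetical):

```r
heights <- c(64, 65, 68, 70, 71)  # made-up heights, in inches

# right = TRUE: the intervals are (-Inf, 65], (65, 70], (70, Inf]
closed_right <- cut(
  heights,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium", "Tall"),
  right = TRUE
)

# right = FALSE: the intervals are [-Inf, 65), [65, 70), [70, Inf)
closed_left <- cut(
  heights,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium", "Tall"),
  right = FALSE
)

# compare the two classifications side by side:
data.frame(heights, closed_right, closed_left)
```

Only the heights that fall exactly on a break (65 and 70) are classified differently by the two calls.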

7.7.1 Getting Rid of Variables

We have added several variables to m111survey. In order to remove them (or any other variables we don’t want) we can assign them the value NULL.

names(m111survey)

 [1] "height"          "ideal_ht"        "sleep"           "fastest"        
 [5] "weight_feel"     "love_first"      "extra_life"      "seat"           
 [9] "GPA"             "enough_Sleep"    "sex"             "diff.ideal.act."
[13] "height_ft"       "seat2"

m111survey$height_ft <- NULL
m111survey$seat2 <- NULL
m111survey$seat3 <- NULL
names(m111survey)  # the extra variables are gone

 [1] "height"          "ideal_ht"        "sleep"           "fastest"        
 [5] "weight_feel"     "love_first"      "extra_life"      "seat"           
 [9] "GPA"             "enough_Sleep"    "sex"             "diff.ideal.act."

7.7.2 Practice Exercises

For practice, we’ll create a copy of mosaicData::CPS85:

library(mosaicData)
CPS_practice <- CPS85

Tip 7.33: Removing Some Variables

Problem
Solution

Remove the variables hispanic and married from the CPS_practice data frame.

Here’s one way to do it:

CPS_practice$hispanic <- NULL
CPS_practice$married <- NULL

Tip 7.34: Transforming a Variable

Problem
Solution

Change the units of wage in CPS_practice from dollars per hour to dollars per day. Assume an eight-hour working day.

Here’s one way to do it:

CPS_practice$wage <- CPS_practice$wage * 8

Tip 7.35: Creating a New Variable

Problem
Solution

For CPS_practice, create a new variable experGrp that has the following values

low for experience less than 10 years;
medium for experience of at least 10 years but less than 25 years;
high for experience at least 25 years.

Next, use the experGrp variable to create the following tally of the employees by experience group:

experGrp
   low medium   high 
   179    217    138

First use cut() to make the new variable:

## make the new variable with cut():
CPS_practice$experGrp <- cut(
  CPS_practice$exper,
  breaks = c(-Inf, 10, 25, Inf),
  labels = c("low", "medium", "high"),
  right = FALSE  # so that exactly 10 counts as medium and exactly 25 as high
)

Then it’s easy to use table() (see Section 6.3.3) to make the required table:

table(CPS_practice$experGrp)

7.8 More in Depth

7.8.1 Matrix Multiplication

This section may interest you if you know about matrix multiplication in linear algebra.

In order to accomplish matrix multiplication, we have to keep in mind that the regular multiplication operator * works element-wise on matrices, as we have already seen. For matrix multiplication R provides the special operator %*%. For example, consider the following matrices:

a <- matrix(1:6, ncol = 3)
a

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

b <- matrix(c(2, 1, -1), nrow = 3)
b

     [,1]
[1,]    2
[2,]    1
[3,]   -1

Observe that the number of columns of a is equal to the number of rows of b. Hence it is possible to form the matrix product a %*% b:

a %*% b

     [,1]
[1,]    0
[2,]    2

As expected, the result is a matrix having as many rows as a and as many columns as b.

It is also interesting to recall how matrix multiplication works when the second matrix has only one column. The product is obtained by multiplying each column of a by the element on the corresponding row of b, and adding the resulting matrices:

b[1, 1] * a[, 1, drop = FALSE] + b[2, 1] * a[, 2, drop = FALSE] + b[3, 1] * a[, 3, drop = FALSE]

     [,1]
[1,]    0
[2,]    2

7.8.2 Ordering Data Frames

You can reorder as well as select. For example, the following code selects two columns of m111survey, then takes the first five rows in reverse order:

df <- m111survey[, c("height", "ideal_ht")]
dfRev <- df[5:1, ]
head(dfRev)

  height ideal_ht
5     72       72
4     62       65
3     64       NA
2     74       76
1     76       78

If you want, you can even scramble the rows of the data frame in a random order:

n <- nrow(m111survey)
shuffle <- sample(1:n, size = n, replace = FALSE)
df <- m111survey[shuffle, ]
head(df[c("sex", "seat")])  #show just two columns

      sex     seat
25 female 2_middle
51 female 2_middle
69 female  1_front
52 female 2_middle
64   male   3_back
13 female  1_front

It is quite common to order the rows of a frame according to the values of a particular variable. For example, you might want to arrange the rows by height, so that the frame begins with the shortest subject and ends with the tallest.

Accomplishing this task requires a study of R’s order() function. Consider the following vector:

vec <- c(15, 12, 23, 7)

Call order() with this vector as an argument:

order(vec)

[1] 4 2 1 3

order() returns the indices of the elements of vec, in the following order:

the index of the smallest element (7, at index 4 of vec);
the index of the second-smallest element (12, at index 2 of vec);
the index of the third-smallest element (15, at index 1 of vec);
the index of the largest element (23, at index 3 of vec).

Can you guess the output of the following function-call without looking for the answer underneath?

vec[order(vec)]

[1]  7 12 15 23

Sure enough, the result is vec sorted: from smallest to largest element.

Now the sorting of vec could have been accomplished with R’s sort() function:

sort(vec)

[1]  7 12 15 23

The power of order() comes with the rearrangement of rows of a data frame. In order to “sort” the frame from shortest to tallest subject, call:

df <- m111survey[order(m111survey$height), ]
head(df[, c("sex", "height")])  # to show that it worked

      sex height
45 female     51
26 female     54
9  female     59
13 female     59
40 female     60
69 female     61

If you want to order the rows from tallest to shortest instead, then use the decreasing parameter, which by default is FALSE:

df <- m111survey[order(m111survey$height, decreasing = TRUE), ]
head(df[, c("sex", "height")])  # to show that it worked

      sex height
8    male     79
14 female     78
1    male     76
58   male     76
34   male     75
54   male     75

Sometimes you want to order by two or more variables. For example suppose you want to arrange the frame so that the folks preferring to sit in front come first, followed by the people who prefer the middle and ending with the people who prefer the back. Within these groups you would like people to be arranged from shortest to tallest. Then call:

ordering <- with(m111survey, order(seat, height))
df <- m111survey[ordering, ]
head(df[, c("seat", "height")], n = 10)  # see if it worked

      seat height
45 1_front     51
26 1_front     54
13 1_front     59
69 1_front     61
4  1_front     62
12 1_front     62
23 1_front     63
38 1_front     63
61 1_front     63
57 1_front     64
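
When the variables should run in different directions, one common idiom is to negate a numerical variable inside order(). The following is a minimal sketch on a made-up frame (`folks` and its values are hypothetical, not taken from m111survey):

```r
# a made-up frame, for illustration only
folks <- data.frame(
  seat = c("back", "front", "front", "back", "front"),
  height = c(70, 62, 68, 66, 65)
)

# seat in increasing (alphabetical) order; height is negated so that,
# within each seat, taller people come before shorter ones
ordering <- with(folks, order(seat, -height))
folks[ordering, ]
```

Negation only works for numerical variables; for other types, the decreasing parameter of order() is the tool to reach for.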

7.8.3 Combining With `rbind()` and `cbind()`

If two matrices have the same number of rows, then you can bind their columns together to create a new matrix, using the cbind() function:

lowercase <- matrix(letters, nrow = 13)
uppercase <- matrix(LETTERS, nrow = 13)
both_cases <- cbind(lowercase, uppercase)
both_cases

      [,1] [,2] [,3] [,4]
 [1,] "a"  "n"  "A"  "N" 
 [2,] "b"  "o"  "B"  "O" 
 [3,] "c"  "p"  "C"  "P" 
 [4,] "d"  "q"  "D"  "Q" 
 [5,] "e"  "r"  "E"  "R" 
 [6,] "f"  "s"  "F"  "S" 
 [7,] "g"  "t"  "G"  "T" 
 [8,] "h"  "u"  "H"  "U" 
 [9,] "i"  "v"  "I"  "V" 
[10,] "j"  "w"  "J"  "W" 
[11,] "k"  "x"  "K"  "X" 
[12,] "l"  "y"  "L"  "Y" 
[13,] "m"  "z"  "M"  "Z"

If two matrices have the same number of columns, then you can bind their rows together, with rbind():

lowercase2 <- matrix(letters, ncol = 13)
uppercase2 <- matrix(LETTERS, ncol = 13)
both_cases2 <- rbind(lowercase2, uppercase2)
both_cases2

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "c"  "e"  "g"  "i"  "k"  "m"  "o"  "q"  "s"   "u"   "w"   "y"  
[2,] "b"  "d"  "f"  "h"  "j"  "l"  "n"  "p"  "r"  "t"   "v"   "x"   "z"  
[3,] "A"  "C"  "E"  "G"  "I"  "K"  "M"  "O"  "Q"  "S"   "U"   "W"   "Y"  
[4,] "B"  "D"  "F"  "H"  "J"  "L"  "N"  "P"  "R"  "T"   "V"   "X"   "Z"

rbind() and cbind() work with data frames, too. Here, we use rbind() to add a new row to a data frame:

n <- c("Dorothy", "Scarecrow", "Lion")
h <- c(58, 75, 69)
a <- c(12, 0.04, 18)
oz_folk <- data.frame(name = n, height = h, age = a)
one_more_person <- data.frame(
  name = "Tin Man",
  height = 72,
  age = 24
)
all_together <- rbind(oz_folk, one_more_person)
all_together

       name height   age
1   Dorothy     58 12.00
2 Scarecrow     75  0.04
3      Lion     69 18.00
4   Tin Man     72 24.00

We can add new columns as well:

new_properties <- data.frame(
  desire = c("Kansas", "brains", "courage", "a heart"),
  fav_color = c("crimson", "blue", "burlywood", "orange")
)
cbind(all_together, new_properties)

       name height   age  desire fav_color
1   Dorothy     58 12.00  Kansas   crimson
2 Scarecrow     75  0.04  brains      blue
3      Lion     69 18.00 courage burlywood
4   Tin Man     72 24.00 a heart    orange

7.8.4 Practice Exercises

Consider the following vector:

creatures <- c("Mole", "Frog", "Rat", "Badger")

Write down what you think will be the result of the call:

order(creatures)

Then check your answer by actually running:

creatures <- c("Mole", "Frog", "Rat", "Badger")
order(creatures)

What will be the result of the following?

order(creatures, decreasing = TRUE)

Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (less experience coming before more experience).
Arrange the rows of the data frame mosaicData::CPS85 in order, from the lowest to the highest wage. Break ties by experience (more experience coming before less experience).
Review the all_walk() function from the section on nested loops. Write a function called all_walk_df() that, instead of returning the total number of flowers picked, returns a data frame that records the sequence of flowers picked by each person. You may omit the option for a report along the way. Recall that the colors of the flowers in the field were:

flower_colors <- c("blue", "red", "pink", "crimson", "orange")

A typical example of use would be:

all_walk_df(
  people = c("Dorothy", "Scarecrow"),
  favs = c("crimson", "blue"),
  numbers = c(2, 1)
)

       name  flower
1   Dorothy  orange
2   Dorothy crimson
3   Dorothy crimson
4 Scarecrow     red
5 Scarecrow    blue

7.8.5 Solutions to Practice Exercises

Here’s what you get:

order(creatures)

[1] 4 2 1 3

Here’s what you get:

order(creatures, decreasing = TRUE)

[1] 3 1 2 4

Here is one way:

CPS85[order(CPS85$wage, CPS85$exper), ]

Here is one way:

CPS85[order(CPS85$wage, CPS85$exper, 
            decreasing = c(FALSE, TRUE)), ]

Try this:

## helper-function to make df for one person:
walk_meadow_df <- function(person, color, wanted) {
  picking <- TRUE
  ## the following will be extended to hold the flowers picked:
  flowers_picked <- character()
  desired_count <- 0
  while (picking) {
    picked <- sample(flower_colors, size = 1)
    flowers_picked <- c(flowers_picked, picked)
    if (picked == color) desired_count <- desired_count + 1
    if (desired_count == wanted) picking <- FALSE
  }
  ## return the data frame:
  data.frame(
    name = rep(person, times = length(flowers_picked)),
    flower = flowers_picked
  )
}

all_walk_df <- function(people, favs, numbers) {
  ## start with a data frame with 0 rows
  ## and columns named correctly:
  df <- data.frame(
    name = character(),
    flower = character()
  )
  for (i in seq_along(people)) {
    person <- people[i]
    fav <- favs[i]
    number <- numbers[i]
    person_df <- walk_meadow_df(
      person = person,
      color = fav,
      wanted = number
    )
    ## extend df:
    df <- rbind(df, person_df)
  }
  ## return the complete data frame:
  df
}

The Main Ideas of This Chapter

Matrices are atomic vectors, with two additional attributes: number of rows, and number of columns.
Since matrices are vectors, you can subset them with the [-operator. You just have to account for rows and columns with a separating comma (e.g., myMatrix[3, 5]).
If you subset a matrix to get just one row or one column, then the result is “dropped” to an ordinary vector, unless you set drop to FALSE.
Arithmetic operations work element-wise on matrices, just like they do on vectors.
Like matrices, data frames are two dimensional, but their columns do not have to be all the same type of atomic vector.
You can access a column in a data frame with the $-operator (e.g., m111survey$fastest).
Subsetting data frames can be done with the [-operator, just like matrices.
You can also subset data frames with the subset() function.

Glossary

Matrix: An atomic vector that has two additional attributes: a number of rows and a number of columns.
Data Frame: A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.
Case (also called an Individual): An individual unit under study. In a data frame in R, the rows correspond to cases.
Variable (in Data Analysis): In data analysis, a variable is a measurement made on the individuals in a study.
Categorical Variable (in Data Analysis): In data analysis, a categorical variable is a variable whose values cannot be expressed meaningfully by numbers.

Exercises

Exercise 1

R has a function called t() that computes the transpose of a given matrix. This means that it switches around the rows and columns of the matrix, like this:

myMatrix <- matrix(1:24, nrow = 6)
myMatrix

     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    2    8   14   20
[3,]    3    9   15   21
[4,]    4   10   16   22
[5,]    5   11   17   23
[6,]    6   12   18   24

t(myMatrix)

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24

Write your own function called transpose() that will perform the same task on any given matrix. The function should take a single parameter called mat, the matrix to be transposed. Of course you may NOT use t() in the code for your function!

Hint: Let’s solve the problem in a general way, on an example.

First, we set up an example, naming it mat because that’s the required name of the parameter in the function we are supposed to write:

mat <- matrix(1:12, nrow = 2)

Here is mat:

mat

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    5    7    9   11
[2,]    2    4    6    8   10   12

Next, we break mat down into just the vector of its elements:

elements <- as.vector(mat)

Let’s take a look at the elements:

elements

 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Recall that our target is this matrix:

t(mat)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
[6,]   11   12

So we want to put the elements back into a matrix that has six rows and two columns. We need to do this in a general way:

matrix(elements, nrow = ncol(mat))

     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    4   10
[5,]    5   11
[6,]    6   12

This got the right number of rows and columns, but the elements need to be filled in across rows, not down columns, so instead let’s try:

matrix(elements, nrow = ncol(mat), byrow = TRUE)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
[6,]   11   12

That worked!

So after we set up the example mat, the “work” we need to do is:

elements <- as.vector(mat)
matrix(elements, nrow = ncol(mat), byrow = TRUE)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
[6,]   11   12

You take it from here: encapsulate this work into the required function, and test it on some examples.

Exercise 2

R has functions called rowSums() and colSums() that will respectively sum the rows and the columns of a matrix. Here is an example:

myMatrix <- matrix(1:24, nrow = 6)
rowSums(myMatrix)

[1] 40 44 48 52 56 60

Your task is to write your own function called dimSum() that will sum either the rows or the columns of a given matrix. The function should have two parameters:

mat: the matrix to be summed.
dim: the dimension to sum along, either rows or columns. The default value should be "rows". If the user sets dim to "columns" then the function would compute the column-sums.

You may NOT use rowSums() or colSums() in the code for your function. A typical example of use should look like this:

myMatrix <- matrix(1:24, nrow = 6)
dimSum(myMatrix)

[1] 40 44 48 52 56 60

dimSum(myMatrix, "columns")

[1]  21  57  93 129

Hint: Recall that in Practice 7.12 we made a function called myRowSums() that sums the rows of any given matrix. Modify the idea for myRowSums() to write a function called myColSums() that finds the column-sums of any given matrix. You may then use the two previously-created functions to write the required function dimSum().

Exercise 3

Starting with m111survey in the bcscr package, write the code necessary to create a new data frame called smaller that consists precisely of the male students who believe in extraterrestrial life and who are more than 68 inches tall. The new data frame should contain all of the original variables except for sex and extra_life.

Exercise 4

Write a function called dfRandSelect() that randomly selects (without replacement) a specified number of rows from a given data frame. The function should have two parameters:

df: the data frame from which to select;
n: the number of rows to select.

If n is greater than the number of rows in df, the function should return immediately with a message informing the user that the required task is not possible and reporting the number of rows in df. Typical examples of use should be as follows:

dfRandSelect(bcscr::fuel, 5)

   speed efficiency
12   120       9.87
15   150      12.83
7     70       6.30
6     60       5.90
8     80       6.95

dfRandSelect(bcscr::fuel, 200)

No can do!  The frame has only 15 rows.

Hint: Review Listing 7.1 and Practice 7.31.

Exercise 5*

Create your own data frame, named myFrame. The frame should have 100 rows, along with the following variables:

lowerLetters: a character vector of randomly-produced 3-letter strings, like “chj”, “bbw”, and so on. The letters should all be lowercase.
height: a numerical vector consisting of real numbers chosen randomly between the values of 60 and 75.
sex: a factor whose possible values are “female” and “male”. Again, these values should be chosen randomly.

A call to str(myFrame) would come out like this (although your results will vary a bit since the vectors are constructed randomly):

str(myFrame)

'data.frame':   100 obs. of  3 variables:
 $ lowerLetters: chr  "usu" "uhl" "xyj" "uyd" ...
 $ height      : num  73.7 72.4 73.8 65.2 61.3 ...
 $ sex         : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 2 1 1 1 ...

summary() is useful when working with data frames. Here is how a call to summary(myFrame) might look:

summary(myFrame)

 lowerLetters           height          sex    
 Length:100         Min.   :60.00   female:57  
 Class :character   1st Qu.:63.63   male  :43  
 Mode  :character   Median :68.28              
                    Mean   :67.62              
                    3rd Qu.:71.63              
                    Max.   :74.57

Hint: If you have a vector of three letters, such as

vec <- c("g", "a", "r")

then you can paste them together as follows:

paste0(vec, collapse = "")

[1] "gar"

Exercise 6*

Study the data frame fuel in the bcscr package. Note that the fuel efficiency is reported as the number of liters of fuel required to travel 100 kilometers. Look up the conversion between gallons and liters and between kilometers and miles, and use this information to create a new variable called mpg that gives the fuel efficiency as miles per gallon. While you are at it, create a new variable mph that gives the speed in miles per hour. Finally, add these new variables to the fuel data frame.

Exercise 7*

Use matrices to generalize the simulation in the Appeals Court Paradox (see Section 6.6). Your goal is to write a simulation function called appealsSimPlus() that comes with all the options provided in the text, but with additional parameters so that the user can choose:

the number of judges on the court;
the probability for each judge to make a correct decision;
the voting pattern (how many votes each judge gets).

A typical call to the function should look like this:

appealsSimPlus(
  reps = 10000,
  seed = 5252, 
  probs = c(0.95, 0.90, 0.90, 0.90, 0.80),
  votes = c(2, 1, 1, 1, 0)
)

In the above call the court consists of five judges. The best one decides cases correctly 95% of the time, three are right 90% of the time and one is right 80% of the time. The voting arrangement is that the best judge gets two votes, the next three get one vote each, and the worst gets no vote. Any voting scheme—even a scheme involving fractional votes—should be allowed so long as the votes add up to the number of judges.

Here is a hint. When you write the function it may be helpful to use the fact that rbinom() can take a prob parameter that is a vector of any length. Here’s an example:

results <- rbinom(6, size = 100, prob = c(0.10, 0.50, 0.90))
results

[1] 20 49 94 15 50 88

The first and fourth entries simulate a person tossing a coin 100 times when she has only a 10% chance of heads on each toss. The second and fifth entries simulate the same, when the chance of heads is 50%. The third and sixth simulate coin-tossing when there is a 90% chance of heads.

If you would like to arrange the results more nicely—say in a matrix where each column gives the results for a different person—you can do so:

resultsMat <- matrix(results, ncol = 3, byrow = TRUE)
resultsMat

     [,1] [,2] [,3]
[1,]   20   49   94
[2,]   15   50   88

Of course judges don’t flip a coin 100 times, they decide one case at a time. Suppose you have five judges with probabilities as follows:

probCorrect <- c(0.95, 0.90, 0.90, 0.90, 0.80)

If you would like to simulate the judges deciding, say, 6 cases, try this:

results <- rbinom(5 * 6, size = 1, prob = rep(probCorrect, 6))
resultsMat <- matrix(results, nrow = 6, byrow = TRUE)
resultsMat

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    0    1
[2,]    0    1    1    1    1
[3,]    1    1    1    1    1
[4,]    1    1    1    1    1
[5,]    1    1    1    1    1
[6,]    1    1    1    1    0

When it comes to applying the voting pattern to compute the decision in each case, consider matrix multiplication. For example, suppose that the pattern is:

votes <- c(2, 1, 1, 1, 0)

Then make votes a one-column matrix and perform matrix multiplication:

correctVotes <- resultsMat %*% matrix(votes, nrow = 5)
correctVotes

     [,1]
[1,]    4
[2,]    3
[3,]    5
[4,]    5
[5,]    5
[6,]    5

Think about how to encapsulate all of this into a nice, general simulation function.

Domain-specific languages (DSLs for short) stand in contrast to general-purpose programming languages, which are designed to solve a wide variety of problems. Examples of important general-purpose languages include C and C++, Java, Python and Ruby. Although R is by now one of the most widely-used DSLs in the world, there are a number of other important ones, including Matlab and Octave for scientific computing, Emacs Lisp for the renowned Emacs editor, and SQL for querying databases. JavaScript is an interesting case: it started out as a DSL for web browsers, but has since expanded to power many web applications and is now being used to develop desktop applications as well.
As an example outside of programming, consider what happens when you read a piece of literature “for structure.” You begin by asking: “What kind of literature is this? Is it drama, a novel, or something else?” The answer lets you know what to expect as you read: if it’s a novel, you know to suspend disbelief, whereas if it’s a journalistic piece then you know to examine critically whatever it presents as fact. Next, you might outline the piece. When you make an outline, you are breaking the piece up into parts, and indicating how the parts relate to each other to advance the plot and/or message of the piece. Note that in the process of “reading for structure” you are following the pattern of the definition of structure offered above.
