7.6 Subsetting Data Frames

Our study of sub-setting matrices can be applied to the selection of parts of a data frame. As with a vector, one or both of the dimensions of the frame can come into play.

We can create a new data frame consisting of any columns we like from the original frame:

df <- m111survey[, c("height", "ideal_ht")]
head(df)
##   height ideal_ht
## 1   76.0       78
## 2   74.0       76
## 3   64.0       NA
## 4   62.0       65
## 5   72.0       72
## 6   70.8       NA

If we select just one column, then the result is a vector rather than a data frame:

df <- m111survey[, "height"]
is.vector(df)
## [1] TRUE

If for some reason you want to prevent this, set drop to FALSE:

df <- m111survey[, "height", drop =FALSE]
head(df)
##   height
## 1   76.0
## 2   74.0
## 3   64.0
## 4   62.0
## 5   72.0
## 6   70.8

You may select particular rows, too:

m111survey[10:15, c("height", "ideal_ht")]
##    height ideal_ht
## 10     67       67
## 11     65       69
## 12     62       62
## 13     59       62
## 14     78       75
## 15     69       72

You can even select some of the rows at random. Here is a random sample of size six:

n <- nrow(m111survey)
df <- m111survey[sample(1:n, size = 6, replace = FALSE), ]
df[c("sex", "seat")]  # show just two columns
##       sex     seat
## 13 female  1_front
## 54   male 2_middle
## 56   male   3_back
## 28 female  1_front
## 53 female   3_back
## 46 female 2_middle

Note the function nrow() that gives the number of rows of the frame. When we sample six items without replacement from the vector 1:n, we are picking six numbers at random from the row-numbers of the vector. Specifying these six numbers in the selection operator [ yields the desired random sample of rows.

7.6.1 Boolean Expressions

It is especially common to select rows by the values of a logical vector. For example, to select the rows where the fast speed ever driven is at least 150 miles per hour, try this:

df <- m111survey[m111survey$fastest >= 150, ]
df[, c("sex", "fastest")]  # show just two of the variables
##     sex fastest
## 8  male     160
## 32 male     190

When you are selecting rows it can be convenient to use the subset() function. The first argument to the function is the frame from which you plan to select, and the second is the Boolean expression by which to select:

df <- subset(m111survey, fastest >= 150)
df[, c("sex", "fastest")] 
##     sex fastest
## 8  male     160
## 32 male     190

Note that we did not need to type m111survey$fastest: the first argument to subset() provides the environment in which to search for names that appear in the Boolean expression.

The Boolean sub-setting expressions can be quite complex:

df <- subset(m111survey, seat == "3_back" & height < 72 & sex == "female")
df[, c("sex", "height", "seat")]
##       sex height   seat
## 9  female     59 3_back
## 20 female     65 3_back
## 30 female     69 3_back
## 53 female     69 3_back
## 70 female     65 3_back

Note: subset() takes a third parameter called select that allows you to pick out any desired columns. For example:

subset(m111survey, seat == "3_back" & height < 72 & sex == "female",
       select = c("sex", "height", "seat"))
##       sex height   seat
## 9  female     59 3_back
## 20 female     65 3_back
## 30 female     69 3_back
## 53 female     69 3_back
## 70 female     65 3_back

7.6.2 Practice Exercises

We’ll use the CPS85 data frame from the mosaicData package. You should go ahead and load the package and then read about the data frame:

library(mosaicData)
?CPS85

Each row in the data frame corresponds to an employee in the survey.

  1. Write a command that gives the number of employees in the data frame.

  2. Select the employees who are between 40 and 50 years old.

  3. Select the employees who are married and have fewer than 30 years of experience.

  4. Select the nonunion employees who either live in the South or who have more than 12 years of education (or both).

  5. Select the employees who work in the clerical, construction, management or professional sector.

  6. Select the employees who make more than 30 dollars per hour, and keep only their wage, sex and sector of employment

  7. Select 10 employees at random, keeping only their wage and sex.

  8. Select all of the employees, keeping all information about them except for their union status and whether or not they are from the South.

7.6.3 Solutions to Practice Exercises

  1. The command is nrow(CPS85).

  2. Try this:

    subset(CPS85, age > 40 & age < 50)
  3. Try this:

    subset(CPS85, married == "Married" & exper < 30)
  4. Try this:

    subset(CPS85, union == "Not" & (south == "S" | educ > 12))
  5. Try this:

    subset(CPS85, sector %in% c("clerical", "construction",
                                "management", "professional"))
  6. Try this:

    CPS85[CPS85$wage > 30, c("wage", "sex", "sector")]
  7. Try this:

    CPS85[sample(1:nrow(CPS85), size = 10, replace = FALSE),
          c("wage", "sex")]
  8. Try this (south and union are columns 6 and 9, respectively):

    CPS85[ , -c(6, 9)]

    The select parameter of the subset() function has a little known feature that allows you to specify columns to omit by name, so the following is another solution:

    subset(CPS85, select = -c(south, union))