7.6 Subsetting Data Frames
Our study of sub-setting matrices can be applied to the selection of parts of a data frame. As with a vector, one or both of the dimensions of the frame can come into play.
We can create a new data frame consisting of any columns we like from the original frame:
<- m111survey[, c("height", "ideal_ht")]
df head(df)
## height ideal_ht
## 1 76.0 78
## 2 74.0 76
## 3 64.0 NA
## 4 62.0 65
## 5 72.0 72
## 6 70.8 NA
If we select just one column, then the result is a vector rather than a data frame:
<- m111survey[, "height"]
df is.vector(df)
## [1] TRUE
If for some reason you want to prevent this, set drop
to FALSE
:
<- m111survey[, "height", drop =FALSE]
df head(df)
## height
## 1 76.0
## 2 74.0
## 3 64.0
## 4 62.0
## 5 72.0
## 6 70.8
You may select particular rows, too:
10:15, c("height", "ideal_ht")] m111survey[
## height ideal_ht
## 10 67 67
## 11 65 69
## 12 62 62
## 13 59 62
## 14 78 75
## 15 69 72
You can even select some of the rows at random. Here is a random sample of size six:
<- nrow(m111survey)
n <- m111survey[sample(1:n, size = 6, replace = FALSE), ]
df c("sex", "seat")] # show just two columns df[
## sex seat
## 13 female 1_front
## 54 male 2_middle
## 56 male 3_back
## 28 female 1_front
## 53 female 3_back
## 46 female 2_middle
Note the function nrow()
that gives the number of rows of the frame. When we sample six items without replacement from the vector 1:n
, we are picking six numbers at random from the row-numbers of the vector. Specifying these six numbers in the selection operator [
yields the desired random sample of rows.
7.6.1 Boolean Expressions
It is especially common to select rows by the values of a logical vector. For example, to select the rows where the fast speed ever driven is at least 150 miles per hour, try this:
<- m111survey[m111survey$fastest >= 150, ]
df c("sex", "fastest")] # show just two of the variables df[,
## sex fastest
## 8 male 160
## 32 male 190
When you are selecting rows it can be convenient to use the subset()
function. The first argument to the function is the frame from which you plan to select, and the second is the Boolean expression by which to select:
<- subset(m111survey, fastest >= 150)
df c("sex", "fastest")] df[,
## sex fastest
## 8 male 160
## 32 male 190
Note that we did not need to type m111survey$fastest
: the first argument to subset()
provides the environment in which to search for names that appear in the Boolean expression.
The Boolean sub-setting expressions can be quite complex:
<- subset(m111survey, seat == "3_back" & height < 72 & sex == "female")
df c("sex", "height", "seat")] df[,
## sex height seat
## 9 female 59 3_back
## 20 female 65 3_back
## 30 female 69 3_back
## 53 female 69 3_back
## 70 female 65 3_back
Note: subset()
takes a third parameter called select
that allows you to pick out any desired columns. For example:
subset(m111survey, seat == "3_back" & height < 72 & sex == "female",
select = c("sex", "height", "seat"))
## sex height seat
## 9 female 59 3_back
## 20 female 65 3_back
## 30 female 69 3_back
## 53 female 69 3_back
## 70 female 65 3_back
7.6.2 Practice Exercises
We’ll use the CPS85
data frame from the mosaicData package. You should go ahead and load the package and then read about the data frame:
library(mosaicData)
?CPS85
Each row in the data frame corresponds to an employee in the survey.
Write a command that gives the number of employees in the data frame.
Select the employees who are between 40 and 50 years old.
Select the employees who are married and have fewer than 30 years of experience.
Select the nonunion employees who either live in the South or who have more than 12 years of education (or both).
Select the employees who work in the clerical, construction, management or professional sector.
Select the employees who make more than 30 dollars per hour, and keep only their wage, sex and sector of employment
Select 10 employees at random, keeping only their wage and sex.
Select all of the employees, keeping all information about them except for their union status and whether or not they are from the South.
7.6.3 Solutions to Practice Exercises
The command is
nrow(CPS85)
.Try this:
subset(CPS85, age > 40 & age < 50)
Try this:
subset(CPS85, married == "Married" & exper < 30)
Try this:
subset(CPS85, union == "Not" & (south == "S" | educ > 12))
Try this:
subset(CPS85, sector %in% c("clerical", "construction", "management", "professional"))
Try this:
$wage > 30, c("wage", "sex", "sector")] CPS85[CPS85
Try this:
sample(1:nrow(CPS85), size = 10, replace = FALSE), CPS85[c("wage", "sex")]
Try this (
south
andunion
are columns 6 and 9, respectively):-c(6, 9)] CPS85[ ,
The
select
parameter of thesubset()
function has a little known feature that allows you to specify columns to omit by name, so the following is another solution:subset(CPS85, select = -c(south, union))