10.4 Subsetting with dplyr
The dplyr function filter() is the row-wise counterpart of select(): it picks out rows of a data frame (or a similar object such as a tibble), whereas select() picks out columns.
Thus you can use the two functions together to perform subsetting. With the pipe operator, your code can be quite easy to read:
survey %>%
  filter((sex == "male" & height > 70) | (sex == "female" & height < 55)) %>%
  select(sex, height, fastest)
## # A tibble: 22 x 3
## sex height fastest
## <fct> <dbl> <int>
## 1 male 76 119
## 2 male 74 110
## 3 male 72 95
## 4 male 70.8 100
## 5 male 79 160
## 6 male 73 110
## 7 male 73 120
## 8 female 54 130
## 9 male 74 119
## 10 male 72 125
## # … with 12 more rows
Note that dplyr data-functions like filter() and select() take a data table as their first argument, and return a data table as well. Hence they may be chained together as we saw in the above example.
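To see why the chaining works, here is a minimal sketch. It assumes the built-in mtcars data frame as a stand-in (it is not the survey data used in this chapter), and the column names are those of mtcars. The piped version and the nested version should give the same result, because each function receives a data table and hands one back.

```r
library(dplyr)

# mtcars ships with R; it stands in for survey here (illustrative assumption)
piped <- mtcars %>%
  filter(cyl == 4 & mpg > 30) %>%
  select(mpg, cyl, hp)

# The same computation written as nested calls, without the pipe:
# filter() runs first, and its result becomes the first argument of select()
nested <- select(filter(mtcars, cyl == 4 & mpg > 30), mpg, cyl, hp)

# Compare the two results
identical(piped, nested)
```

The nested form has to be read from the inside out, which is the main reason the piped form is usually easier to follow.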
With select() it’s easy to leave out columns, too:
survey %>%
  select(-ideal_ht, -love_first)
## # A tibble: 71 x 10
## height sleep fastest weight_feel extra_life seat GPA enough_Sleep sex
## <dbl> <dbl> <int> <fct> <fct> <fct> <dbl> <fct> <fct>
## 1 76 9.5 119 1_underwei… yes 1_fr… 3.56 no male
## 2 74 7 110 2_about_ri… yes 2_mi… 2.5 no male
## 3 64 9 85 2_about_ri… no 2_mi… 3.8 no fema…
## 4 62 7 100 1_underwei… no 1_fr… 3.5 no fema…
## 5 72 8 95 1_underwei… yes 3_ba… 3.2 no male
## 6 70.8 10 100 3_overweig… no 1_fr… 3.1 yes male
## 7 70 4 85 2_about_ri… yes 1_fr… 3.68 no male
## 8 79 6 160 2_about_ri… yes 3_ba… 2.7 yes male
## 9 59 7 90 2_about_ri… yes 3_ba… 2.8 no fema…
## 10 67 7 90 3_overweig… no 2_mi… NA yes fema…
## # … with 61 more rows, and 1 more variable: diff.ideal.act. <dbl>
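The same idea can be tried on a small data frame that is easy to inspect by eye. The sketch below uses a made-up data frame called toy (a hypothetical name with invented values, not part of the survey data) and drops two of its columns in two equivalent ways.

```r
library(dplyr)

# A small made-up data frame (hypothetical values, not from survey)
toy <- data.frame(
  id     = 1:3,
  height = c(70, 64, 72),
  weight = c(150, 130, 180),
  notes  = c("a", "b", "c")
)

# Drop two columns with minus signs, as in the example above
dropped <- toy %>%
  select(-weight, -notes)

# An equivalent spelling that groups the unwanted columns together
dropped2 <- toy %>%
  select(-c(weight, notes))

# Only the columns that were not excluded remain
names(dropped)
```

Excluding columns is handy when a table has many columns and you want to keep nearly all of them.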
10.4.1 Practice Exercises
1. Can you use the pipe to chain dplyr functions along with nrow() to find out how many people in survey believe in love at first sight and drove more than 120 miles per hour?
2. Find the three largest heights of the males who drove more than 120 miles per hour.
3. Use the pipe and filter() to make violin plots of the wages of men and women in CPS85, where the outlier-person (whose wage was more than 40 dollars per hour) has been eliminated prior to making the graph.
10.4.2 Solutions to Practice Exercises
Try this:
survey %>%
  filter(love_first == "yes" & fastest > 120) %>%
  nrow()
## [1] 3
Here’s one way:
survey %>%
  filter(sex == "male" & fastest > 120) %>%
  .$height %>%                 # this is just a vector
  sort(decreasing = TRUE) %>%  # so you can sort it ...
  .[1:3]                       # then get its first three elements
## [1] 79 75 75
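Another possible route stays inside dplyr a little longer, sorting the rows with arrange() and desc() before extracting the column with pull(). The sketch below is only an alternative under stated assumptions: toy is a hypothetical stand-in for survey, and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for survey, with made-up values
toy <- data.frame(
  sex     = c("male", "male", "female", "male", "male", "male"),
  height  = c(79, 75, 66, 72, 75, 70),
  fastest = c(160, 130, 125, 110, 140, 150)
)

top3 <- toy %>%
  filter(sex == "male" & fastest > 120) %>%
  arrange(desc(height)) %>%   # tallest first
  head(3) %>%                 # keep the first three rows
  pull(height)                # extract the column as a plain vector

top3
```

Here the data stays a data table until the final step, so other columns would still be available if you wanted to inspect them before calling pull().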
Try this code:
CPS85 %>%
  filter(wage <= 40) %>%
  ggplot(aes(x = sex, y = wage)) +
  geom_violin(fill = "burlywood")