10.4 Subsetting with dplyr

The dplyr function filter() is the rough equivalent of base R's subset(): it picks out rows of a data frame (or of a similar object such as a tibble). The dplyr function select() picks out columns.
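To make the row/column distinction concrete, here is a minimal sketch on a made-up data frame. The object toy and its columns are invented for illustration and are not part of survey.

```r
library(dplyr)

# A made-up data frame, invented purely for illustration
toy <- data.frame(
  name   = c("Ann", "Bo", "Cy", "Di"),
  height = c(62, 71, 68, 75),
  sleep  = c(7, 6, 8, 5)
)

# filter() keeps some rows: same columns, fewer rows
tall <- filter(toy, height > 70)

# select() keeps some columns: same rows, fewer columns
narrow <- select(toy, name, sleep)

dim(tall)    # 2 rows, 3 columns
dim(narrow)  # 4 rows, 2 columns
```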

Thus you can use the two functions together to perform subsetting. With the pipe operator, your code can be quite easy to read:

survey %>% 
  filter((sex == "male" & height > 70) | (sex =="female" & height < 55)) %>% 
  select(sex, height, fastest)
## # A tibble: 22 x 3
##    sex    height fastest
##    <fct>   <dbl>   <int>
##  1 male     76       119
##  2 male     74       110
##  3 male     72        95
##  4 male     70.8     100
##  5 male     79       160
##  6 male     73       110
##  7 male     73       120
##  8 female   54       130
##  9 male     74       119
## 10 male     72       125
## # … with 12 more rows

Note that dplyr functions such as filter() and select() take a data table as their first argument and return a data table as well. This is why they can be chained together, as in the example above.
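One way to see why chaining works is to write the same computation as nested function calls. The sketch below uses a small made-up data frame (toy is invented for illustration, with column names borrowed from survey) and checks that the piped and nested versions give the same result.

```r
library(dplyr)

# A made-up data frame, invented purely for illustration
toy <- data.frame(
  sex     = c("male", "female", "male", "female"),
  height  = c(72, 54, 68, 66),
  fastest = c(110, 130, 95, 100)
)

# Piped version: each step receives the previous result as its first argument
piped <- toy %>%
  filter(height > 60) %>%
  select(sex, fastest)

# Nested version: the same calls, read from the inside out
nested <- select(filter(toy, height > 60), sex, fastest)

# Both routes should produce the same data frame
identical(piped, nested)
```

The nested form has to be read from the inside out, which is the main reason the piped form tends to be easier to follow.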

With select() it’s easy to leave out columns, too:

survey %>% 
  select(-ideal_ht, -love_first)
## # A tibble: 71 x 10
##    height sleep fastest weight_feel extra_life seat    GPA enough_Sleep sex  
##     <dbl> <dbl>   <int> <fct>       <fct>      <fct> <dbl> <fct>        <fct>
##  1   76     9.5     119 1_underwei… yes        1_fr…  3.56 no           male 
##  2   74     7       110 2_about_ri… yes        2_mi…  2.5  no           male 
##  3   64     9        85 2_about_ri… no         2_mi…  3.8  no           fema…
##  4   62     7       100 1_underwei… no         1_fr…  3.5  no           fema…
##  5   72     8        95 1_underwei… yes        3_ba…  3.2  no           male 
##  6   70.8  10       100 3_overweig… no         1_fr…  3.1  yes          male 
##  7   70     4        85 2_about_ri… yes        1_fr…  3.68 no           male 
##  8   79     6       160 2_about_ri… yes        3_ba…  2.7  yes          male 
##  9   59     7        90 2_about_ri… yes        3_ba…  2.8  no           fema…
## 10   67     7        90 3_overweig… no         2_mi… NA    yes          fema…
## # … with 61 more rows, and 1 more variable: diff.ideal.act. <dbl>
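As a brief sketch of the same idea on made-up data (toy and its columns are invented for illustration), the columns to drop can be negated one at a time or grouped with c() and negated together. Both forms should leave the same columns behind.

```r
library(dplyr)

# A made-up data frame, invented purely for illustration
toy <- data.frame(
  id    = 1:3,
  label = c("x", "y", "z"),
  score = c(2.5, 3.5, 4.5),
  flag  = c(TRUE, FALSE, TRUE)
)

# Drop columns one at a time ...
one_by_one <- toy %>%
  select(-label, -flag)

# ... or group them with c() and negate the whole group
grouped <- toy %>%
  select(-c(label, flag))

names(one_by_one)               # the remaining columns: id and score
identical(one_by_one, grouped)  # the two forms agree
```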

10.4.1 Practice Exercises

  1. Can you use the pipe to chain dplyr functions along with nrow() to find out how many people in survey believe in love at first sight and drove more than 120 miles per hour?

  2. Find the three largest heights of the males who drove more than 120 miles per hour.

  3. Use the pipe and filter() to make violin plots of the wages of men and women in CPS85, where the outlier-person (whose wage was more than 40 dollars per hour) has been eliminated prior to making the graph.

10.4.2 Solutions to Practice Exercises

  1. Try this:

    survey %>% 
      filter(love_first == "yes" & fastest > 120) %>% 
      nrow()
    ## [1] 3

  2. Here’s one way:

    survey %>% 
      filter(sex == "male" & fastest > 120) %>%
      .$height %>%                 # this is just a vector
      sort(decreasing = TRUE) %>%  # so you can sort it ... 
      .[1:3]                       # then get its first three elements
    ## [1] 79 75 75

  3. Try this code:

    CPS85 %>% 
      filter(wage <= 40) %>% 
      ggplot(aes(x = sex, y = wage)) +
        geom_violin(fill = "burlywood")
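The solution to Exercise 2 leaves dplyr partway through and finishes with base R vector operations. As an alternative sketch, the same result can be reached with dplyr functions alone, using arrange() with desc() to sort, slice() to keep the first three rows, and pull() to extract the column as a vector. The data frame toy below is made up for illustration; with the real data you would start from survey instead.

```r
library(dplyr)

# A made-up data frame, invented purely for illustration
toy <- data.frame(
  sex     = c("male", "male", "female", "male", "male", "male"),
  height  = c(70, 75, 64, 79, 75, 68),
  fastest = c(125, 130, 140, 160, 121, 100)
)

top3 <- toy %>%
  filter(sex == "male" & fastest > 120) %>%
  arrange(desc(height)) %>%  # tallest first
  slice(1:3) %>%             # keep the first three rows
  pull(height)               # extract the column as a plain vector

top3
```

Whether this reads better than the version with sort() is a matter of taste. The point is only that the whole computation can stay within one family of functions.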