10.6 Grouping and Summaries

The next two dplyr data-functions are useful for generating numerical summaries of data.

Consider, for example, CPS85. We know from graphical studies that the men in the study are paid more than women, but how might we verify this fact numerically? One approach would be to separate the men and the women into two different groups and compute the mean wage for each group. This is accomplished by calling group_by() and summarise() in succession:

CPS85 %>% 
  group_by(sex) %>% 
  summarize(meanWage = mean(wage))
## # A tibble: 2 x 2
##   sex   meanWage
##   <fct>    <dbl>
## 1 F         7.88
## 2 M         9.99

It’s possible to create more than one summary variable in a single call to summarise(), for example:

CPS85 %>% 
  group_by(sex) %>% 
  summarize(meanWage = mean(wage),
            n = n())
## # A tibble: 2 x 3
##   sex   meanWage     n
##   <fct>    <dbl> <int>
## 1 F         7.88   245
## 2 M         9.99   289

In the previous example, dplyr::n() was used to count the number of cases in each group.

For a more complete account of a numerical variable, one might consider the five-number summary:

  • the minimum value
  • the first quartile (Q1)
  • the median
  • the third quartile (Q3)
  • the maximum value

These quantities are conveniently computed by R’s fivenum() function:

CPS85 %>% 
  .$wage %>% 
  fivenum()
## [1]  1.00  5.25  7.78 11.25 44.50

Let’s find the five number summaries for the wages of men and women:

CPS85 %>%
  group_by(sex) %>% 
  summarise(n = n(),
            min = fivenum(wage)[1],
            Q1 = fivenum(wage)[2],
            median = fivenum(wage)[3],
            Q3 = fivenum(wage)[4],
            max = fivenum(wage)[5])
## # A tibble: 2 x 7
##   sex       n   min    Q1 median    Q3   max
##   <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 F       245  1.75  4.75   6.8     10  44.5
## 2 M       289  1     6      8.93    13  26.3

It’s also possible to group by more than one variable at a time. For example, suppose that we wish to compare the wages of men and women in the various sectors of employment. All we need to do is group by both sex and sector:

CPS85 %>% 
  group_by(sector, sex) %>% 
  summarise(n = n(),
            min = fivenum(wage)[1],
            Q1 = fivenum(wage)[2],
            median = fivenum(wage)[3],
            Q3 = fivenum(wage)[4],
            max = fivenum(wage)[5])
## # A tibble: 15 x 8
## # Groups:   sector [8]
##    sector   sex       n   min    Q1 median    Q3   max
##    <fct>    <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1 clerical F        76  3     5.1    7     9.55 15.0 
##  2 clerical M        21  3.35  6      7.69  9    12   
##  3 const    M        20  3.75  7.15   9.75 11.8  15   
##  4 manag    F        21  3.64  6.88  10    11.2  44.5 
##  5 manag    M        34  1     8.8   14.0  18.2  26.3 
##  6 manuf    F        24  3     4.36   4.9   6.05 18.5 
##  7 manuf    M        44  3.35  6.58   8.94 11.2  22.2 
##  8 other    F         6  3.75  4      5.62  6.88  8.93
##  9 other    M        62  2.85  5.25   7.5  11.2  26   
## 10 prof     F        52  4.35  7.02  10    12.3  25.0 
## 11 prof     M        53  5     8     12    16.4  25.0 
## 12 sales    F        17  3.35  3.8    4.55  5.65 14.3 
## 13 sales    M        21  3.5   5.56   9.42 12.5  20.0 
## 14 service  F        49  1.75  3.75   5     8    13.1 
## 15 service  M        34  2.01  4.15   5.89  8.75 25

Note that there were no women in the construction sector, so that group did not appear in the summary.

10.6.1 Note on Binding

Keep in mind that you can always “save” the results of any computation by binding them to a variable name, thus:

sexSector <-
  CPS85 %>% 
  group_by(sector, sex) %>% 
  summarise(n = n(),
            min = fivenum(wage)[1],
            Q1 = fivenum(wage)[2],
            median = fivenum(wage)[3],
            Q3 = fivenum(wage)[4],
            max = fivenum(wage)[5])
class(sexSector)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Note that the result has data.frame as one of its classes, so you may extract components in any of the ways you have learned. The old ways, for instance, are fine:

# minimum wage among male professionals:
with(sexSector, min[sex == "M" & sector == "prof"])
## [1] 5

10.6.2 Practice Exercises

These exercises deal with flight data from the nycflights13 data frame:

data("flights", package = "nycflights13")
  1. The flights table gives information about each departure in the year 2013 from one of the three major airports near New York City: John F. Kennedy (JFK), LaGuardia (LGA) or Newark (EWR). The airport from which the plane departed is recorded in the variable origin. The variable dep_delay gives the delay in departure, in minutes. (This is a negative number if the plane left early). Find the number of departures and the mean departure delay for each of the three airports. (Note that dep_delay for cancelled flights will be NA.)

  2. The variable distance gives the distance, in miles, between an origin and destination airport. For July 26, 2013, make a violin plot of the distances traveled by the departing planes from the each of the three New York airports. Use the pipe and filter() to take flights into the desired plot.

  3. Examine the plot you made in the previous problem: two of the flights appear to be about 5000 miles. Use the pipe, filter() and select() to display the origin, destination and distance for these two flights.

10.6.3 Solutions to Practice Exercises

  1. Flights that were cancelled have NA for their departure delay, so we need to filter out these cases first, in order to correctly count the number of flights that actually left the airport. Try this:

    flights %>% 
      filter(!is.na(dep_delay)) %>% 
      group_by(origin) %>% 
      summarise(departures = n(),
                meanDelay = mean(dep_delay))
    ## # A tibble: 3 x 3
    ##   origin departures meanDelay
    ##   <chr>       <int>     <dbl>
    ## 1 EWR        117596      15.1
    ## 2 JFK        109416      12.1
    ## 3 LGA        101509      10.3
  2. Try this:

    flights %>% 
      filter(month == 6 & day == 26) %>% 
      ggplot(aes(x = origin, y = distance)) +
        geom_violin(fill = "burlywood") +
        geom_jitter(width = 0.25, size = 0.1)
  3. Try this:

    flights %>% 
      filter(month == 6 & day == 26 & distance > 4000) %>% 
      select(origin, dest, distance)
    ## # A tibble: 2 x 3
    ##   origin dest  distance
    ##   <chr>  <chr>    <dbl>
    ## 1 JFK    HNL       4983
    ## 2 EWR    HNL       4963