10.6 Grouping and Summaries
The next two dplyr data-functions are useful for generating numerical summaries of data.
Consider, for example, CPS85
. We know from graphical studies that the men in the study are paid more than women, but how might we verify this fact numerically? One approach would be to separate the men and the women into two different groups and compute the mean wage for each group. This is accomplished by calling group_by()
and summarise()
in succession:
%>%
CPS85 group_by(sex) %>%
summarize(meanWage = mean(wage))
## # A tibble: 2 x 2
## sex meanWage
## <fct> <dbl>
## 1 F 7.88
## 2 M 9.99
It’s possible to create more than one summary variable in a single call to summarise()
, for example:
%>%
CPS85 group_by(sex) %>%
summarize(meanWage = mean(wage),
n = n())
## # A tibble: 2 x 3
## sex meanWage n
## <fct> <dbl> <int>
## 1 F 7.88 245
## 2 M 9.99 289
In the previous example, dplyr::n()
was used to count the number of cases in each group.
For a more complete account of a numerical variable, one might consider the five-number summary:
- the minimum value
- the first quartile (Q1)
- the median
- the third quartile (Q3)
- the maximum value
These quantities are conveniently computed by R’s fivenum()
function:
%>%
CPS85 $wage %>%
.fivenum()
## [1] 1.00 5.25 7.78 11.25 44.50
Let’s find the five number summaries for the wages of men and women:
%>%
CPS85 group_by(sex) %>%
summarise(n = n(),
min = fivenum(wage)[1],
Q1 = fivenum(wage)[2],
median = fivenum(wage)[3],
Q3 = fivenum(wage)[4],
max = fivenum(wage)[5])
## # A tibble: 2 x 7
## sex n min Q1 median Q3 max
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 245 1.75 4.75 6.8 10 44.5
## 2 M 289 1 6 8.93 13 26.3
It’s also possible to group by more than one variable at a time. For example, suppose that we wish to compare the wages of men and women in the various sectors of employment. All we need to do is group by both sex
and sector
:
%>%
CPS85 group_by(sector, sex) %>%
summarise(n = n(),
min = fivenum(wage)[1],
Q1 = fivenum(wage)[2],
median = fivenum(wage)[3],
Q3 = fivenum(wage)[4],
max = fivenum(wage)[5])
## # A tibble: 15 x 8
## # Groups: sector [8]
## sector sex n min Q1 median Q3 max
## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 clerical F 76 3 5.1 7 9.55 15.0
## 2 clerical M 21 3.35 6 7.69 9 12
## 3 const M 20 3.75 7.15 9.75 11.8 15
## 4 manag F 21 3.64 6.88 10 11.2 44.5
## 5 manag M 34 1 8.8 14.0 18.2 26.3
## 6 manuf F 24 3 4.36 4.9 6.05 18.5
## 7 manuf M 44 3.35 6.58 8.94 11.2 22.2
## 8 other F 6 3.75 4 5.62 6.88 8.93
## 9 other M 62 2.85 5.25 7.5 11.2 26
## 10 prof F 52 4.35 7.02 10 12.3 25.0
## 11 prof M 53 5 8 12 16.4 25.0
## 12 sales F 17 3.35 3.8 4.55 5.65 14.3
## 13 sales M 21 3.5 5.56 9.42 12.5 20.0
## 14 service F 49 1.75 3.75 5 8 13.1
## 15 service M 34 2.01 4.15 5.89 8.75 25
Note that there were no women in the construction sector, so that group did not appear in the summary.
10.6.1 Note on Binding
Keep in mind that you can always “save” the results of any computation by binding them to a variable name, thus:
<-
sexSector %>%
CPS85 group_by(sector, sex) %>%
summarise(n = n(),
min = fivenum(wage)[1],
Q1 = fivenum(wage)[2],
median = fivenum(wage)[3],
Q3 = fivenum(wage)[4],
max = fivenum(wage)[5])
class(sexSector)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Note that the result has data.frame
as one of its classes, so you may extract components in any of the ways you have learned. The old ways, for instance, are fine:
# minimum wage among male professionals:
with(sexSector, min[sex == "M" & sector == "prof"])
## [1] 5
10.6.2 Practice Exercises
These exercises deal with flight data from the nycflights13 data frame:
data("flights", package = "nycflights13")
The
flights
table gives information about each departure in the year 2013 from one of the three major airports near New York City: John F. Kennedy (JFK), LaGuardia (LGA) or Newark (EWR). The airport from which the plane departed is recorded in the variableorigin
. The variabledep_delay
gives the delay in departure, in minutes. (This is a negative number if the plane left early). Find the number of departures and the mean departure delay for each of the three airports. (Note thatdep_delay
for cancelled flights will beNA
.)The variable
distance
gives the distance, in miles, between an origin and destination airport. For July 26, 2013, make a violin plot of the distances traveled by the departing planes from the each of the three New York airports. Use the pipe andfilter()
to takeflights
into the desired plot.Examine the plot you made in the previous problem: two of the flights appear to be about 5000 miles. Use the pipe,
filter()
andselect()
to display the origin, destination and distance for these two flights.
10.6.3 Solutions to Practice Exercises
Flights that were cancelled have
NA
for their departure delay, so we need to filter out these cases first, in order to correctly count the number of flights that actually left the airport. Try this:%>% flights filter(!is.na(dep_delay)) %>% group_by(origin) %>% summarise(departures = n(), meanDelay = mean(dep_delay))
## # A tibble: 3 x 3 ## origin departures meanDelay ## <chr> <int> <dbl> ## 1 EWR 117596 15.1 ## 2 JFK 109416 12.1 ## 3 LGA 101509 10.3
Try this:
%>% flights filter(month == 6 & day == 26) %>% ggplot(aes(x = origin, y = distance)) + geom_violin(fill = "burlywood") + geom_jitter(width = 0.25, size = 0.1)
Try this:
%>% flights filter(month == 6 & day == 26 & distance > 4000) %>% select(origin, dest, distance)
## # A tibble: 2 x 3 ## origin dest distance ## <chr> <chr> <dbl> ## 1 JFK HNL 4983 ## 2 EWR HNL 4963