10 Basic Tidyverse Concepts
In this chapter we will introduce a few tools from the tidyverse set of R-packages:
- the pipe operator %>% for chaining function calls in a convenient and readable way;
- the tibble class, a variant of the data frame that is especially suitable for large data sets;
- data manipulation functions from the dplyr package suitable for use with the pipe operator:
  - filter() and select() for sub-setting;
  - mutate() for transforming variables;
  - group_by() and summarise() for numerical summaries of data.
10.1 The Tidyverse
The tidyverse isn’t a package, exactly—it’s a collection of packages. Go ahead and attach it:
library(tidyverse)
You’ll get an account of the packages that have been attached. We have worked before with ggplot2, and by the end of CSC 215 we will have worked with all of the others. You need not worry about the fact that filter() and lag() mask functions from the stats package.
10.2 The magrittr Pipe Operator
In Section 6.8.4 you met R’s native pipe operator |>. The tidyverse uses another pipe operator, %>%, from the magrittr package.¹
The keyboard shortcut for %>% is:
- Ctrl+Shift+M on Windows/Linux, or
- Cmd+Shift+M on macOS.
Like the native R pipe operator, %>% connects two function calls by making the value returned by the first call the first argument of the second call. Here’s an example:
"hello" %>% rep(times = 4)
[1] "hello" "hello" "hello" "hello"
This is the same as the more familiar:
rep("hello", times = 4)
[1] "hello" "hello" "hello" "hello"
Here’s another example:
# same as nrow(bcscr::m111survey)
bcscr::m111survey %>% nrow()
[1] 71
Here are two pipes in succession:
"hello" %>% rep(times = 4) %>% length()
[1] 4
By default the value of the left-hand call is piped into the right-hand call as the first argument. You can make it some other argument by using the dot (.) as a placeholder, for example:
4 %>% rep("hello", times = .)
[1] "hello" "hello" "hello" "hello"
(Recall that the placeholder for the native R pipe is the underscore _. The two placeholders are not interchangeable: use . with %>% and _ with |>.)
Since sub-setting is actually a function call under the hood, you can use the dot there, too:
# gets the third element of the sequence 1, 5, 9, ..., 97:
seq(1, 100, by = 4) %>% .[3]
[1] 9
The pipe operator isn’t all that useful when you only use it once or twice in succession. Its true value becomes apparent in the chaining together of many manipulations involving data frames.
10.2.1 Practice Exercises
10.3 Tibbles
The tibble package gives us tibbles, which are very nearly the same thing as data frames. Indeed, the name “tibble” is supposed to remind us of a data “table.”
Consider the class of bcscr::m111survey:
class(bcscr::m111survey)
[1] "data.frame"
Yep, it’s a data frame. But we can convert it to a tibble, as follows:
survey <- as_tibble(bcscr::m111survey)
class(survey)
[1] "tbl_df" "tbl" "data.frame"
You can treat tibbles like data frames. For now the primary practical difference is manifest when you print a tibble to the Console:
The output is automatically truncated, and the number of columns printed is determined by the width of your screen. This is a great convenience when one is dealing with larger data sets.
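As a rough illustration of this printing behavior, the sketch below converts the built-in data frame mtcars to a tibble and prints it. The choice of mtcars is an assumption made only so that the code runs without extra data packages; it stands in for m111survey.

```r
library(tibble)

# mtcars ships with R; it is used here only for illustration
cars_tbl <- as_tibble(mtcars)

# a tibble with more than 20 rows prints only its first 10 rows by default,
# along with its dimensions and the type of each column
print(cars_tbl)

# mtcars has 32 rows; printing the plain data frame would show all of them
nrow(mtcars)
```

If you want to see more rows of a tibble, you can ask for them explicitly, for instance with print(cars_tbl, n = 20).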
Many larger data tables in packages will come to you as tibbles.
10.4 Subsetting with dplyr
The dplyr function filter() is the rough equivalent of base R’s subset(): it picks out rows of a data frame (or similar objects such as a tibble). The dplyr function select() subsets for columns.
Thus you can use the two functions together to perform sub-setting. With the pipe operator, your code can be quite easy to read:
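Here is a minimal sketch of filter() and select() used together. The table students and its values are hypothetical, made up so that the example is self-contained; its column names loosely imitate a class survey.

```r
library(dplyr)
library(tibble)

# hypothetical toy data (not taken from m111survey)
students <- tibble(
  name    = c("Ann", "Bo", "Cy", "Di", "Ed"),
  sex     = c("female", "male", "male", "female", "male"),
  height  = c(64, 70, 68, 66, 73),
  fastest = c(95, 110, 85, 100, 120)
)

males_fastest <- students %>%
  filter(sex == "male") %>%   # keep only the rows for males
  select(name, fastest)       # keep only these two columns

males_fastest   # three rows (Bo, Cy, Ed) and two columns
```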
Note that dplyr data-functions like filter() and select() take a data table as their first argument, and return a data table as well. Hence they may be chained together as we saw in the above example.
With select() it’s easy to leave out columns, too:
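One way to do this is to put a minus sign in front of each column you want to drop. The sketch below uses a hypothetical toy table, so the names and values are assumptions for illustration only.

```r
library(dplyr)
library(tibble)

# hypothetical toy data
students <- tibble(
  name    = c("Ann", "Bo", "Cy"),
  sex     = c("female", "male", "male"),
  height  = c(64, 70, 68),
  fastest = c(95, 110, 85)
)

# drop name and fastest, keeping every other column
trimmed <- students %>%
  select(-name, -fastest)

names(trimmed)   # the remaining columns are sex and height
```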
10.4.1 Practice Exercises
10.5 Transforming Variables with dplyr
In dplyr you transform variables with the function mutate(). Here is an example:
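A minimal sketch, assuming a hypothetical table of heights recorded in inches: mutate() adds a new column that gives the same heights in centimeters.

```r
library(dplyr)
library(tibble)

# hypothetical toy data
heights <- tibble(
  name      = c("Ann", "Bo", "Cy"),
  height_in = c(64, 70, 68)
)

# new variable name on the left of =, expression on the right
heights_cm <- heights %>%
  mutate(height_cm = height_in * 2.54)

heights_cm   # the original columns plus the new column height_cm
```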
In mutate() there is always a variable name on the left-hand side of the = sign. It could be the same as an existing variable in the table if you are content to overwrite that variable. On the right-hand side of the = is an expression that can depend on variables in the data table.
You can transform more than one variable in a single call to mutate().
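The sketch below creates two variables in one call, again with hypothetical toy data. Note that the second expression uses the variable created by the first one; mutate() evaluates its arguments in order, so this works.

```r
library(dplyr)
library(tibble)

# hypothetical toy data
heights <- tibble(
  name      = c("Ann", "Bo", "Cy"),
  height_in = c(64, 70, 68)
)

heights2 <- heights %>%
  mutate(
    height_cm = height_in * 2.54,   # convert inches to centimeters
    tall      = height_cm > 170     # uses the column created just above
  )

heights2
```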
10.5.1 Practice Exercises
10.6 Grouping and Summaries
The next two dplyr data-functions are useful for generating numerical summaries of data.
Consider, for example, CPS85. We know from graphical studies that the men in the study are paid more than women, but how might we verify this fact numerically? One approach would be to separate the men and the women into two different groups and compute the mean wage for each group. This is accomplished by calling group_by() and summarise() in succession:
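The pattern can be sketched as follows. To keep the example self-contained, it uses a small hypothetical table wages instead of CPS85, so the numbers are made up; only the column names sex and wage follow the text.

```r
library(dplyr)
library(tibble)

# hypothetical toy data standing in for CPS85
wages <- tibble(
  sex  = c("F", "F", "F", "M", "M", "M"),
  wage = c(6, 8, 10, 7, 11, 15)
)

mean_by_sex <- wages %>%
  group_by(sex) %>%                   # one group per value of sex
  summarise(mean_wage = mean(wage))   # one row of output per group

mean_by_sex   # mean_wage is 8 for F and 11 for M
```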
It’s possible to create more than one summary variable in a single call to summarise(), for example:
In the previous example, dplyr::n() was used to count the number of cases in each group.
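A minimal sketch with a hypothetical toy table (the values are assumptions chosen so that the results are easy to check by hand):

```r
library(dplyr)
library(tibble)

# hypothetical toy data standing in for CPS85
wages <- tibble(
  sex  = c("F", "F", "F", "M", "M", "M", "M"),
  wage = c(6, 8, 10, 7, 11, 15, 19)
)

wage_summary <- wages %>%
  group_by(sex) %>%
  summarise(
    n         = n(),          # number of cases in the group
    mean_wage = mean(wage),
    max_wage  = max(wage)
  )

wage_summary   # F: n = 3, mean 8, max 10; M: n = 4, mean 13, max 19
```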
For a more complete account of a numerical variable, one might consider the five-number summary:
- the minimum value
- the first quartile (Q1)
- the median
- the third quartile (Q3)
- the maximum value
These quantities are conveniently computed by R’s fivenum() function:
CPS85 %>%
  .$wage %>%
  fivenum()
[1] 1.00 5.25 7.78 11.25 44.50
Let’s find the five number summaries for the wages of men and women:
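One possible approach is to call fivenum() inside summarise() and pick out each of its five values by position. The sketch below uses a hypothetical toy table in place of CPS85, so the values are assumptions.

```r
library(dplyr)
library(tibble)

# hypothetical toy data standing in for CPS85
wages <- tibble(
  sex  = rep(c("F", "M"), each = 5),
  wage = c(4, 6, 8, 10, 12, 5, 9, 13, 17, 21)
)

five_by_sex <- wages %>%
  group_by(sex) %>%
  summarise(
    min    = fivenum(wage)[1],
    Q1     = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3     = fivenum(wage)[4],
    max    = fivenum(wage)[5]
  )

five_by_sex   # F: 4, 6, 8, 10, 12; M: 5, 9, 13, 17, 21
```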
It’s also possible to group by more than one variable at a time. For example, suppose that we wish to compare the wages of men and women in the various sectors of employment. All we need to do is group by both sex and sector:
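The idea can be sketched with a hypothetical toy table jobs in place of CPS85. The values are made up, and the table deliberately contains no women in the "const" sector. The argument .groups = "drop" is used here so that the result comes back ungrouped.

```r
library(dplyr)
library(tibble)

# hypothetical toy data standing in for CPS85
jobs <- tibble(
  sex    = c("F", "F", "M", "M", "M", "M"),
  sector = c("clerical", "prof", "clerical", "const", "const", "prof"),
  wage   = c(7, 12, 8, 9, 11, 14)
)

by_sector_sex <- jobs %>%
  group_by(sector, sex) %>%   # one group per observed combination
  summarise(
    n         = n(),
    mean_wage = mean(wage),
    .groups   = "drop"
  )

by_sector_sex   # five rows: only combinations present in the data appear
```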
Note that there were no women in the construction sector, so that group did not appear in the summary.
10.6.1 Note on Binding
Keep in mind that you can always “save” the results of any computation by binding them to a variable name, thus:
Note that the result has data.frame as one of its classes, so you may extract components in any of the ways you have learned. The old ways, for instance, are fine:
sexSector <-
  CPS85 %>%
  group_by(sector, sex) %>%
  summarise(
    n = n(),
    min = fivenum(wage)[1],
    Q1 = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3 = fivenum(wage)[4],
    max = fivenum(wage)[5]
  )
# minimum wage among male professionals:
with(sexSector, min[sex == "M" & sector == "prof"])
[1] 5
10.6.2 Practice Exercises
These exercises deal with the flights data table from the nycflights13 package.
10.7 R’s Native Pipe
The magrittr pipe became so popular that the R Core Team decided to include a pipe operator in R’s base package. It looks like this: |>.
For our purposes, R’s “native” pipe can be used interchangeably with %>%, except that the placeholder is indicated by the underscore (_) instead of a period.
A few examples:
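The sketch below repeats some of the earlier %>% examples with the native pipe. It assumes R 4.2 or later, since the _ placeholder is not available in older versions, and note that the placeholder must be passed to a named argument.

```r
# requires R 4.1 or later for |>, and R 4.2 or later for the _ placeholder

# the left-hand value becomes the first argument of the call on the right
greeting <- "hello" |> rep(times = 4)

# with the placeholder, the value goes to the named argument instead
greeting2 <- 4 |> rep("hello", times = _)

# chaining works just as it does with %>%
total <- c(1, 4, 9) |> sqrt() |> sum()   # 1 + 2 + 3

greeting
total
```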
Exercises
Exercise 1
Use the pipe operator to rewrite the following command in three ways:
runif(10, min = 0, max = 5)
Exercise 2
Rewrite the following command using two pipe operators in succession:
paste("hello", "there", "Bella")
[1] "hello there Bella"
Exercise 3
Use the pipe operator and dplyr functions to rewrite the following command:
head(subset(m111survey, sex == "female")[, c("height", "fastest")], 6)
Exercise 4
This and the next exercise are about the babynames data frame from the babynames package.
Find the names for females born in 2015 that were given to more than 1% of female applicants (i.e., prop is bigger than 0.01).
Exercise 5
Use the pipe operator and dplyr functions to produce the following graph of the popularity of “Mary” and “Mia” as girl-names over the years. Note that popularity is given as number per one thousand applicants, i.e., as prop * 1000.
Note carefully that:
- we want only the girls who have the above names;
- the y-axis shows the number of girl-babies with the given name per 1000 girls born that year—not the absolute number and not the proportion.
Hint: Review Practice 10.6.
¹ magrittr is not attached by the tidyverse, but much of the capability of this package is imported by dplyr, which is one of the tidyverse packages. Historically, the magrittr pipe was the inspiration for the “native” R pipe |>.