14 Functional Programming in R

It was simple, but you know, it’s always simple when you’ve done it.

—Simone Gabbriellini

In this Chapter we aren’t going to cover any fundamentally new R powers. Instead we’ll get acquainted with just one aspect of a computer programming paradigm known as functional programming. We will examine a set of R-functions for which functions themselves are supplied as arguments. These functions allow us to accomplish a great deal of computation in rather concise and expressive code. Not only are they useful in R itself, but they help you to reason abstractly about computation and prepare you for functional-programming aspects of other programming languages.

14.1 Programming Paradigms

Let us begin by exploring the notion of a programming paradigm in general. We will go on in this Chapter to consider two programming paradigms for which R provides considerable support. In the next Chapter we will consider a third programming paradigm that exists in R.

A programming paradigm is a way to describe some of the features of programming languages. Often a paradigm includes principles concerning the use of these features, or embodies a view that these features have special importance and utility in good programming practice.

14.1.1 Procedural Programming

One of the older programming paradigms in existence is procedural programming. It is supported in many popular languages and is often the first paradigm within which beginners learn to program. In fact, if one’s programming does not progress beyond a rudimentary level, one may never become aware that one is working within the procedural paradigm—or any paradigm at all, for that matter.

Before we define procedural programming, let’s illustrate it with an example. Almost any of the programs we have written so far would do as examples; for specificity, let’s consider the following snippet of code that produces from the data frame m111survey a new, smaller frame consisting of just the numerical variables:

# find the number of columns in the data frame:
cols <- length(names(m111survey))
# set up a logical vector of length equal to the number of columns:
is_numerical <- logical(cols)

# loop through.  For each variable, say if it is numerical:
for (i in seq_along(is_numerical)) {
  is_numerical[i] <- is.numeric(m111survey[, i])
}

# pick the numerical variables from the data frame
num_summ_111 <- m111survey[, is_numerical]
# have a look at the result:
str(num_summ_111)
## 'data.frame':    71 obs. of  6 variables:
##  $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
##  $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
##  $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
##  $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
##  $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
##  $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...

By now there is nothing mysterious about the above code-snippet. What we want to become conscious of is the approach we have taken to the problem of selecting the numerical variables. In particular, observe that:

  • We worked throughout with data, some of which, like m111survey, was given to us and some of which we created on our own to help solve the problem. For example, we created the variable cols. Note also the very helpful index-variable i in the for-loop. We set up the data structure is_numerical in order to hold a set of data (TRUEs and FALSEs).
  • We relied on various procedures to create data and to manipulate that data in order to produce the desired result. Some of the procedures appeared as special blocks of code—most notably the for-loop. Other procedures took the form of functions. As we know, a function encapsulates a useful procedure so that it can be easily reused in a wide variety of circumstances, without the user having to know the details of how it works. We know that names() will give us the vector of names of the columns of m111survey, that length() will tell us how many names there are, that is.numeric() will tell us whether or not a given variable in m111survey is a numerical variable, and so on. The procedures embodied in these functions were written by other folks and we could examine them if we had the time and interest, but for the most part we are content simply to know how to access them.

Procedural programming is a paradigm that solves problems with programs that can be broken up into collections of variables, data structures and procedures. In this paradigm, there is a sharp distinction between variables and data structures on the one hand and procedures on the other. Variables and data structures are data—they are the “stuff” that a program manipulates to produce other data, other “stuff.” Procedures do the manipulating, turning stuff into other stuff.

14.2 The Functional Programming Paradigm

Let us now turn to the second of the two major programming paradigms that we study in this Chapter: Functional Programming.

14.2.1 The Ubiquity of Functions in R

Let’s look a bit more closely at our code snippet. Notice how prominently functions figure into it, on nearly every line. In fact, every line calls at least one function! This might seem unbelievable: after all, consider the line below:

num_summ_111 <- m111survey[, is_numerical]

There don’t appear to be any functions being called, here! But in fact two functions get called:

  1. The so-called assignment operator <- is actually a function in disguise: the more official—albeit less readable—form of variable <- value is:

    `<-`(variable, value)

    Thus, to assign the value 3 to the variable a one could write:

    `<-`(a, 3)
    a   # check that a is really 3
    ## [1] 3
  2. The sub-setting operator for vectors [, more formally known as extraction (see help(Extract)), is also a function. The expression m111survey[, is_numerical] is actually the following function-call in disguise (the empty argument stands for the omitted row index):

    `[`(m111survey, , is_numerical)

Indeed functions are ubiquitous in R. This is part of the significance of the following well-known remark by a developer of S, the precursor-language of R:

“To understand computations in R, two slogans are helpful:

  • Everything that exists is an object.
  • Everything that happens is a function call.”

—John Chambers

The second slogan indicates that functions are everywhere in R. It also corresponds to the first principle of the functional programming paradigm, namely:

Computation is regarded as the evaluation of functions.
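To see the principle at work on a small scale, here is a hedged sketch (the vector v and the names five, second and size are made up for illustration). It shows that arithmetic, sub-setting and even an if-expression can all be written as ordinary function calls:

```r
# Operators are functions: backticks let us call them in prefix form.
five <- `+`(2, 3)            # same as 2 + 3

v <- c(10, 20, 30)           # illustrative vector
second <- `[`(v, 2)          # same as v[2]

# Even control flow is a function call:
# this is the same as: if (v[1] > 5) "big" else "small"
size <- `if`(v[1] > 5, "big", "small")

c(five, second)
size
```

Nobody writes everyday code this way, but the prefix forms make it plain that something like 2 + 3 is a function call underneath.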

14.2.2 Functions as First-Class Citizens

So functions are ubiquitous in R. Another interesting thing about them is that even though they seem to be associated with procedures—after all, they make things happen—they are, nevertheless, also objects. They are data, or “stuff” if you like.

This may not seem obvious at first. But look at the following code, where you can ask what type of thing a function is:

typeof(is.numeric)
## [1] "builtin"

The so-called “primitive” functions of R—the functions written not in R but in C-code—are “built in” objects. On the other hand, consider this user-defined function:

f <- function(x) x+3
typeof(f)
## [1] "closure"

Functions other than primitive functions are objects of type “closure.”

If a function can be a certain type of thing, then it must be a “thing”—an object, something you can manipulate. For example, you can put functions in a list:

lst <- list(is.numeric, f)
lst
## [[1]]
## function (x)  .Primitive("is.numeric")
## 
## [[2]]
## function(x) x+3

Very importantly, you can make functions serve as arguments for other functions, and functions can return other functions as their results. The following example demonstrates both of these possibilities.

cuber <- function(f) {
  g <- function(x) f(x)^3
  g
}
h <- cuber(abs)
h(-2)  # returns |-2|^3 = 2^3 = 8
## [1] 8

In fact, in R functions can be treated just like any variable. In computer programming, we say that such functions are first-class citizens.

Although it is not often stated as a separate principle of the functional programming paradigm, the following principle holds in languages that provide support for functional programming:

Functions are first-class citizens.
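As one more hedged illustration of first-class status, the sketch below stores three functions in a named list and then applies each of them to the same data. The names summaries, x and results are invented for this example; sapply() is a base-R function.

```r
# A named list whose elements are functions:
summaries <- list(smallest = min, middle = median, largest = max)

x <- c(4, 9, 2, 7, 5)        # illustrative data

# Call every function in the list on x; each f is one of the stored functions.
results <- sapply(summaries, function(f) f(x))
results
```

The list of functions is handled exactly like a list of numbers would be: it is indexed, iterated over, and passed as an argument.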

14.2.3 Minimize Side Effects

In the code-snippet under consideration, we note that there are two types of functions:

  • functions that return a value;
  • functions that provide output to the console or make a change in the Global Environment.

Examples of the first type of function included length(), names(), logical(), seq_along() and is.numeric().

A function that produced output to the console was str().

The assignment function `<-`() added cols, is_numerical and num_summ_111 to the Global Environment, and also made changes to is_numerical in the course of the for-loop.

Of course we have seen examples of functions that do both of these things at once, for example:

my_fun <- function(x) {
  cat("my_fun is running!\n")  # output to console
  x + 3                       # return a value
}
my_fun(6)
## my_fun is running!
## [1] 9

In computer programming, output to the console, along with changes of state—changes to the Global Environment or to the file structure of your computer—are called side-effects. Functions that only return values and do not produce side-effects are called pure functions.

A third principle of the functional programming paradigm is:

Functions should be pure.

Now this principle is difficult to adhere to, and in fact if you were to adhere strictly to it in R then your programs would never “do” anything. There do exist quite practical programming languages in which all of the functions are pure—and this leads to some very interesting features such as that the order in which operations are evaluated doesn’t affect what the function returns—but these “purely functional” languages manage purity by having other objects besides functions produce the necessary side-effects. In R we happily let our functions have side-effects: we certainly want to do some assignment, and print things out to the console from time to time.

One way that R does support the third principle of functional programming is that it makes it easy to avoid having your functions modify the Global Environment. To see this consider the following example:

add_three <- function(x) {
  heavenly_hash <- 5
  x+3  # returns this value
}
result <- add_three(10)
result
## [1] 13
heavenly_hash
## Error: object 'heavenly_hash' not found

This is as we expect: the variable heavenly_hash exists only in the run-time environment that is created in the call to add_three(). As soon as the function finishes execution that environment dies, and heavenly_hash dies along with it. In particular, it never becomes part of the Global Environment.

If you really want a function to modify the Global Environment—or any environment other than its run-time environment, for that matter—then you have to take special measures. You could, for example, use the super-assignment operator <<-:

add_three_side_effect <- function(x) {
  heavenly_hash <<- 5
  x+3  # returns this value
}
result <- add_three_side_effect(10)
result
## [1] 13
heavenly_hash
## [1] 5

The super-assignment operator looks for the name heavenly_hash in the parent environment of the run-time environment. If it finds heavenly_hash there then it changes its value to 5 and stops. Otherwise it looks in the next parent up, and so on until it reaches the Global Environment, at which point if it doesn’t find a heavenly_hash it creates one and gives it the value 5. In the example above, assuming you defined the function at the console, the parent environment is the Global Environment and the function has made a change to it: a side-effect.
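The search through parent environments is easiest to see when a function is defined inside another function. In the sketch below (make_counter, count and counter are hypothetical names), <<- finds count in the environment where the inner function was created, so the change is made there and not in the Global Environment:

```r
make_counter <- function() {
  count <- 0                 # lives in the run-time environment of make_counter()
  function() {
    count <<- count + 1      # found in the parent environment and modified there
    count
  }
}

counter <- make_counter()
first <- counter()           # count goes from 0 to 1
second <- counter()          # count goes from 1 to 2
c(first, second)
```

Because the returned function still refers to the environment in which it was created, that environment survives between calls, and count keeps its value from one call to the next.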

Except in the case of explicit assignment functions like `<-`(), changes made by functions to the Global Environment can be quite problematic. After all, we are used to using functions without having to look inside them to see how they do their work. Even if we once wrote the function ourselves, we may not remember how it works, so if it creates side effects we may not remember that it does, and calling it could interfere with other important work that the program is doing. (If the program already has heavenly_hash in the Global Environment and then we call a function that changes its value, we could be in for big trouble.) Accordingly, R supports the third principle of functional programming to the extent of making it easy for you to avoid function calls that change your Global Environment.

14.2.4 Procedures as Higher-Order Function Calls

The last principle of the functional programming paradigm that we will state here isn’t really a formal principle: it is really more an indication of the programming style that prevails in languages where functions are first-class objects and that provide other support for functional programming. The final principle is:

As much as possible, procedures should be accomplished by function calls. In particular, loops should be replaced by calls to higher-order functions.

A higher-order function is simply a function that takes other functions as arguments. R provides a nice set of higher-order functions, many of which substitute for iterative procedures such as loops. In subsequent sections we will study some of the most important higher-order functions, and see how they allow us to express some fairly complex procedures in a concise and readable way. You will also see how this style really blurs the distinction—so fundamental to procedural programming—between data and procedures. In functional programming, functions ARE data, and procedures are just function calls.
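As a preview, here is one way the loop from Section 14.1.1 might be rewritten as a single higher-order call. This is only a sketch: it uses base R's vapply() and the built-in iris data frame (rather than m111survey) so that it runs without any extra packages, and num_iris is a made-up name.

```r
# vapply() applies is.numeric() to each column of the data frame
# and collects the answers in a logical vector (one value per column).
is_numerical <- vapply(iris, is.numeric, FUN.VALUE = logical(1))

# pick the numerical variables from the data frame:
num_iris <- iris[, is_numerical]
names(num_iris)
```

The index variable, the pre-allocated vector and the loop body have all disappeared; what remains says only which function to apply and to what.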

14.2.5 Functional Programming: A Summary

For our purposes, the principles of the functional programming paradigm are as follows:

  • Computation consists in the evaluation of functions.
  • Functions are first-class citizens in the language.
  • Functions should only return values; they should not produce side-effects. (At the very least they should not modify the Global Environment unless they are dedicated to assignment in the first place.)
  • As much as possible, procedures should be written in terms of function calls. In particular, loops should be replaced by calls to higher-order functions.

14.3 purrr Higher-Order Functions for Iteration

In the remainder of the Chapter we will study important higher-order functions: functions that take a function as an argument and apply that function to each element of another data structure. As we have said previously, such functions often serve as alternatives to loops.

The higher-order functions we study come from the package purrr, which is attached whenever we load the tidyverse.

14.3.1 map() and Variations

Suppose that we want to generate five vectors, each of which consists of ten numbers randomly chosen between 0 and 1. We accomplish the task with a loop, as follows:

# set up a list of length 5:
lst <- vector(mode = "list", length = 5)
for (i in 1:5) {
  lst[[i]] <- runif(10)
}
str(lst)
## List of 5
##  $ : num [1:10] 0.271 0.189 0.267 0.956 0.473 ...
##  $ : num [1:10] 0.5677 0.9629 0.5131 0.0181 0.7333 ...
##  $ : num [1:10] 0.268 0.477 0.263 0.107 0.608 ...
##  $ : num [1:10] 0.2411 0.3267 0.0647 0.1426 0.5102 ...
##  $ : num [1:10] 0.364 0.524 0.604 0.119 0.835 ...

If we wanted the vectors to have lengths 1, 4, 9, 16, and 25, then we could write:

lst <- vector(mode = "list", length = 5)
for (i in 1:5) {
  lst[[i]] <- runif(i^2)
}
str(lst)
## List of 5
##  $ : num 0.647
##  $ : num [1:4] 0.394 0.619 0.477 0.136
##  $ : num [1:9] 0.06738 0.12915 0.39312 0.00258 0.62021 ...
##  $ : num [1:16] 0.409 0.54 0.961 0.654 0.547 ...
##  $ : num [1:25] 0.96407 0.07147 0.95581 0.94798 0.00119 ...

In the first example, the elements in the vector 1:5 didn’t matter—we wanted a vector of length ten each time—and in the second case the elements in 1:5 did matter, in that they determined the lengths of the five vectors produced. Of course in general we could apply runif() to each element of any vector at all, like this:

vec <- c(5, 7, 8, 2, 9)
lst <- vector(mode = "list", length = length(vec))
for (i in seq_along(vec)) {
  lst[[i]] <- runif(vec[i])
}
str(lst)
## List of 5
##  $ : num [1:5] 0.647 0.394 0.619 0.477 0.136
##  $ : num [1:7] 0.06738 0.12915 0.39312 0.00258 0.62021 ...
##  $ : num [1:8] 0.826 0.423 0.409 0.54 0.961 ...
##  $ : num [1:2] 0.1968 0.0779
##  $ : num [1:9] 0.818 0.942 0.884 0.166 0.355 ...

If we can apply runif() to each element of a vector, why not apply an arbitrary function to each element? That’s what the function map() will do for us. The general form of map() is:

map(.x, .f, ...)

In the template above:

  • .x can be a list or any atomic vector;
  • .f is a function that is to be applied to each element of .x. In the default operation of map(), each element of .x becomes in turn the first argument of .f.
  • ... consists of other arguments that are supplied as arguments for the .f function, in case you have to set other parameters of the function in order to get it to perform in the way you would like.

The result is always a list.
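Here is a small deterministic sketch of the template, assuming the purrr package is installed (the list scores and the name means are made up). Each element of .x is passed to mean() as its first argument, and na.rm = TRUE travels through the ... to every call:

```r
library(purrr)

scores <- list(c(1, NA, 3), c(4, 5))   # illustrative input

# .x = scores, .f = mean, and na.rm = TRUE is passed along via ...
means <- map(scores, mean, na.rm = TRUE)
str(means)
```

Without na.rm = TRUE the first element of the result would be NA, which shows that the extra argument really does reach every call to mean().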

With map() we can get the list in our third example as follows:

how_many <- c(5, 7, 8, 2, 9)
lst <- 
  how_many %>%
  map(runif)
str(lst)
## List of 5
##  $ : num [1:5] 0.647 0.394 0.619 0.477 0.136
##  $ : num [1:7] 0.06738 0.12915 0.39312 0.00258 0.62021 ...
##  $ : num [1:8] 0.826 0.423 0.409 0.54 0.961 ...
##  $ : num [1:2] 0.1968 0.0779
##  $ : num [1:9] 0.818 0.942 0.884 0.166 0.355 ...

If we had wanted the random numbers to be between—say—4 and 8, then we would supply extra arguments to runif() as follows:

lst <- 
  how_many %>%
  map(runif, min = 4, max = 8)
str(lst)
## List of 5
##  $ : num [1:5] 6.59 5.58 6.47 5.91 4.54
##  $ : num [1:7] 4.27 4.52 5.57 4.01 6.48 ...
##  $ : num [1:8] 7.3 5.69 5.64 6.16 7.84 ...
##  $ : num [1:2] 4.79 4.31
##  $ : num [1:9] 7.27 7.77 7.54 4.66 5.42 ...

The default behavior of map() is that the .x vector supplies the first argument of .f. However, if some ... parameters are supplied then .x substitutes for the first parameter that is not mentioned in .... In the above example, the min and max parameters are the second and third parameters for runif() so .x substitutes for the first parameter—the one that determines how many random numbers will be generated. In the example below, the vector lower_bounds substitutes for min, the second parameter of runif():

lower_bounds <- 1:3
lower_bounds %>%
  map(runif, n = 2, max = 8)
## [[1]]
## [1] 3.008575 5.964877
## 
## [[2]]
## [1] 6.514459 6.282105
## 
## [[3]]
## [1] 6.688118 7.431739
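Because runif() is random, the matching rule can be easier to verify with a deterministic function. In the sketch below, power() is a hypothetical helper written for this illustration. Since base is named in the ..., each element of .x fills the parameter that is left over, namely exponent:

```r
library(purrr)

# hypothetical helper with two parameters:
power <- function(base, exponent) base^exponent

# base is fixed at 2, so 1, 2 and 3 are used in turn as the exponent:
powers_of_two <- map(1:3, power, base = 2)
str(powers_of_two)
```

The results are 2, 4 and 8 (that is, 2^1, 2^2 and 2^3); had the elements of .x been matched to base instead, we would have seen 1, 4 and 9.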

Sometimes we wish to vary two or more of the parameters of a function. In that case we use pmap(). The first parameter of pmap() is named .l and takes a list of vectors (or lists). For example:

how_many <- c(3,1,4)
upper_bounds <- c(1, 5, 10)
list(how_many, upper_bounds) %>%
  pmap(runif, min = 0)
## [[1]]
## [1] 0.4142409 0.5140328 0.6190231
## 
## [[2]]
## [1] 0.193564
## 
## [[3]]
## [1] 3.0889976 7.4509632 3.6170952 0.2348365

Observe that pmap() knows to interpret the first element of the input-list (the vector how_many) as giving the values of the first argument of runif(). The second parameter of runif() (min) is set at 0, so pmap() deduces that upper_bounds, the second element of the input-list, gives the values for the next parameter in line, the parameter max.

One might just as well use pmap() to vary all three parameters:

how_many <- c(3,1,4)
lower_bounds <- c(-5, 0, 5)
upper_bounds <- c(0, 5, 10)
args <- list(how_many, lower_bounds, upper_bounds)
args %>%
  pmap(runif) %>% 
  str()
## List of 3
##  $ : num [1:3] -0.00743 -2.35882 -3.90104
##  $ : num 1.01
##  $ : num [1:4] 5.38 9.16 8.45 9.67
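Relying on the order of the list elements can be fragile. When the elements of the input-list are named, pmap() matches them to the parameters of .f by name instead, as the following sketch suggests (repeat_string() is a hypothetical helper, named_args and repeated are made-up names, and strrep() is a base-R function):

```r
library(purrr)

# hypothetical helper: repeat a string a given number of times
repeat_string <- function(string, times) strrep(string, times)

# the list elements are named, and deliberately given "out of order":
named_args <- list(times = c(2, 3), string = c("ab", "c"))

repeated <- pmap(named_args, repeat_string)
str(repeated)
```

Even though times comes first in the list, the strings are still repeated the intended number of times, because the matching was done by name.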

The .f parameter can be any function, including one that you define yourself. Here’s an example:

r_letters <- function(n, upper) {
  if (upper) {
    sample(LETTERS, size = n, replace = TRUE)
  } else {
    sample(letters, size = n, replace = TRUE)
  }
}
# vary number of letters to pick
sample_sizes <- c(3, 6, 9)   
# vary the case (upper, lower)
uppercase <- c(TRUE, FALSE, TRUE)  
list(sample_sizes, uppercase) %>% 
  pmap(r_letters)
## [[1]]
## [1] "O" "G" "A"
## 
## [[2]]
## [1] "x" "s" "p" "m" "n" "u"
## 
## [[3]]
## [1] "R" "I" "Q" "M" "D" "E" "Y" "M" "O"

You could also set .f to be a function that you write on the spot, without even bothering to give it a name:

c(1, 3, 5) %>% 
  map(function(x) runif(3, min = 0, max = x))
## [[1]]
## [1] 0.2002036 0.2269999 0.5505148
## 
## [[2]]
## [1] 1.250093 0.430766 2.521417
## 
## [[3]]
## [1] 4.861459 2.029991 2.718963

In computer programming a function is called anonymous when it is not the value bound to some name.

map() allows a shortcut for defining anonymous functions. The above call could have been written as:

c(1, 3, 5) %>% 
  map(~ runif(3, min = 0, max = .))
## [[1]]
## [1] 0.4939934 0.8951945 0.7801631
## 
## [[2]]
## [1] 2.5637547 0.6993537 2.7956469
## 
## [[3]]
## [1] 4.790131 4.959646 2.544537

The ~ indicates that the body of the function is about to begin. The . stands for the parameter of the function.
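For comparison, here are three ways to write the same anonymous function, shown with a deterministic body so that the results can be checked against each other. Inside a purrr formula the parameter may be written as . or as .x; the \(x) shorthand is base-R syntax and assumes R version 4.1 or later. The names nums, with_function, with_formula and with_lambda are made up for this sketch.

```r
library(purrr)

nums <- c(1, 3, 5)

with_function <- map(nums, function(x) x * 2)   # full anonymous function
with_formula  <- map(nums, ~ .x * 2)            # purrr formula shortcut
with_lambda   <- map(nums, \(x) x * 2)          # base-R shorthand (R >= 4.1)

# all three produce the same list:
identical(with_function, with_formula) && identical(with_formula, with_lambda)
```

Which form to use is largely a matter of taste; the formula shortcut is the most compact, while the other two make the parameter's name explicit.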

When we introduced map() we said that .x was a vector or a list. In fact .x could be any object that can be coerced into a list. Hence it is quite common to use map() with data frames: the frame is turned into a list, each element of which is a column of the frame. Here is an example:

data("m111survey", package = "bcscr")
number_na <-
  m111survey %>% 
  map(~ sum(is.na(.)))
str(number_na)
## List of 12
##  $ height         : int 0
##  $ ideal_ht       : int 2
##  $ sleep          : int 0
##  $ fastest        : int 0
##  $ weight_feel    : int 0
##  $ love_first     : int 0
##  $ extra_life     : int 0
##  $ seat           : int 0
##  $ GPA            : int 1
##  $ enough_Sleep   : int 0
##  $ sex            : int 0
##  $ diff.ideal.act.: int 2

Note that the elements of the returned list inherit the names of the input data frame. This holds for any named input:

numbers <- c(1, 3, 5)
names(numbers) <- c("one", "three", "five")
numbers %>% 
  map(~runif(3, min = 0, max = .))
## $one
## [1] 0.20446027 0.01439487 0.95556547
## 
## $three
## [1] 1.918565 1.987142 2.199431
## 
## $five
## [1] 4.372549 4.764371 2.637053

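When the result can take on a form simpler than a list, it is possible to use variants of map() such as:

  • map_lgl(), which returns a logical vector;
  • map_int(), which returns an integer vector;
  • map_dbl(), which returns a numeric (double) vector;
  • map_chr(), which returns a character vector.

Thus we could obtain a named integer vector of the number of NA-values for each variable in m111survey as follows: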

number_na <-
  m111survey %>% 
  map_int(~ sum(is.na(.)))
number_na
##          height        ideal_ht           sleep         fastest 
##               0               2               0               0 
##     weight_feel      love_first      extra_life            seat 
##               0               0               0               0 
##             GPA    enough_Sleep             sex diff.ideal.act. 
##               1               0               0               2

Here are the types of each variable:

m111survey %>% 
  map_chr(typeof)
##          height        ideal_ht           sleep         fastest 
##        "double"        "double"        "double"       "integer" 
##     weight_feel      love_first      extra_life            seat 
##       "integer"       "integer"       "integer"       "integer" 
##             GPA    enough_Sleep             sex diff.ideal.act. 
##        "double"       "integer"       "integer"        "double"

Here is a statement of whether or not each variable is a factor:

m111survey %>% 
  map_lgl(is.factor)
##          height        ideal_ht           sleep         fastest 
##           FALSE           FALSE           FALSE           FALSE 
##     weight_feel      love_first      extra_life            seat 
##            TRUE            TRUE            TRUE            TRUE 
##             GPA    enough_Sleep             sex diff.ideal.act. 
##           FALSE            TRUE            TRUE           FALSE

14.3.2 walk() and Variations

walk() is similar to map(), but is used when we are interested in producing side-effects. It applies its .f argument to each element of the .x it was given, but also returns .x (invisibly) in case we want to pipe it into some other function.
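A minimal sketch of both aspects, assuming purrr is installed (the name returned is made up): the call below prints one line per element as a side-effect, and the value that comes back is simply the original input.

```r
library(purrr)

# cat() is called once per element (the side-effect);
# walk() then hands back its input unchanged.
returned <- walk(1:3, function(i) cat("visiting", i, "\n"))

identical(returned, 1:3)
```

Since the input comes back unchanged, a call to walk() can sit in the middle of a pipeline without disturbing the data flowing through it.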

Here we use walk() only for its side-effect: we re-write a familiar function to print a pattern to the Console without using a loop.

pattern <- function(char = "*", n = 5) {
  line_length <- c(1:n, (n-1):1)
  the_line <- function(char, n) {
    cat(rep(char, times = n), "\n", sep = "")
  }
  line_length %>% walk(the_line, char = char)
}

pattern(char = "a", n = 7)
## a
## aa
## aaa
## aaaa
## aaaaa
## aaaaaa
## aaaaaaa
## aaaaaa
## aaaaa
## aaaa
## aaa
## aa
## a

The next example illustrates the use of the return-value of walk(). We would like to save plots of all numerical variables from the data frame m111survey, and also print summaries of them to the Console.

First we create a directory to hold the plots:

if ( !dir.exists("plots") ) dir.create("plots")

Next, we get the numerical variables in m111survey:

numericals <-
  m111survey %>% 
  keep(is.numeric)   # purrr::keep()

We used purrr::keep(), which retains only those elements of its input .x for which its second argument .p (a function that returns a single TRUE or FALSE) returns TRUE.

We will also need the names of the numerical variables:

num_names <-
  numericals %>% 
  names()

We need a function to save the density plot of a single numerical variable:

save_graph <- function(var, varname) {
  p <-
    ggplot(data = NULL, aes(x = var)) +
    geom_density(fill = "burlywood") +
    labs(title = paste0(
      "Density plot for ",
      varname, ".")
    )
  ggsave(
    filename = paste0("plots/density_", varname, ".png"),
    plot = p, device = "png"
  )
}

We also need a function to produce a summary of a single numerical variable:

make_summary <- function(x, varname) {
  five <- fivenum(x, na.rm = TRUE)
  list(
    variable = varname,
    min = five[1],
    Q1 = five[2],
    median = five[3],
    Q3 = five[4],
    max = five[5]
  )
}

Now we walk through the process. We will actually use the function pwalk(), which will take the following inputs:

  • .l (a list with two elements: the data frame of numerical variables and the vector of the names of these variables), and
  • .f (the function save_graph, to make and save a density plot)

We also use pmap_dfr(), which takes a list consisting of the data frame and variable-names and constructs a data frame row-by-row, with each row summarizing one of the variables.

list(numericals, num_names) %>% 
  pwalk(save_graph) %>%  # returns the list
  pmap_dfr(make_summary)
## # A tibble: 6 x 6
##          variable   min    Q1  median     Q3   max
##             <chr> <dbl> <dbl>   <dbl>  <dbl> <dbl>
## 1          height  51.0  65.0  68.000  71.75    79
## 2        ideal_ht  54.0  67.0  68.000  75.00    90
## 3           sleep   2.0   5.0   7.000   7.00    10
## 4         fastest  60.0  90.5 102.000 119.50   190
## 5             GPA   1.9   2.9   3.225   3.56     4
## 6 diff.ideal.act.  -4.0   0.0   2.000   3.00    18

Check the plots directory; it should contain these files:

  • density_diff.ideal.act..png (the dot is doubled because the variable name itself ends in a period)
  • density_fastest.png
  • density_GPA.png
  • density_height.png
  • density_ideal_ht.png
  • density_sleep.png

14.3.3 Example: Flowery Meadow Redux

In Section 4.3.3.1 we simulated people walking through a meadow, picking flowers until they had picked a desired number of flowers of a desired color. In Section 9.3.3 we used lists to store the results of such a simulation. Now we’ll see how to store the results as a data frame.

First, we modify the helper-function that simulates one person picking flowers so that, instead of returning a vector of colors, it returns a data frame:

## colors in the field:
flower_colors <- c("blue", "red", "pink", "crimson", "orange")
## new helper-function:
walk_meadow_df <- function(person, color, wanted) {
  picking <- TRUE
  ## the following will be extended to hold the flowers picked:
  flowers_picked <- character()
  desired_count <- 0
  while (picking) {
    picked <- sample(flower_colors, size = 1)
    flowers_picked <- c(flowers_picked, picked)
    if (picked == color) desired_count <- desired_count + 1
    if (desired_count == wanted) picking <- FALSE
  }
  ## return a data frame:
  data.frame(
    person = rep(person, times = length(flowers_picked)),
    color = flowers_picked
  )
}

Note that the new function takes an extra parameter person, the name of the person picking the flowers.

Let’s try it out:

walk_meadow_df("Scarecrow", "red", 1)
##      person   color
## 1 Scarecrow    blue
## 2 Scarecrow    blue
## 3 Scarecrow crimson
## 4 Scarecrow crimson
## 5 Scarecrow crimson
## 6 Scarecrow crimson
## 7 Scarecrow     red

Now we write the function to make the data frame of results for a group of people. pmap() will come in handy.

all_walk_df <- function(people, favs, numbers) {
  ## bundle the three argument vectors into a list:
  list(people, favs, numbers) %>%
  ## run it through pmap() to get a list of data frames:
  pmap(walk_meadow_df) %>%
  ## the following purrr function converts the list of df's
  ## into one data frame, binding them by rows:
  list_rbind()
}

Let’s try it out:

results <-
  all_walk_df(
  people = c("Dorothy", "Toto"),
  favs = c("blue", "orange"),
  numbers = c(4, 2)
)

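Because the flowers are picked at random, the results vary from run to run; printing results shows one row for each flower picked, with the columns person and color.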

14.3.4 Practice Exercises

  1. Use map() to produce a list of the squares of the whole numbers from 1 to 10.

  2. Use map_dbl() to produce a numerical vector of the squares of the whole numbers from 1 to 10.

  3. Use map_chr() to state the type of each element of the following list:

    lst <- list(
      letters,
      seq(2, 20, by = 2),
      c(1L, 5L, 7L),
      1:10 > 5.5
    )
  4. Here are some people:

    people <- c("Bettina", "Raj", "Isabella", "Khalil")

    The following vector tells whether or not each person is a Grand Poo-Bah:

    status <- c("humble", "poobah", "poobah", "humble")

    Use pwalk() to properly greet each person. The result in the console should be as follows:

    ## Yo, dawg.
    ## Hail, O Grand Poo-Bah Raj!
    ## Hail, O Grand Poo-Bah Isabella!
    ## Yo, dawg.

14.3.5 Solutions to the Practice Exercises

  1. Try this:

    map(1:10, ~ .^2)

    This is more verbose, but works just as well:

    map(1:10, function(x) x^2)
  2. Try this:

    map_dbl(1:10, ~ .^2)
    ##  [1]   1   4   9  16  25  36  49  64  81 100

    Again the more verbose approach works just as well:

    map_dbl(1:10, function(x) x^2)
  3. Try this:

    map_chr(lst, typeof)
    ## [1] "character" "double"    "integer"   "logical"
  4. Try this:

    list(people, status) %>% 
      pwalk(function(person, type) {
        if (type == "poobah") {
          cat(
            "Hail, O Grand Poo-Bah ",
            person, "!\n", sep = ""
          )
        } else {
          cat("Yo, dawg.\n")
        }
      })

14.4 Other purrr Higher-Order Functions

14.4.1 keep() and discard()

keep() is similar to dplyr’s filter(), but whereas filter() chooses rows of a data frame based on a given condition, keep() chooses the elements of the input list or vector .x based on a condition named .p.

Examples:

# keep the numbers that are 1 more than a multiple of 3
1:20 %>% 
  purrr::keep(.p = ~ . %% 3 == 1)
## [1]  1  4  7 10 13 16 19
# keep the factors in m111survey
m111survey %>% 
  purrr::keep(is.factor) %>% 
  str()
## 'data.frame':    71 obs. of  6 variables:
##  $ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
##  $ love_first  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ extra_life  : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
##  $ seat        : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
##  $ enough_Sleep: Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
##  $ sex         : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...

discard(.x, .p = condition) is equivalent to keep(.x, .p = !condition). Thus:

# discard numbers that are 1 more than a multiple of 3
1:20 %>% 
  purrr::discard(.p = ~ . %% 3 == 1)
##  [1]  2  3  5  6  8  9 11 12 14 15 17 18 20
# discard the factors in m111survey
m111survey %>% 
  purrr::discard(is.factor) %>% 
  str()
## 'data.frame':    71 obs. of  6 variables:
##  $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
##  $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
##  $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
##  $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
##  $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
##  $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...
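If you would like to check the relationship between discard() and keep() for yourself, here is a minimal sketch. It assumes that the purrr package is attached, and it uses purrr's negate() to reverse a condition. The helper one_more_than_multiple() is a name made up for this illustration.

```r
library(purrr)

# hypothetical helper: TRUE when x is 1 more than a multiple of 3
one_more_than_multiple <- function(x) x %% 3 == 1

# discard the elements that satisfy the condition ...
dropped <- discard(1:20, one_more_than_multiple)

# ... or keep the elements that satisfy the negated condition
kept <- keep(1:20, negate(one_more_than_multiple))

identical(dropped, kept)
## [1] TRUE
```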

14.4.2 reduce()

Another important member of the purrr family is reduce(). Given a vector .x and a function .f that takes two inputs, reduce() does the following:

  • applies .f to elements 1 and 2 of .x, getting a result;
  • applies .f to the result and to element 3 of .x, getting another result;
  • applies .f to this new result and to element 4 of .x, getting yet another result …
  • … and so on until all of the elements of .x have been exhausted;
  • then reduce() returns the final result in the above series of operations.
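The steps above can be sketched as an ordinary loop. The function my_reduce() below is a hypothetical stand-in written only for illustration. It is not how purrr implements reduce(), and it assumes that its input has at least one element.

```r
# hypothetical loop-based stand-in for reduce(); assumes length(x) >= 1
my_reduce <- function(x, f) {
  # start with the first element as the result so far
  acc <- x[[1]]
  # combine the result so far with each remaining element in turn
  for (i in seq_along(x)[-1]) {
    acc <- f(acc, x[[i]])
  }
  acc
}

my_reduce(c(3, 1, 4, 6), `+`)
## [1] 14
```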

For example, suppose that you want to add up the elements of the vector:

vec <- c(3, 1, 4, 6)

Of course you could just use:

sum(vec)
## [1] 14

After all, sum() has been written to apply to many elements at once. But what if addition could only be done two numbers at a time? How might you proceed? You could:

  • add 3 and 1 (the first two elements of vec), getting 4;
  • then add 4 to 4, the third element of vec, getting 8;
  • then add 8 to 6, the final element of vec, getting 14;
  • then return 14.

reduce() operates in this way.

vec %>%
  reduce(.f = sum)
## [1] 14

Can you see how reduce() gets its name? Step by step, it “reduces” its .x argument, which may consist of many elements, to a single value.
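If you would like to see the intermediate results of a reduction, purrr also provides accumulate(), which works like reduce() but returns every result along the way rather than only the last one. Here is a small sketch, assuming purrr is attached and using the addition operator as .f:

```r
library(purrr)

vec <- c(3, 1, 4, 6)

# the running results: 3, then 3 + 1, then 4 + 4, then 8 + 6
running <- accumulate(vec, `+`)
running
## [1]  3  4  8 14
```

The final element of the result is the value that reduce() would return.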

A common application of reduce() is to take an operation that is defined on only two items and extend it to operate on any number of items. Consider, for example, the function intersect(), which will find the intersection of any two vectors of the same type:

vec1 <- c(3, 4, 5, 6)
vec2 <- c(4, 6, 8, -4)
intersect(vec1, vec2)
## [1] 4 6

You cannot intersect three or more vectors at once:

intersect(vec1, vec2, c(4, 7, 9))
## Error in base::intersect(x, y, ...) : unused argument (c(4, 7, 9))

With reduce() you can intersect as many vectors as you like, provided that they are first stored in a list.

lst <- list(
  c("Akash", "Bipan", "Chandra", "Devadatta", "Raj"),
  c("Raj", "Vikram", "Sita", "Akash", "Chandra"),
  c("Akash", "Raj", "Chandra", "Bipan", "Lila"),
  c("Akash", "Vikram", "Devadatta", "Raj", "Lila")
)
lst %>% 
  reduce(intersect)
## [1] "Akash" "Raj"
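One way to convince yourself of what reduce() is doing here is to compare it with the nested calls to intersect() that you would otherwise have to write by hand. The vectors a, b and d below are made up for this sketch, which assumes that purrr is attached.

```r
library(purrr)

# three small made-up vectors
a <- c(1, 2, 3, 4)
b <- c(2, 3, 4, 5)
d <- c(3, 4, 5, 6)

# intersect two at a time, by hand
nested <- intersect(intersect(a, b), d)

# let reduce() do the nesting
reduced <- reduce(list(a, b, d), intersect)

identical(nested, reduced)
## [1] TRUE
```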

You can write your own function to supply as the argument for .f, but it has to be able to operate on two arguments. reduce() will take the first argument of the .f function to be what has been “accumulated” so far, and the second argument of the .f function—the value to be combined with what has been accumulated—will be provided by the current element of .x.

As a simple example, let’s write our own reduce-summer in a way that shows the user the reduction process at work:

## the .f function:
my_summer <- function(acc, curr) {
  cat("So far I have ", acc, ",\n")
  cat(
    "But just now I was given " , curr, 
    " to add in.\n\n", sep = ""
  )
  sum(acc, curr)
}

## .x will be the whole numbers from 1 to 4:
1:4 %>% 
  reduce(.f = my_summer)
## So far I have  1 ,
## But just now I was given 2 to add in.
## 
## So far I have  3 ,
## But just now I was given 3 to add in.
## 
## So far I have  6 ,
## But just now I was given 4 to add in.
## [1] 10

When you write your own .f function, it’s a good idea to use names for the parameters that remind you of their role in the reduction process. acc (for “accumulated”) and curr (for “current”) are used above.

reduce() can take an argument called .init. When this argument is given a value, operation begins by applying .f to .init and the first element of .x. For example:

1:4 %>% 
  reduce(.f = my_summer, .init = 100)
## So far I have  100 ,
## But just now I was given 1 to add in.
## 
## So far I have  101 ,
## But just now I was given 2 to add in.
## 
## So far I have  103 ,
## But just now I was given 3 to add in.
## 
## So far I have  106 ,
## But just now I was given 4 to add in.
## [1] 110
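A practical benefit of .init is that it gives reduce() something to return when .x turns out to be empty. The following sketch assumes purrr is attached. The name attempt is introduced only for this illustration.

```r
library(purrr)

# with .init, an empty input simply returns the initial value
reduce(numeric(0), `+`, .init = 0)
## [1] 0

# without .init, reduce() has nothing to start from and throws an error;
# we catch the error here so that the script can continue
attempt <- tryCatch(
  reduce(numeric(0), `+`),
  error = function(e) "reduce() threw an error"
)
attempt
## [1] "reduce() threw an error"
```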

14.4.2.1 An Extended Example of Reduction

Let’s apply reduce() with .init to the task of making a truth table: the set of all \(2^n\) logical vectors of a given length \(n\).

The set \(S_1\) of vectors of length \(n = 1\) consists of only two vectors:

##           
## vec1  TRUE
## vec2 FALSE

Now consider a systematic way to construct the set \(S_2\) of all the vectors of length two. We know that there are four such vectors:

##                 
## vec1  TRUE  TRUE
## vec2  TRUE FALSE
## vec3 FALSE  TRUE
## vec4 FALSE FALSE

Observe that the first two of them begin with TRUE and end with the set \(S_1\) of vectors of length one:

##                
## vec1 TRUE  TRUE
## vec2 TRUE FALSE

The last two of them begin with FALSE and also end with \(S_1\):

##                 
## vec3 FALSE  TRUE
## vec4 FALSE FALSE

Now consider \(S_3\), the set of all eight vectors of length three:

##                       
## vec1  TRUE  TRUE  TRUE
## vec2  TRUE  TRUE FALSE
## vec3  TRUE FALSE  TRUE
## vec4  TRUE FALSE FALSE
## vec5 FALSE  TRUE  TRUE
## vec6 FALSE  TRUE FALSE
## vec7 FALSE FALSE  TRUE
## vec8 FALSE FALSE FALSE

Observe that the first four of them begin with TRUE and end with the vectors of \(S_2\):

##                      
## vec1 TRUE  TRUE  TRUE
## vec2 TRUE  TRUE FALSE
## vec3 TRUE FALSE  TRUE
## vec4 TRUE FALSE FALSE

The last four of them begin with FALSE and also end with the vectors of \(S_2\):

##                       
## vec5 FALSE  TRUE  TRUE
## vec6 FALSE  TRUE FALSE
## vec7 FALSE FALSE  TRUE
## vec8 FALSE FALSE FALSE

The pattern is now clear. If for any \(m \ge 1\) you are in possession of the \(2^m \times m\) matrix \(S_m\) of all possible vectors of length \(m\), then to obtain the \(2^{m+1} \times (m+1)\) matrix \(S_{m+1}\) of all possible vectors of length \(m+1\) you should:

  • stack \(2^m\) TRUEs on top of \(2^m\) FALSEs, creating a \(2^{m+1} \times 1\) matrix \(U\);
  • stack \(S_m\) on top of a copy of itself, creating a \(2^{m+1} \times m\) matrix \(V\);
  • place \(U\) next to \(V\).
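A single step of this recipe might be sketched as follows. The function next_table() is a hypothetical helper written for illustration only. For simplicity it leaves out row and column names.

```r
# hypothetical helper: build S_(m+1) from the matrix S_m
next_table <- function(s_m) {
  r <- nrow(s_m)                          # r is 2^m
  u <- c(rep(TRUE, r), rep(FALSE, r))     # U: r TRUEs on top of r FALSEs
  v <- rbind(s_m, s_m)                    # V: S_m stacked on a copy of itself
  unname(cbind(u, v))                     # place U next to V, dropping names
}

s1 <- matrix(c(TRUE, FALSE), nrow = 2)
s2 <- next_table(s1)
s2
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE FALSE
## [3,] FALSE  TRUE
## [4,] FALSE FALSE
```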

reduce() with .init set to \(S_1\) is appropriate for this iterative building process. Here is an implementation:

make_table <- function(n, verbose = FALSE) {
  # make .init (S_1)
  s1 <- matrix(c(TRUE, FALSE), nrow = 2)
  rownames(s1) <- c("vec1", "vec2")
  colnames(s1) <- c("")
  
  # make .f
  build_next <- function(accum, value) {
    if (verbose) {
      cat(
        "On value ", value, 
        " with accumulated material:",
        sep = ""
      )
      print(accum)
    }
    if (value == 1) return(accum)
    r <- nrow(accum)
    u <- c(
      rep(TRUE, times = r),
      rep(FALSE, times = r)
    )
    v <- rbind(accum, accum)
    next_matrix <- cbind(u, v)
    colnames(next_matrix) <- rep("", times = value)
    rownames(next_matrix) <- paste(
      "vec", 1:(2^value), sep = ""
    )
    if (verbose) {
      cat(
        "Finishing value", value, 
        ", and I've built:",
        sep = ""
      )
      print(next_matrix)
      cat("\n\n")
    }
    next_matrix
  }
  
  # build from .init to the final product S_n
  reduce(.x = 1:n, .f = build_next, .init = s1)
}

We have included a verbose option so we can watch the process as it unfolds.

Note also that the parameters for the .f function are named:

  • accum (what has been “accumulated” up to the current step), and
  • value (the value of .x at the current step).

It’s conventional to give these or similar names to the parameters of the building-function.

Let’s try it out:

make_table(3, verbose = TRUE)
## On value 1 with accumulated material:          
## vec1  TRUE
## vec2 FALSE
## On value 2 with accumulated material:          
## vec1  TRUE
## vec2 FALSE
## Finishing value2, and I've built:                
## vec1  TRUE  TRUE
## vec2  TRUE FALSE
## vec3 FALSE  TRUE
## vec4 FALSE FALSE
## 
## 
## On value 3 with accumulated material:                
## vec1  TRUE  TRUE
## vec2  TRUE FALSE
## vec3 FALSE  TRUE
## vec4 FALSE FALSE
## Finishing value3, and I've built:                      
## vec1  TRUE  TRUE  TRUE
## vec2  TRUE  TRUE FALSE
## vec3  TRUE FALSE  TRUE
## vec4  TRUE FALSE FALSE
## vec5 FALSE  TRUE  TRUE
## vec6 FALSE  TRUE FALSE
## vec7 FALSE FALSE  TRUE
## vec8 FALSE FALSE FALSE
##                       
## vec1  TRUE  TRUE  TRUE
## vec2  TRUE  TRUE FALSE
## vec3  TRUE FALSE  TRUE
## vec4  TRUE FALSE FALSE
## vec5 FALSE  TRUE  TRUE
## vec6 FALSE  TRUE FALSE
## vec7 FALSE FALSE  TRUE
## vec8 FALSE FALSE FALSE

Of course in practice we would not turn on the verbose option:

make_table(4)
##                              
## vec1   TRUE  TRUE  TRUE  TRUE
## vec2   TRUE  TRUE  TRUE FALSE
## vec3   TRUE  TRUE FALSE  TRUE
## vec4   TRUE  TRUE FALSE FALSE
## vec5   TRUE FALSE  TRUE  TRUE
## vec6   TRUE FALSE  TRUE FALSE
## vec7   TRUE FALSE FALSE  TRUE
## vec8   TRUE FALSE FALSE FALSE
## vec9  FALSE  TRUE  TRUE  TRUE
## vec10 FALSE  TRUE  TRUE FALSE
## vec11 FALSE  TRUE FALSE  TRUE
## vec12 FALSE  TRUE FALSE FALSE
## vec13 FALSE FALSE  TRUE  TRUE
## vec14 FALSE FALSE  TRUE FALSE
## vec15 FALSE FALSE FALSE  TRUE
## vec16 FALSE FALSE FALSE FALSE

14.4.3 Practice Exercises

  1. The operator * (multiplication) is really a function:

    `*`(3,5)
    ## [1] 15

    But it can only multiply two numbers at once. The R-function prod() can handle as many numbers as you like:

    prod(3,5,2,7)
    ## [1] 210

    Use reduce() and * to write your own function product() that takes a numerical vector vec and returns the product of the elements of the vector. It should work like this:

    product(vec = c(3,4,5))
    ## [1] 60

    (Hint: in the call to reduce() you will have to refer to the *-function as `*`.)

  2. Modify the function product() so that in a single call to reduce() it multiplies the number 2 by the product of the elements of vec. (Hint: set .init to an appropriate value.)

  3. The data frame iris gives information on 150 irises. Use keep() to create a new data frame that includes only the numerical variables having a mean greater than 3.5.

14.4.4 Solutions to the Practice Exercises

  1. Try this:

    product <- function(vec) {
      reduce(vec, .f = `*`)
    }
  2. Try this:

    product <- function(vec) {
      reduce(vec, .f = `*`, .init = 2)
    }
  3. Try this:

    big_iris <-
      iris %>%
      keep(is.numeric) %>% 
      keep(~ mean(.) > 3.5)
    str(big_iris)
    ## 'data.frame':    150 obs. of  2 variables:
    ##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    ##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

    The following does not work. Why?

    big_iris <-
      iris %>%
        keep(function(x) {
          is.numeric(x) & mean(x) > 3.5
        }
      )
    }

14.5 Functionals vs. Loops

The higher-order functions we have studied in this chapter are often called functionals. As we pointed out earlier, they deliver results that could have been produced by writing a loop of some sort.

Once you get used to functionals, you will find that they are often more “expressive” than loops—easier for others to read and to understand, and less prone to bugs. Also, many of them are optimized by the developers of R to run a bit faster than an ordinary loop written in R.

For example, consider the following list. It consists of ten thousand vectors, each of which contains 100 randomly-generated numbers.

lst <- map(rep(100, 10000), runif)

If we want the mean of each vector, we could write a loop:

means <- numeric(10000)
for (i in 1:10000) {
  means[i] <- mean(lst[[i]])
}

Or we could use map_dbl():

means <- map_dbl(lst, mean)

Timing the map_dbl() version with system.time() on my machine, I got:

system.time(means <- map_dbl(lst, mean))
##   user  system elapsed 
##  1.557   0.073   1.630 

For the loop, I got:

system.time({
  means <- numeric(10000)
  for (i in 1:10000) {
    means[i] <- mean(lst[[i]])
  }
})
##   user  system elapsed 
##  1.653   0.075   1.730 

The map-function is a bit faster, but the difference is small.

Remember also that vectorization is much faster than looping, and is also usually quite expressive, so don’t struggle to take a functional approach when vectorization is possible. (This advice applies to several examples from this Chapter, in which the desired computations had already been accomplished in earlier chapters by some form of vectorization.)
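As an illustration of this advice, the means computed above can also be obtained in a vectorized way. The sketch below rebuilds a smaller version of the list (one thousand vectors rather than ten thousand), binds the vectors into the rows of a matrix, and applies the base-R function rowMeans(). It assumes purrr is attached. The seed and the names means_map, mat and means_vec are arbitrary choices made for this example.

```r
library(purrr)

set.seed(2024)  # arbitrary seed, so that the sketch is reproducible
lst <- map(rep(100, 1000), runif)

# functional approach: one call to mean() per vector
means_map <- map_dbl(lst, mean)

# vectorized approach: one row per vector, then a single call to rowMeans()
mat <- do.call(rbind, lst)
means_vec <- rowMeans(mat)

all.equal(means_map, means_vec)
## [1] TRUE
```

This works only because every vector in the list has the same length. You can wrap either approach in system.time() to compare their speeds on your own machine.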

14.6 Conclusion

In this Chapter we have concentrated on only a single aspect of the Functional Programming paradigm: exploiting the fact that functions are first-class citizens in R, we studied a number of higher-order functions that can substitute for loops. There is certainly a great deal more to Functional Programming than the mere avoidance of loops, but we’ll end our study at this point. Familiarity with higher-order functions will stand you in good stead when you begin, in subsequent courses on web programming, to learn the JavaScript language. JavaScript makes constant use of higher-order functions!

Glossary

Programming Paradigm

A programming paradigm is a way to describe some of the features of programming languages. Often a paradigm includes principles concerning the use of these features, or embodies a view that these features have special importance and utility in good programming practice.

Procedural Programming

A programming paradigm that solves problems with programs that can be broken up into collections of variables, data structures and procedures. This paradigm tends to draw a sharp distinction between variables and data structures on the one hand and procedures on the other.

Functional Programming

A programming paradigm that stresses the central role of functions. Some of its basic principles are:

  • Computation consists in the evaluation of functions.
  • Functions are first-class citizens in the language.
  • Functions should only return values; they should not produce side-effects.
  • As much as possible, procedures should be written in terms of function calls.

Pure Function

A function that does not produce side-effects.

Side Effect

A change in the state of the program (e.g., a change in the Global Environment) or any interaction external to the program (e.g., printing to the console).

Higher-Order Function

A function that takes another function as an argument.

Anonymous Function

A function that does not have a name.

Refactoring

The act of rewriting computer code so that it performs the same task as before, but in a different way. (This is usually done to make the code more human-readable or to make it perform the task more quickly.)

Exercises

  1. Explain in words what the following line of code produces when given a numerical vector y:

    map(y, function(x) x^3 + 1)

    In the course of your explanation, say whether the result is a vector or a list.

  2. Which do you think works faster for a given numerical vector y? This code:

    map(y, function(x) sqrt(x))

    Or this code?

    sqrt(y)

    Justify your answer with a convincing example, using system.time(). What moral do you draw from this?

  3. To refactor computer code is to rewrite the code so that it does the same thing, but in a different way. We might refactor code in order to make it more readable by humans, or to make it perform its task more quickly.

    Refactor the following code so that it uses keep() instead of a loop:

    df <- bcscr::m111survey
    keep_variable <- logical(length(names(df)))
    for (col in seq_along(keep_variable)) {
      var <- df[, col]
      is_numeric <- is.numeric(var)
      all_there <- !any(is.na(var))
      keep_variable[col] <- is_numeric && all_there
    }
    new_frame <- df[, keep_variable]
    head(new_frame)
  4. The following function produces a list of vectors of uniform random numbers, where the lower and upper bounds of the numbers are given by the arguments to the parameters lower and upper respectively, and the number of vectors in the list and the number of random numbers in each vector are given by a vector supplied to the parameter vecs.

    random_sims <- function(vecs, lower = 0, upper= 1, seed = NULL) {
      # set seed if one is provided by the user
      if (!is.null(seed)) {
        set.seed(seed)
      }
    
      lst <- vector(mode = "list", length = length(vecs))
      for (i in seq_along(vecs)) {
        lst[[i]] <- runif(vecs[i], min = lower, max = upper)
      }
      lst
    }

    Refactor the code for random_sims() so that it uses map() instead of a loop.

  5. The following enhanced version of random_sims() is even more flexible, as it allows both the upper and lower limits for the randomly-generated numbers to vary with each vector of numbers that is produced.

    random_sims2 <- function(vecs, lower, upper, seed = NULL) {
      # validate input
      if (!(length(vecs) == length(upper) && length(upper) == length(lower)) ) {
        return(
          cat("All vectors entered must have the same length.")
        )
      }
      if (any(upper < lower)) {
        return(
          cat(paste0(
            "Every upper bound must be at least as ",
            "big as the corresponding lower bound."
            )
          )
        )
      }
      # set seed if one is provided by the user
      if (!is.null(seed)) {
        set.seed(seed)
      }
    
      lst <- vector(mode = "list", length = length(vecs))
      for (i in seq_along(vecs)) {
        lst[[i]] <- runif(
          vecs[i], min = lower[i], max = upper[i]
        )
      }
      lst
    }

    Use pmap() to refactor the code for random_sims2() so as to avoid using the loop.

  6. Supposing that y is a numerical vector, explain in words what the following code produces:

    y %>% keep(function(x) x >= 4)
  7. Write a line of code using the sub-setting operator [ that produces the same result as the code in the previous problem.

  8. Use keep() to write a function called odd_members() that, given any numerical vector, returns a vector containing the odd numbers of the given vector. Your function should take a single argument called vec, the given vector. A typical example of use would be as follows:

    odd_members(vec = 1:10)
    ## [1] 1 3 5 7 9
  9. You are given the following list of character vectors:

    lst <- list(
      c("Akash", "Bipan", "Chandra", "Devadatta", "Raj"),
      c("Raj", "Vikram", "Sita", "Akash", "Chandra"),
      c("Akash", "Raj", "Chandra", "Bipan", "Lila"),
      c("Akash", "Vikram", "Devadatta", "Raj", "Lila")
    )

    Use reduce() and the union() function to obtain a character vector that is the union of all the vectors in lst.

  10. Remember the function subStrings() from the exercises of the Chapter on Strings? Refactor it so that it does EXACTLY the same thing but makes no use of loops.

  11. Solve Part One of Advent of Code 2022 Day 3. Save your input file in your submit folder with the filename input_aoc_2022-03.txt, and read in the input data, naming it input, using the following code:

    input <- readLines("input_aoc_2022-03.txt")
  12. Solve Part One of Advent of Code 2022 Day 25. Save your input file in your submit folder with the filename input_aoc_2022-25.txt, and read in the input data, naming it input, using the following code:

    input <- readLines("input_aoc_2022-25.txt")

    (Hint: The snafu numbering-system bears some relationship to base-5 numbering. After reviewing Section 11.4, write two helper-functions: one to convert numbers to “base-snafu” and another to convert base-snafu representations to numbers.)