0.1 The Why of These Notes: Remarks for Colleagues
There is a plethora of books on R, covering pretty much every domain of application of the language, from ecology to spatial statistics to machine learning and data science. There are even some books—among the very finest of R-books, in my view—on R as a programming language.
These Notes are not intended to supersede or to compete with any of the existing literature on R. Instead they are intended to serve the curricular needs of the Computer Science minor at the College where I teach—a minor that emphasizes data analysis primarily, but with a strong focus also on web design and on the increasingly important interfaces between these two areas (e.g., interactive graphics and web apps for working with data and/or reporting the results of data analysis). Students will undertake a serious study of two major scripting languages: R for data analysis and JavaScript for web programming, both from a fairly systematic programming point of view, with due attention to procedural, functional and (to a lesser extent) object-oriented programming paradigms.
The question is: which language to use in the freshman year? Some institutions are moving toward JavaScript: in fact Stanford University piloted JavaScript in several sections of its introductory CS course in the Fall of 2017. There are certainly considerations in favor of a JavaScript-first approach: it’s a popular language, with Node available as an interactive run-time environment and the browser as an environment in which exciting applications can be built quickly. And whereas R is less widely used and is still considered a domain-specific language, JavaScript can rightly be said to have made the leap into the ranks of general-purpose programming languages. R also has the reputation of being a prickly language with a somewhat inconsistent syntax and with documentation that is “expert-friendly” at best.
On the other hand, R is designed for one-line interactivity at the console, so it’s possible for a beginner to get simple programs working quickly. The R-ecosystem has also become a lot more user-friendly in recent years. The RStudio IDE is comparable to top-flight integrated development environments for many other major languages and yet is still relatively lightweight and accessible to beginners. The Server version of RStudio is especially useful for new programmers, as it saves them from having to deal with installation and other IT issues on their own machines, permitting them to focus on coding. It’s also quite convenient, in a server setting, to make class materials available and to collect and return assignments. R Markdown is a fine platform for producing course notes (this book is written in R Markdown with the excellent bookdown package (Xie 2020)) and slides as well. Students, too, can use R Markdown to both write and discuss their programs in a single document. The blogdown package (Xie, Dervieux, and Presmanes Hill 2021) permits students to begin writing for the public about technical programming issues (or about anything at all, really, as more than a few of them are taking majors in the Humanities), thus building up a professional resume of online work. When it’s time to learn about databases, students can leverage a body of recent work (see Databases Using R) that renders the RStudio environment nearly as friendly for interaction with databases as dedicated tools such as MySQL Workbench. Finally, the shiny package (Chang et al. 2020) permits students to build simple interactive web apps for data analysis that can be used by non-coders. Both blogdown and shiny prompt students to consider early on (even in the first year, if the pacing is right) concepts of web design, the other focus of the minor.
Hence the choice was made to teach a first-year computer science course, to beginning programmers, with R. As I pointed out earlier, there do exist some excellent books on R as a programming language that do not presume previous experience with R. One example is Norman Matloff’s The Art of R Programming (Matloff 2011). Matloff, however, presumes that the reader either has prior programming experience in some other language or else possesses sufficient computational maturity, acquired perhaps through extensive prior training in the mathematical sciences. Another great text is Garrett Grolemund’s Hands-on Programming with R (Grolemund 2014). Grolemund’s book is lively and to-the-point, and starts off with excellent motivating examples. Grolemund is also a master explainer, and he has put considerable effort into visual representation of programming concepts such as element-wise operations on vectors and the enclosure-relationships between environments. On the other hand, even though he doesn’t assume that the reader has prior coding experience, Grolemund does assume some prior background in data analysis and a strong motivation, on the reader’s part, to persevere with nontrivial R-programming issues such as lexical scoping in the hopes of eventual payoffs in programming for data science. In short, Grolemund also assumes more computational maturity than will usually be found among beginning programmers at many small liberal arts colleges.
Hence the niche for the Notes offered here. I aim to be more copious and slower-paced than Grolemund and less sophisticated than Matloff. These notes will also contain a more extensive set of problems, ranging in difficulty from practice exercises to fairly extended projects that students might write up in R Markdown documents.
Experienced programmers and R enthusiasts will be struck by the absence of certain topics. Programmers will observe that there is no real attention to algorithms (sorting is just sort() or order()), and although functions receive lots of attention there is no mention of recursion. In future editions I might cover recursion, as I believe that it is wonderful for the development of thinking skills, but it’s not likely that a web developer or data analyst would have the need to write a recursive function. Time spent on recursion and on various efficient algorithms for sorting and searching may be better spent, in my view, on extended programming projects, Shiny apps and blogging, and the introduction of programmer’s trade-tools such as version control and GitHub. I hope by the end of the first year to have made time for all of these out-of-book topics.
The Notes give more attention to base R functions than other introductory texts directed to data analysts, but we do introduce elements of the tidyverse as appropriate. The pipe operator is introduced in connection with data frames, ggplot2 and graphing are treated in some detail, string operations and regular expressions are managed primarily with stringr, and the approach to higher-order functions is through purrr. Full treatment of data wrangling is deferred, however, to later courses.
The first-semester course is required for mathematics and physics majors and for students in our pre-engineering program, so a central application of the early material is simulation of random processes. I believe that this makes the Notes relevant for students in other disciplines—e.g., biology and finance—in a way that complements their use of R for data analysis.
Two of the most fundamental topics in any comprehensive discussion of the R language, lexical scoping and computing on the language, are absent from this book. Lexical scoping and its implications are mentioned only in a brief footnote. Partly this is due to the fact that most of the elementary applications of lexical scoping mentioned in the literature are related to scientific computing, which won’t be a concern for most of my students. Certainly lexical scoping is important for understanding how R-packages work, but elementary students don’t author packages. As for computing on the language, it is true that users are affected by it all the time (e.g., whenever they use functions with a formula interface), but generally one need not perform any computation on the language until one begins writing formula-interface functions for the benefit of casual R-users.
On the other hand I have made some effort to explore programming paradigms other than procedural programming, perhaps in a bit more depth than in other elementary texts that teach with a scripting language. There is a chapter on functional programming that, although it admittedly does not get far into the functional paradigm, at least does treat extensively of R’s support for higher-order functions. A chapter on object-oriented programming covers not only the generic-function OO that has been with R from the start but also an implementation of message-passing OO (Winston Chang’s R6 package (Chang 2020)). My hope is that these topics will not only sharpen my students’ R-programming skills but also prepare them for encounters with the OO-methods and higher-order functions that are ubiquitous in JavaScript. Finally, there is a pretty serious chapter on regular expressions, because:
- they are useful in data analysis;
- I have not found a treatment of regular expressions in R that a person without significant prior exposure to them in other languages has a prayer of following;
- and if you master regular expressions then you feel like a wizard.
As for the numerous Wizard of Oz-themed examples, I can offer no defense other than haste in composition and the fact that the Wizard of Oz is now in the public domain.