Preface

Remarks for Students

These notes will be your primary source for CSC 115: Computer Science I, offered at Georgetown College. They will also carry you most of the way through CSC 215.

For the first semester you won’t need a computer of your own: you can do all of your work on the R Studio Server, which you will access with your College network username and password. Eventually, though, you will need to learn how to install and maintain professional software development tools on your own machine, so at some point in CSC 215 you will install R, R Studio and various other tools on a laptop of your own.

Instructional videos on selected topics will be published from time to time on this YouTube Channel: https://www.youtube.com/user/GCstats, in the CSC playlist.

These notes are about the R programming language as such, so although in the class we will work within the R Studio Integrated Development Environment right from the start, R Studio is not directly covered here. Eventually we will begin to write documents in R Markdown, which is also not treated in these Notes. For a resource on these topics in written form that will supplement class instruction and the videos on our YouTube Channel, you might want to consult the excellent little book Getting used to R, RStudio, and R Markdown (Ismay 2016).

Remarks for Colleagues

There is a plethora of books on R, covering pretty much every domain of application of the language, from ecology to spatial statistics to machine learning and data science. There are even some books—among the very finest of R-books, in my view—on R as a programming language.

These notes are not intended to supersede or to compete with any of the existing literature on R. Instead they are intended to serve the curricular needs of the Computer Science minor at the College where I teach—a minor that emphasizes data analysis primarily, but with a strong focus also on web design and on the increasingly important interfaces between these two areas (e.g., interactive graphics and web apps for working with data and/or reporting the results of data analysis). Students will undertake a serious study of two major scripting languages: R for data analysis and JavaScript for web programming, both from a fairly systematic programming point of view, with due attention to procedural, functional and (to a lesser extent) object-oriented programming paradigms.

The question is: which language to use in the freshman year? Some institutions are moving toward JavaScript: in fact Stanford University piloted JavaScript in several sections of its introductory CS course in the Fall of 2017. There are certainly considerations in favor of a JavaScript-first approach: it’s a popular language, with Node available as an interactive run-time environment and the browser as an environment in which exciting applications can be built quickly. And whereas R is less widely used and is still considered a domain-specific language, JavaScript can rightly be said to have made the leap into the ranks of general-purpose programming languages. R also has the reputation of being a prickly language with a somewhat inconsistent syntax and with documentation that is “expert-friendly” at best.

On the other hand R is designed for one-line interactivity at the console, so it’s possible for a beginner to get simple programs working quickly. The R-ecosystem has also become a lot more user-friendly in recent years. The RStudio IDE is comparable to top-flight integrated development environments for many other major languages and yet is still relatively lightweight and accessible to beginners. The Server version of R Studio is especially useful for new programmers, as it saves them from having to deal with installation and other IT issues on their own machines, permitting them to focus on coding. It’s also quite convenient, in a server setting, to make class materials available and to collect and return assignments. R Markdown is a fine platform for producing course notes (this book is written in R Markdown with the excellent bookdown package (Xie 2024)) and slides as well. Students, too, can use R Markdown to both write and discuss their programs in a single document. The blogdown package (Xie, Dervieux, and Presmanes Hill 2024) permits students to begin writing for the public about technical programming issues (or about anything at all, really, as more than a few of them are taking majors in the Humanities), thus building up a professional resume of online work. When it’s time to learn about databases, students can leverage a body of recent work (see Databases Using R) that renders the R Studio environment nearly as friendly for interaction with databases as dedicated tools such as MySQL Workbench. Finally, the shiny package (Chang et al. 2024) permits students to build simple interactive web apps for data analysis that can be used by non-coders. Both blogdown and shiny prompt students to consider early on (even in the first year, if the pacing is right) concepts of web design, the other focus of the minor.

Hence in 2017 the choice was made to teach a first-year computer science course, to beginning programmers, with R. As I pointed out earlier, there do exist some excellent books on R as a programming language that do not presume previous experience with R. One example is Norman Matloff’s The Art of R Programming (Matloff 2011). Matloff, however, presumes that the reader either has prior programming experience in some other language or else possesses sufficient computational maturity, acquired perhaps through extensive prior training in the mathematical sciences. Another great text is Garrett Grolemund’s Hands-on Programming with R (Grolemund 2014). Grolemund’s book is lively and to-the-point, and starts off with excellent motivating examples. Grolemund is also a master explainer, and he has put considerable effort into visual representation of programming concepts such as element-wise operations on vectors and the enclosure-relationships between environments. On the other hand, even though he doesn’t assume that the reader has prior coding experience, Grolemund does assume some prior background in data analysis and a strong motivation, on the reader’s part, to persevere with nontrivial R-programming issues such as lexical scoping in the hopes of eventual payoffs in programming for data science. In short, Grolemund also assumes more computational maturity than will usually be found among beginning programmers at many small liberal arts colleges.

Hence the niche for the Notes offered here. I aim to be more copious and slower-paced than Grolemund and less sophisticated than Matloff. These notes will also contain a more extensive set of problems, ranging in difficulty from practice exercises to fairly extended projects that students might write up in R Markdown documents.

Experienced programmers and R enthusiasts will be struck by the absence of certain topics. Programmers will observe that there is no real attention to algorithms (sorting is just sort() or order()), and although functions receive lots of attention there is no mention of recursion. I hope soon to add a chapter on recursion, as I believe that it is wonderful for the development of thinking skills, but it’s not likely that a web developer or data analyst would have the need to write a recursive function. Time spent on recursion and on various efficient algorithms for sorting and searching may be better spent, in my view, on extended programming projects, Shiny apps and blogging, and the introduction of programmer’s trade-tools such as version control and GitHub.

The Notes give more attention to base R functions than other introductory texts directed to data analysts, but we do introduce elements of the tidyverse as appropriate. The pipe operator is introduced in connection with data frames, ggplot2 and graphing are treated in some detail, string operations and regular expressions are managed primarily with stringr, and the approach to higher-order functions is through purrr. Full treatment of data wrangling is deferred, however, to later courses.

The first-semester course is required for mathematics and physics majors and for students in our pre-engineering program, so a central application of the early material is simulation of random processes. I believe that this makes the Notes relevant for students in other disciplines—e.g., biology and finance—in a way that complements their use of R for data analysis.

Two of the most fundamental topics in any comprehensive discussion of the R language (lexical scoping and computing on the language) are absent from this book. Lexical scoping and its implications are mentioned only in a brief footnote (though it will get a fuller treatment when recursion is added to the book). Partly this is due to the fact that most of the elementary applications of lexical scoping mentioned in the literature are related to scientific computing, which won’t be a concern for most of my students. Certainly lexical scoping is important for understanding how R-packages work, but elementary students don’t author packages. As for computing on the language, it is true that users are affected by it all the time (e.g., whenever they use functions with a formula interface), but generally one need not perform any computation on the language until one begins writing formula-interface functions for the benefit of casual R-users.

On the other hand I have made some effort to explore programming paradigms other than procedural programming, perhaps in a bit more depth than in other elementary texts that teach with a scripting language. There is a chapter on functional programming that, although it admittedly does not get far into the functional paradigm, at least does treat extensively of R’s support for higher-order functions. A chapter on object-oriented programming covers not only the generic-function OO that has been with R from the start but also an implementation of message-passing OO (Winston Chang’s R6 package (Chang 2021)). My hope is that these topics will not only sharpen my students’ R-programming skills but also prepare them for encounters with the OO-methods and higher-order functions that are ubiquitous in JavaScript. Finally, there is a pretty serious chapter on regular expressions, because:

  • they are useful in data analysis;
  • I have not found a treatment of regular expressions in R that a person without significant prior exposure to them in other languages has a prayer of following;
  • and because if you master regular expressions then you feel like a wizard.

As for the numerous Wizard of Oz-themed examples, I can offer no defense other than haste in composition (due to various curricular exigencies, the first version of the notes had to be written from scratch over a period of about two months in the spring of 2017) and the fact that The Wizard of Oz, being now in the public domain, was a convenient source.

History of R

The story of R begins at Bell Labs in 1975, with the development, by John Chambers and several other colleagues, of the S language for statistical computing. The language became well-known among statisticians and data analysts, especially in the academic community.

In the early 1990s Ross Ihaka of the University of Auckland in New Zealand was making a study of the Scheme language as described in the classic MIT textbook Structure and Interpretation of Computer Programs (Harold Abelson and Sussman 1996), and was impressed with the possibilities of the language for data analysis applications. Desiring to build a free analysis tool for his graduate students, Ihaka recruited his Auckland colleague Robert Gentleman for the project of developing a language with an external syntax similar to S but with an underlying engine based heavily upon Scheme. Because of the similarity with the better-known S (or, by some accounts, because of the initial letter in the first names of both men) they named their new language “R”.

Initially Ihaka and Gentleman assumed that their work on the new language was scarcely more than “playing games” and that it would not be used outside of the University of Auckland. Eventually, though, the two placed a small announcement of their project on the s-news email list and began to draw the interest of other statisticians, including Martin Mächler of the Swiss Federal Institute of Technology in Zurich, Switzerland. Mächler saw great potential for R, and in 1995 persuaded Ihaka and Gentleman to release it as “free software” under the GNU General Public License. The decision to make R free stimulated further interest in the language and encouraged many experts in statistical computation to become involved in its further development.

The first official public release of R (version 1.0.0) occurred on February 29, 2000. Since that time R has grown in popularity at an increasing rate, to the point where it is by now one of the world’s most widely-used domain-specific computer languages, ranking among the top dozen computer languages overall.

Many people have contributed to the development of R. As of the composition of this History, the Comprehensive R Archive Network (CRAN) hosted 10,633 contributed packages, each of which aims to extend the capabilities of R in a specific way. R is usually the platform in which new statistical techniques are first implemented by the researchers who develop them. It is widely used in the sciences, business and finance.

Miscellaneous Information

R Packages

This book uses the following contributed R packages that are currently available from CRAN, the Comprehensive R Archive Network:

  • devtools
  • ggplot2
  • mosaicData
  • pryr
  • R6
  • tidyverse

If you are working on the Georgetown College R Studio Server then they are already installed for you. If you have your own installation then you can get them via the install.packages() function, e.g.,

install.packages("devtools")
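If you would rather handle all of the CRAN packages listed above in one step, something along the following lines should work. This is only a sketch: the helper name missing_packages() and the variables cran_packages and to_install are made up for this illustration, and the interactive() guard is an assumption about how you would want the script to behave when it is run non-interactively.

```r
# The CRAN packages named in the list above
cran_packages <- c("devtools", "ggplot2", "mosaicData",
                   "pryr", "R6", "tidyverse")

# Hypothetical helper: return the packages in pkgs that cannot be loaded,
# i.e., those that are not yet installed on this machine.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(cran_packages)

# Install only from an interactive session (e.g., the R Studio console),
# so that running this file as a batch script never starts a download.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

On the Georgetown College R Studio Server, where the packages are already installed, to_install should come back empty and nothing is downloaded.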

In addition we use two packages from the author’s GitHub repositories:

  • bcscr (functions and data sets to accompany the text)
  • tigerData (data sets used in examples and exercises)

If you aren’t working on the R Studio Server at Georgetown College then you may have to install them yourself, as follows:

devtools::install_github("homerhanumat/bcscr")
devtools::install_github("homerhanumat/tigerData")

Typographical Conventions

Computer code, whether within a line of text or as displayed text, appears like this: code snippet.

Identifiers and keywords are represented as is (e.g., variable_name, if, else, while, etc.) except for the names of functions, which are followed by a pair of parentheses in order to stress their status as functions. Thus we write length() for the length-function.

Displayed text representing output to the console appears with double hash-tags at the beginning of each line, thus:

## R is free software and comes with ABSOLUTELY NO WARRANTY.
## You are welcome to redistribute it under certain conditions.
## Type 'license()' or 'licence()' for distribution details.

The hash-tags themselves are not present in the output itself.
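As a small illustration of these conventions taken together, a call to length() and the console output it produces would be displayed roughly as follows. The vector x is made up for this example.

```r
# A made-up numeric vector with three elements
x <- c(3, 1, 4)
length(x)
## [1] 3
```

If you type the first two lines at the console yourself, you will see only [1] 3, without the hash-tags.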

Names of R packages are in boldface, thus: package devtools.

Terms are italicized when they are first introduced, e.g.: “R follows a set of internal rules to coerce some of the values to a new type in such a way that all resulting values are of the same type.” Italicization can also indicate emphasis.
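For readers curious about the coercion mentioned in that sample sentence, here is a minimal sketch of what it looks like at the console. The vector names mixed and nums are made up for this illustration, and the examples show only two of the simplest cases.

```r
# Mixing a number, a string and a logical value:
# every element is coerced to a character string.
mixed <- c(1, "two", TRUE)
mixed
## [1] "1"    "two"  "TRUE"
typeof(mixed)
## [1] "character"

# Mixing numbers and logical values:
# TRUE is coerced to 1 and FALSE to 0.
nums <- c(2.5, TRUE, FALSE)
nums
## [1] 2.5 1.0 0.0
typeof(nums)
## [1] "double"
```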

Acknowledgements

I am greatly indebted to:

Norman Matloff and Hadley Wickham for their excellent foundational books ((Matloff 2011), (Wickham 2014)) on the R language. Garrett Grolemund, for his informal but precise expository style, as exemplified in numerous R Studio webinars and in (Grolemund 2014).

Allen Downey, for Think Python (Downey 2015). This book formed my ideas about ordering and selection of topics for computer science at an elementary level, and helped me think about teaching computer science in a way that was as independent as possible from the specific language of instruction.

Everyone at R Studio, including Hadley, Yihui Xie for R Markdown and the family of R Markdown-related packages, Joe Cheng for conceiving and pioneering shiny, Winston Chang for shiny and R6, and of course JJ Allaire for developing the IDE and calling together the remarkable constellation of developers and evangelists who have contributed so much to the R community and ecosystem.

Danny Kaplan, Nick Horton and Randall Pruim for pioneering the Mosaic Project that has enabled so many faculty to teach undergraduate statistics with R. I am especially indebted to Danny for curricular inspiration and to Randall for R-programming advice in the incipient stages of my journey as an R-developer. Nick has been a great encourager of everyone associated with the Mosaic community.

My former colleague Rebekah Robinson who, upon learning of Mosaic and R Markdown, insisted that at Georgetown College we take up the challenge of teaching elementary statistics with R.

My colleagues William Harris and Christine Leverenz, who patiently learned to teach in the R way.

My students, especially Woody Burchett, Luke Garnett, Shawn Marcom, Jacob Townsend and Andrew Giles, for work with me on various R-related research projects. Luke has gone on to become a valued colleague at Georgetown.

Georgetown College, for granting me the sabbatical time in Spring 2017 to work on these Notes, and on other programming topics prerequisite to teaching Computer Science.

My wife Mary Lou and daughters Clare, Catherine and Agnes, for patience and support. By now they have heard quite enough about programming.