7.4 Introduction to Data Frames
R is known as a domain-specific programming language, meaning that although it can in principle perform any sort of computation that a human can perform (given enough pencil, paper and time), it was originally designed to perform tasks in a particular area of application. R’s area of application is data analysis and statistics, especially when performed interactively, that is, in a setting where the analyst asks for a relatively small computation, examines the results, modifies his or her requests and asks again, and so on. Although R can be used effectively for a wide range of programming tasks, data analysis is where it really shines.
The data structures of R reflect its orientation to data analysis. We have met a data-oriented structure already—the table, which is one of many convenient ways to display the results of data analysis. For the purpose of organizing data in preparation for analysis, R provides the structure known as the data frame. A data frame facilitates the storage of related data in one location, in a form that makes the most sense to human users.
A data frame is like a matrix in that it is two-dimensional—it has rows and columns. Unlike a matrix, though, the elements of a data frame do not have to be all of the same data-type. Each column of a data frame is a vector—of the same length as all the others—but these vectors may be of different types: some numerical, some logical, etc.
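To make the contrast with a matrix concrete, here is a minimal sketch. The frame `pets` and its columns are invented for illustration; they do not come from any package discussed in this chapter.

```r
# A small, made-up data frame: three columns of different types,
# all of the same length.
pets <- data.frame(
  name   = c("Rex", "Tom", "Polly"),
  weight = c(30.5, 4.2, 0.4),
  indoor = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE   # keep name as a character vector in every R version
)

# Each column keeps its own type.
col_types <- sapply(pets, class)
print(col_types)

# A matrix can hold only one type, so converting the frame
# coerces every element to character.
pets_matrix <- as.matrix(pets)
print(typeof(pets_matrix))
```

The conversion to a matrix is shown only to highlight the difference; in practice one would usually keep mixed-type data in a data frame.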
7.4.1 Viewing a Data Frame
Let’s take a close look at a data frame: the frame m111survey, which is available from the bcscr package (White 2021). First let’s attach the package itself, with library(bcscr).
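In the R Studio IDE, we can get a look at the frame in a tab in the Editor pane if we use the View() function, as in View(m111survey).
As with many objects provided by a package, we can get more information about it by asking for its help page, for example with help(m111survey).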
From the Help we see that m111survey records the results of a survey conducted in a number of sections of an elementary statistics course at Georgetown College. From the View we see that the frame is arranged in rows and columns. Each row corresponds to what in data analysis is known as a case or an individual: here, each row goes with a student who participated in the survey. The columns correspond to variables: measurements made on each individual. For a student on a given row, the values in the columns are the values recorded for that student.
When you are not working in R Studio, there are still a couple of ways to view the frame. You could print it all out to the console by entering its name, m111survey.
You could also use the head() function to view a specified number of initial rows:
head(m111survey, n = 6) # see first six rows
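The same approach works on any data frame. As a sketch that does not depend on the bcscr package, the following uses mtcars, a frame that ships with R's built-in datasets package. The tail() and dim() calls are additions that are not discussed in the text above.

```r
# mtcars is a data frame available in every standard R session.
first_rows <- head(mtcars, n = 3)   # first three rows
last_rows  <- tail(mtcars, n = 2)   # last two rows
print(first_rows)
print(last_rows)

# dim() gives the number of rows, then the number of columns.
frame_size <- dim(mtcars)
print(frame_size)
```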
7.4.2 The Structure of a Data Frame
Further information about the frame may be obtained with the str() function, as in str(m111survey):
## 'data.frame': 71 obs. of 12 variables:
## $ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
## $ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
## $ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
## $ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
## $ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
## $ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
## $ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
## $ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
## $ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
## $ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
The concept of structure extends far beyond the domain of computer programming.[24] In general the structure of any object consists of:
- the kind of thing that the object is;
- the parts the object is made up of;
- the relationships between these parts—the rules, if you will, for how the parts work together to make the object do what it does.
In the case of m111survey, the kind of thing it is corresponds to its class, as reported by class(m111survey): it’s a data frame.
## [1] "data.frame"
Next we see the account of the parts of the object and the way in which the parts relate to one another:
## 71 obs. of 12 variables
From this we know that there are 71 individuals in the study. The data consists of 12 “parts”—the variables—which are related in the sense that they all provide information about the same set of 71 people.
After that the output of str() launches into an account of the structure of each of the parts, for example:
## $ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
We are told the kind of thing that height is: it’s a numerical vector (a vector of type double, in fact). Next we are given the beginning of a statement of its parts: the heights of the individuals. So R is actually giving us the structure of the parts, as well as of the whole.
The variable fastest refers to the fastest speed, in miles per hour, that a person has ever driven a car. Note that it is a vector of type integer. Officially this is a numerical variable, too, but R is calling attention to the fact that the fastest-speed data is being stored as integers rather than as floating-point decimals.
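The distinction between the two numerical types can be checked directly. The following is a small sketch with made-up vectors (the names `speeds_int` and `speeds_dbl` are hypothetical):

```r
speeds_int <- c(119L, 110L, 85L)   # the L suffix asks R to store integers
speeds_dbl <- c(119, 110, 85.5)    # without the suffix, R stores doubles

typeof(speeds_int)      # "integer"
typeof(speeds_dbl)      # "double"

# Both count as numeric, so arithmetic works on either one.
is.numeric(speeds_int)  # TRUE
is.numeric(speeds_dbl)  # TRUE

# The mean of an integer vector is returned as a double.
int_mean <- mean(speeds_int)
typeof(int_mean)        # "double"
```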
The variables of a data frame are typically associated with the names of the frame, which names(m111survey) returns:
##  [1] "height" "ideal_ht" "sleep" "fastest"
##  [5] "weight_feel" "love_first" "extra_life" "seat"
##  [9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
By means of the names we can isolate a vector in any column, identified in our code in the format frame$variable. For example, to see the first ten elements of the fastest variable, we can ask for m111survey$fastest[1:10]:
##  [1] 119 110 85 100 95 100 85 160 90 90
In order to compute the mean fastest speed our subjects drove their cars, we can ask for:
mean(m111survey$fastest, na.rm = TRUE)
## [1] 105.9014
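The na.rm argument matters whenever a variable contains missing values. A brief sketch with a made-up vector (the name `speeds` is hypothetical) shows the difference:

```r
speeds <- c(100, 120, NA, 90)    # one value is missing

# By default a single NA makes the mean itself NA.
mean_with_na <- mean(speeds)
print(mean_with_na)

# With na.rm = TRUE the missing value is dropped before averaging.
mean_without_na <- mean(speeds, na.rm = TRUE)
print(mean_without_na)
```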
If you want to see the speeds that are at least 150 miles per hour, you could ask for:
m111survey$fastest[m111survey$fastest >= 150]
## [1] 160 190
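A selection like this one works in two steps: the comparison produces a logical vector, and that logical vector then picks out the matching elements. Here is a sketch of the two steps with a short made-up vector (the names are hypothetical):

```r
speeds <- c(119, 110, 85, 160, 190)

# Step 1: the comparison yields one TRUE or FALSE per element.
at_least_150 <- speeds >= 150
print(at_least_150)    # FALSE FALSE FALSE TRUE TRUE

# Step 2: indexing with the logical vector keeps the TRUE positions.
fast_speeds <- speeds[at_least_150]
print(fast_speeds)     # 160 190
```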
If you worry that the form frame$variable will require an annoying amount of typing, as seems to be the case in the example above, then you can use the with() function:
with(m111survey, fastest[fastest >= 150])
## [1] 160 190
It’s instructive to consider how with() works. If we were to include the names of the parameters of with() explicitly, then the call would have looked like this:
with(data = m111survey, expr = fastest[fastest >= 150])
For the data parameter we can supply a data frame or any other R-object that can be used to construct an environment. In this case m111survey provides a miniature environment consisting of the names of its variables. For the expr parameter we supply an expression for R to evaluate. As R evaluates the expression, it encounters names (such as fastest). Now ordinarily R would first search whatever counts as the active environment (in this case, the Global Environment) for the names in the expression, but with() forces R to look first within the environment created by the data argument. In our example, R finds fastest within m111survey and evaluates the expression on that basis. If it had not found fastest there, R would have moved on to the Global Environment and then the rest of the usual search path and (probably) would have found nothing, causing it to throw an “object not found” error message. In R, as in any other programming language, good programming depends very much on paying attention to how the language searches for the objects to which names refer.
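The search order described here can be observed in a small experiment. The sketch below uses a made-up frame `drivers` and two ordinary variables, `fastest` and `cutoff`; all of these names are hypothetical and chosen only to illustrate the lookup.

```r
fastest <- c(1, 2, 3)     # an ordinary variable that shares its name with a column
cutoff  <- 150            # an ordinary variable with no matching column
drivers <- data.frame(fastest = c(119, 160, 190))

# Inside with(), fastest is found in drivers first, so the column is used.
# cutoff is not a column, so R falls back on the surrounding environment.
result <- with(drivers, fastest[fastest >= cutoff])
print(result)             # 160 190

# A name that exists nowhere leads to an error, which is caught here.
lookup <- tryCatch(
  with(drivers, no_such_variable_anywhere),
  error = function(e) "lookup failed"
)
print(lookup)             # "lookup failed"
```

Because the column takes precedence over the ordinary variable of the same name, it is worth avoiding such name clashes in real work.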
Some of the variables in m111survey are called factors; an example is seat, which pertains to where one prefers to sit in a classroom:
## Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
Seating preference is an example of a categorical variable: one whose values are not meaningfully expressed in terms of numbers. When a categorical variable has a relatively small number of possible values, it can be convenient to store its values in a vector of class factor.
The levels of a factor variable are its possible values. In the case of seat, these are: Front, Middle and Back. As a memory-saving measure, R stores the values in the factor as numbers, where 1 stands for the first level, 2 for the second level, and so on. But please bear in mind that we are dealing with a categorical variable, so the numbers don’t relate to the possible values in any natural way: they are just storage conventions.
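The numeric storage behind a factor can be inspected directly. The following sketch builds a small made-up factor in the style of seat; the third label, "3_back", is an assumption, since the output shown earlier elides it.

```r
# A made-up factor modeled on seat ("3_back" is an assumed label).
seat_pref <- factor(c("1_front", "2_middle", "2_middle", "3_back", "1_front"))

# The levels are the possible values.
seat_levels <- levels(seat_pref)
print(seat_levels)     # "1_front" "2_middle" "3_back"

# The stored codes say which level each element has.
seat_codes <- as.integer(seat_pref)
print(seat_codes)      # 1 2 2 3 1

# Looking the codes up in the levels recovers the original labels.
recovered <- seat_levels[seat_codes]
print(recovered)
```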
It’s possible to create a factor from any type of vector, but most often this is done with a character vector. Suppose, for instance, that eight people are asked for their favorite Wizard of Oz character and they answer:
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")
We can create a factor variable as follows:
factorFavs <- factor(ozFavs)
factorFavs
## [1] Glinda Toto Toto Dorothy Toto Glinda Scarecrow Dorothy
## Levels: Dorothy Glinda Scarecrow Toto
Note that the levels are given in alphabetical order: this is the default procedure when R creates a factor. It is possible to ask for a different order, though:
factor(ozFavs, levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
## [1] Glinda Toto Toto Dorothy Toto Glinda Scarecrow Dorothy
## Levels: Toto Scarecrow Glinda Dorothy
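One place where the order of the levels shows up is in summaries. As a sketch, the table() function (not otherwise discussed in this section) counts the values of a factor and reports them in level order. The names `defaultOrder` and `customOrder` are hypothetical.

```r
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

defaultOrder <- factor(ozFavs)   # levels in alphabetical order
customOrder  <- factor(ozFavs,
                       levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))

# The counts are the same, but they are listed in level order.
defaultCounts <- table(defaultOrder)
customCounts  <- table(customOrder)
print(defaultCounts)
print(customCounts)
```

The underlying data are identical in both factors; only the ordering of the levels, and hence of the reported counts, differs.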
In many instances it is appropriate to convert a character vector to a factor, but sometimes this is not such a great idea. Consider something like your address, or your favorite inspirational quote: pretty much every person in a study will have a different address or favorite quote than others in the study. Hence there won’t be any memory-storage benefit associated with creating a factor: the vector of levels—itself a character vector—would require as much storage space as the original character vector itself! In addition, we will see that the status of a variable as class “factor” can affect how R’s statistical and graphical functions deal with it. It’s not a good idea to treat a categorical variable as a factor unless its set of possible values is considered important.
We will think more about how to deal with factor variables later on, when we begin data analysis in earnest.
7.4.4 Practice Exercises
1. How would you learn more about the data frame RailTrail from the mosaicData package?
2. Write a one-line command to see the first 10 rows of RailTrail in the Console.
3. Write a one-line command to get the names of all of the variables in RailTrail.
4. For RailTrail, write a one-line command to get the high temperature on all the days when the precipitation was more than 0.5 inches.
5. For RailTrail, write a one-line command to sort the average temperatures from highest to lowest.
7.4.5 Solutions to the Practice Exercises
1. One way is to attach the package, then ask for help, for example with library(mosaicData) followed by help(RailTrail). Another way is to refer to the data frame through the package, with double colons, as in ?mosaicData::RailTrail. That way you don’t have to add all the items in mosaicData to your search path.
2. Here’s one way:
head(mosaicData::RailTrail, n = 10)
3. Here’s one way:
names(mosaicData::RailTrail)
4. Here’s one way:
with(mosaicData::RailTrail, hightemp[precip > 0.5])
5. Here’s one way:
sort(mosaicData::RailTrail$avgtemp, decreasing = TRUE)
[24] As an example outside of programming, consider what happens when you read a piece of literature “for structure.” You begin by asking: “What kind of literature is this? Is it drama, a novel, or something else?” The answer lets you know what to expect as you read: if it’s a novel, you know to suspend disbelief, whereas if it’s a journalistic piece then you know to examine critically whatever it presents as fact. Next, you might outline the piece. When you make an outline, you are breaking the piece up into parts, and indicating how the parts relate to each other to advance the plot and/or message of the piece. Note that in the process of “reading for structure” you are following the pattern of the definition of structure offered above.