7.4 Introduction to Data Frames

R is known as a domain-specific programming language, meaning that although it can in principle perform any sort of computation that a human can perform (given enough pencil, paper and time), it was originally designed to perform tasks in a particular area of application. R’s area of application is data analysis and statistics, especially when performed interactively, that is, in a setting where the analyst asks for a relatively small computation, examines the results, modifies his or her requests and asks again, and so on.[1] Although R can be used effectively for a wide range of programming tasks, data analysis is where it really shines.

The data structures of R reflect its orientation to data analysis. We have met a data-oriented structure already—the table, which is one of many convenient ways to display the results of data analysis. For the purpose of organizing data in preparation for analysis, R provides the structure known as the data frame. A data frame facilitates the storage of related data in one location, in a form that makes the most sense to human users.

A data frame is like a matrix in that it is two-dimensional—it has rows and columns. Unlike a matrix, though, the elements of a data frame do not have to be all of the same data-type. Each column of a data frame is a vector—of the same length as all the others—but these vectors may be of different types: some numerical, some logical, etc.
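To make the contrast with a matrix concrete, here is a minimal sketch that builds a toy data frame by hand. The frame pets and its column names are invented for this illustration and do not come from any package.

```r
# A toy data frame (hypothetical data): three columns of three different types
pets <- data.frame(
  name   = c("Rex", "Tom", "Polly"),   # character
  weight = c(30.5, 4.2, 0.4),          # numeric (double)
  indoor = c(FALSE, TRUE, TRUE),       # logical
  stringsAsFactors = FALSE             # keep name as character in older R versions
)

# Each column keeps its own type
sapply(pets, class)

# Forcing the same data into a matrix coerces everything to one type
m <- as.matrix(pets)
class(m[, "weight"])   # "character": the numbers have been turned into strings
```

The coercion in the last step is the reason a data frame, rather than a matrix, is the usual container for a mix of numerical and non-numerical variables.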

7.4.1 Viewing a Data Frame

Let’s take a close look at a data frame: the frame m111survey, which is available from the bcscr package (White 2018a). First let’s attach the package itself:

library(bcscr)

In the RStudio IDE, we can get a look at the frame in a tab in the Editor pane if we use the View() function:

View(m111survey)

As with many objects provided by a package, we can get more information about it:

help("m111survey")

From the Help we see that m111survey records the results of a survey conducted in a number of sections of an elementary statistics course at Georgetown College. From the View we see that the frame is arranged in rows and columns. Each row corresponds to what in data analysis is known as a case or an individual: here, each row goes with a student who participated in the survey. The columns correspond to variables: measurements made on each individual. For a student on a given row, the values in the columns are the values recorded for that student.

When you are not working in RStudio, there are still a couple of ways to view the frame. You could print it all out to the console:

m111survey

You could also use the head() function to view a specified number of initial rows:

head(m111survey, n = 6)  # see first six rows

7.4.2 The Structure of a Data Frame

Further information about the frame may be obtained with the str() function:

str(m111survey)
## 'data.frame':    71 obs. of  12 variables:
##  $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
##  $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
##  $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
##  $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
##  $ weight_feel    : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
##  $ love_first     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ extra_life     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
##  $ seat           : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
##  $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
##  $ enough_Sleep   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
##  $ sex            : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
##  $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...

The concept of structure extends far beyond the domain of computer programming.[2] In general, the structure of any object consists of:

  • the kind of thing that the object is;
  • the parts that the object is made up of;
  • the relationships between these parts—the rules, if you will, for how the parts work together to make the object do what it does.

In the case of m111survey, the kind of thing is given by its class: it’s a data frame.

class(m111survey)
## [1] "data.frame"

Next we see the account of the parts of the object and the way in which the parts relate to one another:

## 71 obs. of  12 variables

From this we know that there are 71 individuals in the study. The data consists of 12 “parts”—the variables—which are related in the sense that they all provide information about the same set of 71 people.
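The two counts that str() reports can also be retrieved directly. The following is a minimal sketch on a small made-up frame (scores is a hypothetical name, not part of the text's data), so that it runs without any package installed.

```r
# Hypothetical frame: 4 cases (rows), 2 variables (columns)
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  points  = c(88, 92, 75, 60)
)

nrow(scores)    # number of cases
ncol(scores)    # number of variables
dim(scores)     # both at once: rows first, then columns
length(scores)  # the length of a data frame is its number of columns
```

With bcscr attached, the same calls on m111survey should agree with the 71 observations and 12 variables shown in the str() output.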

After that the output of str() launches into an account of the structure of each of the parts, for example:

## $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...

We are told the kind of thing that height is: it’s a numerical vector (a vector of type double, in fact). Next we are given the beginning of a statement of its parts: the heights of the individuals. So R is actually giving us the structure of the parts, as well as of the whole m111survey.

The variable fastest refers to the fastest speed—in miles per hour—that a person has ever driven a car. Note that it is a vector of type integer. Officially this is a numerical variable, too, but R is calling attention to the fact that the fastest-speed data is being stored as integers rather than as floating-point decimals.
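The distinction between integer and double storage can be checked with typeof(). Here is a short sketch using made-up vectors (speeds and heights are illustrative names only):

```r
speeds  <- c(119L, 110L, 85L)   # the L suffix asks R to store integers
heights <- c(76, 74, 70.8)      # ordinary numbers are stored as doubles

typeof(speeds)       # "integer"
typeof(heights)      # "double"

# Both count as numeric, so arithmetic functions accept either one
is.numeric(speeds)   # TRUE
is.numeric(heights)  # TRUE
typeof(mean(speeds)) # "double": the mean of integers need not be an integer
```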

The variables of a data frame are typically associated with the names of the frame:

names(m111survey)
##  [1] "height"          "ideal_ht"        "sleep"          
##  [4] "fastest"         "weight_feel"     "love_first"     
##  [7] "extra_life"      "seat"            "GPA"            
## [10] "enough_Sleep"    "sex"             "diff.ideal.act."

By means of the names we can isolate a vector in any column, identified in our code in the format frame$variable. For example, to see the first ten elements of the fastest variable, we ask for:

m111survey$fastest[1:10]
##  [1] 119 110  85 100  95 100  85 160  90  90

In order to compute the mean of the fastest speeds at which our subjects have driven their cars, we can ask for:

mean(m111survey$fastest, na.rm = TRUE)
## [1] 105.9014
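The na.rm = TRUE argument tells mean() to drop missing values before averaging. A small sketch with invented values (sleep_hours is a hypothetical vector) shows what happens without it:

```r
sleep_hours <- c(7, 8, NA, 6)     # hypothetical values, one of them missing

mean(sleep_hours)                 # NA: one missing value makes the mean unknown
mean(sleep_hours, na.rm = TRUE)   # 7: the mean of the three recorded values
```

Supplying na.rm = TRUE is harmless when a variable has no missing values, so it is a reasonable habit for survey data such as this.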

If you want to see the speeds that are at least 150 miles per hour, you could ask for:

m111survey$fastest[m111survey$fastest >= 150]
## [1] 160 190
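A caveat about this style of subsetting: if the variable contains missing values, the comparison yields NA at those positions, and the NA values are carried into the result. The sketch below uses a made-up vector to show the effect, along with which() as one common way to avoid it.

```r
# Hypothetical speeds, one of them missing
speeds <- c(119, 160, NA, 95, 190)

speeds[speeds >= 150]          # 160  NA 190: the missing value comes along
speeds[which(speeds >= 150)]   # 160 190: which() keeps only the TRUE positions
```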

If you worry that the form frame$variable will require an annoying amount of typing (as seems to be the case in the example above), then you can use the with() function:

with(m111survey, fastest[fastest >= 150])
## [1] 160 190

It’s instructive to consider how with() works. If we were to include the names of the parameters of with() explicitly, then the call would have looked like this:

with(data = m111survey, expr = fastest[fastest >= 150])

For the data parameter we can supply a data frame or any other R-object that can be used to construct an environment. In this case m111survey provides a miniature environment consisting of the names of its variables. For the expr parameter we supply an expression for R to evaluate. As R evaluates the expression, it encounters names (such as fastest). Now ordinarily R would first search whatever counts as the active environment (in this case, the Global Environment) for the names in the expression, but with() forces R to look first within the environment created by the data argument. In our example, R finds fastest inside m111survey and evaluates the expression on that basis. If it had not found fastest in m111survey, R would have moved on to the Global Environment and then the rest of the usual search path and (probably) would have found nothing, causing it to throw an “object not found” error message. In R, as in any other programming language, good programming depends very much on paying attention to how the language searches for the objects to which names refer.
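The search order described above can be observed directly. In this sketch the frame toy and the variables x and y are invented for the purpose, and the code is assumed to be run at the top level of an R session.

```r
# Hypothetical data: a frame with a column named x, plus a global also named x
toy <- data.frame(x = c(1, 2, 3))
x <- c(100, 200)   # lives in the Global Environment
y <- 10            # also global; toy has no column named y

with(toy, sum(x))      # 6: x is found inside toy first, so the global x is ignored
with(toy, sum(x) + y)  # 16: y is not in toy, so R falls back to the global y
```

The second call works only because a y happens to exist outside the frame, which is exactly the kind of silent fallback that makes name lookup worth understanding.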

7.4.3 Factors

Some of the variables in m111survey are called factors; an example is seat, which pertains to where one prefers to sit in a classroom:

str(m111survey$seat)
##  Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...

Seating preference is an example of a categorical variable: one whose values are not meaningfully expressed in terms of numbers. When a categorical variable has a relatively small number of possible values, it can be convenient to store its values in a vector of class factor.

The levels of a factor variable are its possible values. In the case of seat, these are: Front, Middle and Back (labeled "1_front", "2_middle", and so on in the str() output). As a memory-saving measure, R stores the values in the factor as integers, where 1 stands for the first level, 2 for the second level, and so on. But please bear in mind that we are dealing with a categorical variable, so the numbers don’t relate to the possible values in any natural way: they are just storage conventions.
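The relationship between the levels and the stored numbers can be inspected with levels() and as.integer(). This is a minimal sketch on a made-up factor, not the seat variable from m111survey:

```r
# Hypothetical seating preferences
seat <- factor(c("front", "back", "middle", "front"))

levels(seat)                     # "back" "front" "middle"
as.integer(seat)                 # 2 1 3 2: each code is a position in levels(seat)
levels(seat)[as.integer(seat)]   # recovers the original labels
typeof(seat)                     # "integer": how the values are stored
class(seat)                      # "factor": how R treats them
```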

It’s possible to create a factor from any type of vector, but most often this is done with a character vector. Suppose for instance, that eight people are asked for their favorite Wizard of Oz character and they answer:

ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

We can create a factor variable as follows:

factorFavs <- factor(ozFavs)
factorFavs
## [1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
## [8] Dorothy  
## Levels: Dorothy Glinda Scarecrow Toto

Note that the levels are given in alphabetical order: this is the default procedure when R creates a factor. It is possible to ask for a different order, though:

factor(ozFavs, levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
## [1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
## [8] Dorothy  
## Levels: Toto Scarecrow Glinda Dorothy

In many instances it is appropriate to convert a character vector to a factor, but sometimes this is not such a great idea. Consider something like your address, or your favorite inspirational quote: pretty much every person in a study will have a different address or favorite quote than others in the study. Hence there won’t be any memory-storage benefit associated with creating a factor: the vector of levels—itself a character vector—would require as much storage space as the original character vector itself! In addition, we will see that the status of a variable as class “factor” can affect how R’s statistical and graphical functions deal with it. It’s not a good idea to treat a categorical variable as a factor unless its set of possible values is considered important.
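As a small preview of how class "factor" changes what R's functions do, the sketch below compares summary() on a character vector and on a factor. The vector answers and the extra level "unsure" are invented for the illustration.

```r
answers <- c("yes", "no", "yes", "yes")   # hypothetical survey responses

summary(answers)           # character vector: only length, class and mode
summary(factor(answers))   # factor: a count for each level

# A factor remembers its levels even when no case takes that value
f <- factor(answers, levels = c("no", "yes", "unsure"))
table(f)                   # includes unsure, with a count of 0
```

The zero count for an unused level is useful when the full set of possible values matters, and it is merely clutter when it does not.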

We will think more about how to deal with factor variables later on, when we begin data analysis in earnest.

7.4.4 Practice Exercises

  1. How would you learn more about the data frame RailTrail from the mosaicData package?

  2. Write a one-line command to see the first 10 rows of RailTrail in the Console.

  3. Write a one-line command to get the names of all of the variables in mosaicData::RailTrail.

  4. Regarding RailTrail: write a one-line command to get the high temperature on all the days when the precipitation was more than 0.5 inches.

  5. Regarding RailTrail: write a one-line command to sort the average temperatures from highest to lowest.

7.4.5 Solutions to the Practice Exercises

  1. One way is to attach the package, then ask for help:

    library(mosaicData)
    help(RailTrail)

    Another way is to refer to the data frame through the package, with double colons:

    ?mosaicData::RailTrail

    That way you don’t have to add all the items in mosaicData to your search path.

  2. Here’s one way:

    head(mosaicData::RailTrail, n = 10)

  3. It’s names(mosaicData::RailTrail).

  4. Here’s one way:

    with(mosaicData::RailTrail, hightemp[precip > 0.5])

  5. Try this:

    sort(mosaicData::RailTrail$avgtemp, decreasing = TRUE)

References

White, Homer. 2018a. Bcscr: Beginning Computer Science with R. https://github.com/homerhanumat/bcscr.


  1. Domain-specific languages (DSLs for short) stand in contrast to general-purpose programming languages that were designed to solve a wide variety of problems. Examples of important general-purpose languages include C and C++, Java, Python and Ruby. Although R is by now one of the most widely used DSLs in the world, there are a number of other important ones, including Matlab, Octave and Julia for scientific computing, Emacs Lisp for the renowned Emacs editor, and SQL for querying databases. JavaScript is an interesting case: it started out as a DSL for web browsers, but has since expanded to power many web applications and is now being used to develop desktop applications as well.

  2. As an example outside of programming, consider what happens when you read a piece of literature “for structure.” You begin by asking: “What kind of literature is this? Is it drama, a novel, or something else?” The answer lets you know what to expect as you read: if it’s a novel, you know to suspend disbelief, whereas if it’s a journalistic piece then you know to examine critically whatever it presents as fact. Next, you might outline the piece. When you make an outline, you are breaking the piece up into parts, and indicating how the parts relate to each other to advance the plot and/or message of the piece. Note that in the process of “reading for structure” you are following the pattern of the definition of structure offered above.