13.3 Reading Text Files with readr

words.txt was unusually simple—just one piece of data per line. It made sense to read it into a vector. Most of the files that we encounter have a more complex structure. In many cases it is best to incorporate them into R as data frames.

R has several tools for reading data into data frames. One of the most convenient, and most popular, is the set of functions in the readr package (Wickham and Hester 2020). It’s attached automatically when you load the tidyverse, but you can also attach it separately:

library(readr)

Let’s practice reading in a file from the Internet. The URL is:

https://query.data.world/s/b6plbxp3ym20s5a5iey36geul

The file is taken from data.world, a site very much worth exploring, and it addresses part of a study on alcohol consumption and life expectancy in the nations of the world. (The file extension .csv indicates that the file is expected to use commas to separate data values.)
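To make the comma-separated layout concrete, here is a minimal sketch that writes a tiny CSV file to a temporary location and reads it back with read_csv(). The names csv_path and toy are made up for illustration, and the two data rows simply reuse values that are printed later in this section; this is not the actual data.world file.

```r
library(readr)

# A temporary file standing in for a downloaded .csv file (hypothetical example).
csv_path <- tempfile(fileext = ".csv")

# The first line holds the column names; each later line is one row,
# with commas separating the values.
writeLines(
  c("country,liters,gdp",
    "Afghanistan,0,20842",
    "Albania,4.9,13370"),
  csv_path
)

# read_csv() guesses a type for each column from the values it finds.
toy <- read_csv(csv_path)
toy
```

In this toy file, country should come in as a character column and the other two as numeric columns.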

The easiest way to learn the functions of readr is to use a widget supplied by the RStudio IDE:

  • In the Environment Pane, find the drop-down menu Import Dataset.
  • Ask to Import CSV ….
  • You will be taken to a dialog box. Enter the above URL in the File/URL field and press the Update button.
  • A preview of the data is shown.
  • At the bottom left there are a number of Import Options. The most important one at this point is the Name. Instead of the messy name that is shown, type something descriptive. (We’ll choose alcGDP.)
  • Note the Code Preview box at the bottom right. It contains the R commands needed to download the text file and read it into R as a data frame named alcGDP.
  • You should copy the code and save it somewhere (in an R script or R Markdown document) in order to keep a record of your work.
  • You may then press the Import button.

The data is read into alcGDP according to the read_csv() call below:

alcGDP <- read_csv("https://query.data.world/s/b6plbxp3ym20s5a5iey36geul")

If you pressed the Import button, then you can see alcGDP in the Editor window. Notice that it has five variables. Two of them have only NA values, and the final two have names that are simply too long to be practical. Go ahead and fix this:

names(alcGDP)[c(4,5)] <- c("liters", "gdp")
alcGDP$YearDisplay <- NULL
alcGDP$SexDisplay <- NULL
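
The same clean-up can be rehearsed on a small stand-in before it is run on the real data. The sketch below uses a plain data frame with invented values; the long names in columns 4 and 5 are hypothetical placeholders, not the names used in the data.world file.

```r
# A toy data frame with five columns, mimicking the layout described above.
# The names of columns 4 and 5 are invented placeholders for the long originals.
toy <- data.frame(
  country = c("A", "B"),
  YearDisplay = c(NA, NA),
  SexDisplay = c(NA, NA),
  a_very_long_name_for_alcohol_consumption = c(1.5, 2.5),
  a_very_long_name_for_gdp = c(100, 200)
)

# Replace the names of columns 4 and 5, selecting them by position.
names(toy)[c(4, 5)] <- c("liters", "gdp")

# Assigning NULL to a column removes it from the data frame.
toy$YearDisplay <- NULL
toy$SexDisplay <- NULL

names(toy)
```

Only country, liters and gdp remain. Because the renaming selects columns by position, it has to run before any column is deleted, or positions 4 and 5 would no longer exist.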

Now take a look at the frame:

head(alcGDP)
## # A tibble: 6 x 3
##   country     liters    gdp
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    0    20842
## 2 Albania        4.9  13370
## 3 Azerbaijan     1.3  75198
## 4 Madagascar     0.8  10593
## 5 Malawi         1.5   4258
## 6 Malaysia       0.3 326933

liters gives the mean total liters of alcohol consumed per person in each country, and gdp gives the country’s Gross Domestic Product in millions of dollars.

From the Console output you can tell that alcGDP is a tibble rather than a plain data frame. (By default, read_csv() imports data as a tibble.)
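
One way to confirm this without relying on the printed header is to inspect the object's class. The sketch below works on a toy file rather than the alcohol data, and the names csv_path, toy and plain are made up for illustration. A tibble is a special kind of data frame, so is.data.frame() also reports TRUE for it.

```r
library(readr)

# Write a two-row CSV file to a temporary location (toy data for illustration).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,liters", "Albania,4.9", "Malawi,1.5"), csv_path)

toy <- read_csv(csv_path)

# The class vector of a tibble includes "tbl_df" as well as "data.frame".
class(toy)
is_tibble <- inherits(toy, "tbl_df")
is_df <- is.data.frame(toy)

# as.data.frame() drops the tibble classes and leaves a plain data frame.
plain <- as.data.frame(toy)
class(plain)
```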

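We won’t analyze alcGDP here, but you might want to look at it later. You can save it permanently with the save() function. The path below is relative to your working directory, so it assumes that your working directory is your Home directory and that it contains a folder named downloads: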

save(alcGDP, file = "downloads/alcGDP.rda")

alcGDP is still in your Global Environment. If you clear the environment later on, you can reload alcGDP as follows:

load("downloads/alcGDP.rda")
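
The save-and-reload cycle can be rehearsed safely on a throwaway object. In this sketch the object toy and the file name toy.rda are made up, and the file goes to R's temporary directory rather than to downloads. Note that save() does not create missing folders, so the target directory has to exist before the call.

```r
# A small object to save (hypothetical stand-in for alcGDP).
toy <- data.frame(country = c("Albania", "Malawi"), liters = c(4.9, 1.5))
original <- toy

# Save to a file in the session's temporary directory, which already exists.
rda_path <- file.path(tempdir(), "toy.rda")
save(toy, file = rda_path)

# Remove the object, as clearing the Global Environment would.
rm(toy)

# load() restores the object under the name it had when it was saved
# and (invisibly) returns the names of everything it restored.
restored_names <- load(rda_path)
restored_names

# The reloaded object matches the one that was saved.
same <- identical(toy, original)
same
```

Because load() brings objects back under their saved names, it silently overwrites any object of the same name that is already in the environment.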