13.3 Reading Text Files with readr
words.txt was unusually simple—just one piece of data per line. It made sense to read it into a vector. Most of the files that we encounter have a more complex structure. In many cases it is best to incorporate them into R as data frames.
R has several tools for reading data into data frames. One of the most convenient, and most popular, is the set of functions in the readr package (Wickham and Hester 2020). It’s attached automatically when you load the tidyverse, but you can also attach it separately:
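Attaching readr on its own takes a single call. This minimal sketch assumes the package has already been installed, for example with install.packages("readr").

```r
# Attach readr by itself (assumes the package is already installed).
library(readr)
```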
Let’s practice reading in a file from the Internet. The URL is:
The file is taken from data.world, a site very much worth exploring, and it contains part of a study on alcohol consumption and life expectancy in the nations of the world. (The file extension .csv indicates that the file is expected to use commas to separate data values.)
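To see what comma-separated data looks like, here is a minimal sketch that parses a few lines of CSV text directly. The column names and values are invented for illustration and are not taken from the file described above.

```r
library(readr)

# Made-up CSV text: a header line, then one line per row,
# with commas separating the values on each line.
csv_text <- "name,score\nAda,90.5\nGrace,88\n"

# I() marks the string as literal data rather than a file path.
# col_types = "cd" asks for a character column and a double column.
toy <- read_csv(I(csv_text), col_types = "cd")
toy
```

In ordinary use the first argument to read_csv() is a file path or a URL rather than literal text.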
The easiest way to learn the functions of readr is to use a widget supplied by the RStudio IDE:
- In the Environment pane, find the drop-down menu Import Dataset.
- Ask to Import CSV ….
- You will be taken to a dialog box. Enter the above URL in the File/URL field and press the Update button.
- A preview of the data is shown.
- At the bottom left there are a number of Import Options. The most important one at this point is the Name. Instead of the messy name that is shown, type something descriptive. (We’ll choose alcGDP.)
- Note the Code Preview box at the bottom right. It contains the R commands needed to download and read the text file into R as a data frame.
- You should copy the code and save it somewhere (in an R script or R Markdown document) in order to keep a record of your work.
- You may then press the Import button.
The data is read into alcGDP by the read_csv() call shown in the Code Preview box.
If you pressed the Import button, then you can see alcGDP in the Editor window. Notice that it has five variables. Two of them have only NA values, and the final two have names that are simply too long to be practical. Go ahead and fix this:
names(alcGDP)[c(4,5)] <- c("liters", "gdp")
alcGDP$YearDisplay <- NULL
alcGDP$SexDisplay <- NULL
Now take a look at the frame:
## # A tibble: 6 x 3
##   country     liters    gdp
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    0    20842
## 2 Albania        4.9  13370
## 3 Azerbaijan     1.3  75198
## 4 Madagascar     0.8  10593
## 5 Malawi         1.5   4258
## 6 Malaysia       0.3 326933
liters gives the mean total liters of alcohol consumed per person in each country, and gdp is the country’s Gross Domestic Product in millions of dollars.
From the Console output you can tell that alcGDP is a tibble rather than a plain data frame. (The default behavior for read_csv() is to import data as tibbles.)
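A tibble is a special kind of data frame, so code that expects a data frame generally still works with it. The sketch below checks this on a small made-up tibble (the values are invented) and converts it to a plain data frame with as.data.frame().

```r
library(readr)

# Small made-up data, parsed the same way read_csv() parses a file.
toy <- read_csv(I("country,liters\nA,1.5\nB,0.3\n"), col_types = "cd")

class(toy)          # the class vector includes "tbl_df" and "data.frame"
is.data.frame(toy)  # a tibble also counts as a data frame

# Convert to a plain data frame if some function insists on one.
plain <- as.data.frame(toy)
class(plain)
```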
We won’t analyze alcGDP here, but you might want to look at it later. You can save it permanently to your Home directory with the save() function:
save(alcGDP, file = "downloads/alcGDP.rda")
alcGDP is still in your Global Environment. If you clear the environment later on, you can reload alcGDP as follows:
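Reloading is done with load(), which restores saved objects under their original names. For the file saved above, the corresponding call would be load("downloads/alcGDP.rda"). The sketch below shows the full round trip on a stand-in object; the names toy and rda_path, and the temporary file location, are assumptions made for illustration.

```r
# Stand-in object; in the text this would be alcGDP.
toy <- data.frame(country = c("A", "B"), liters = c(1.5, 0.3))

# Hypothetical file location inside R's temporary directory.
rda_path <- file.path(tempdir(), "toy.rda")
save(toy, file = rda_path)

rm(toy)         # simulate clearing the environment
load(rda_path)  # restores the object under its original name, toy
toy
```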