7.8 New Variables from Old
Quite often you will want to transform one or more variables in a data frame. Transforming a variable means changing its values in a systematic way.
For example, you might want to measure height in feet rather than inches. Then you want the following:
heightInFeet <- with(m111survey, height/12) # 12 inches in a foot
If you plan to use this new variable in your analysis later on, it might be a good idea to add it to the data frame:
m111survey$height_ft <- heightInFeet
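The same pattern can be sketched on a self-contained example. The data frame `toy` below is made up purely for illustration and is not part of m111survey:

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(height = c(60, 66, 72))  # heights in inches

# arithmetic on a column is vectorized: every value is divided by 12
toy$height_ft <- toy$height / 12

toy$height_ft
## [1] 5.0 5.5 6.0
```

Because the new column is computed from the old one in a single vectorized step, no loop is needed.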
Another common need is to recode the values of a categorical variable. For example, you might want to divide people into two groups: those who prefer to sit in the back and those who don’t. This is a good time to use ifelse():
seat2 <- ifelse(m111survey$seat == "3_back", "Back", "Other")
m111survey$seat2 <- seat2
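To see what ifelse() returns, here is a minimal sketch on a made-up factor that stands in for m111survey$seat (the values are invented for illustration):

```r
# seat is a hypothetical factor standing in for m111survey$seat
seat <- factor(c("1_front", "3_back", "2_middle", "3_back"))

# the test is evaluated element by element
seat2 <- ifelse(seat == "3_back", "Back", "Other")

seat2
## [1] "Other" "Back"  "Other" "Back"
class(seat2)
## [1] "character"
```

Note that the result here is a character vector, not a factor. If a factor is needed, wrap the result in factor().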
If you plan to re-code into a variable that involves more than two values, then you might want to look into the mapvalues() function from the plyr package (Wickham 2020):
seat3 <- plyr::mapvalues(m111survey$seat,
                         from = c("1_front", "2_middle", "3_back"),
                         to = c("Front", "Middle", "Back"))
str(seat3)
## Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...
The do-it-yourself approach is to write a loop. Remember switch()?
seat <- m111survey$seat
seat3 <- character(length(seat)) # this will be the recoded variable
for ( i in 1:length(seat) ) {
  seat3[i] <- switch(as.character(seat[i]),
                     "1_front" = "Front",
                     "2_middle" = "Middle",
                     "3_back" = "Back")
}
str(seat3)
## chr [1:71] "Front" "Middle" "Middle" "Front" "Back" "Front" "Front" "Back" "Back" ...
The re-coding is done but the result is a character vector and not a factor. We have to make it a factor ourselves:
m111survey$seat3 <- factor(seat3, levels = c("Front", "Middle", "Back"))
This seems like a lot of work!
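For a one-to-one renaming of levels, base R's factor() can also do the job in a single call, because its labels argument renames the values listed in levels, position by position. This is an alternative to the loop, not the method used above. A minimal sketch, using a made-up vector in place of m111survey$seat:

```r
# seat is a hypothetical factor standing in for m111survey$seat
seat <- factor(c("1_front", "2_middle", "2_middle", "3_back"))

# each entry of labels replaces the entry of levels in the same position
seat3 <- factor(seat,
                levels = c("1_front", "2_middle", "3_back"),
                labels = c("Front", "Middle", "Back"))

levels(seat3)
## [1] "Front"  "Middle" "Back"
as.character(seat3)
## [1] "Front"  "Middle" "Middle" "Back"
```

The result is already a factor with the levels in the stated order, so no separate conversion step is required.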
Another common transformation involves turning a numerical variable into a factor. For example, we might need to classify people as:
- Tall (height over 70 inches)
- Medium (over 65 inches, up to and including 70 inches)
- Short (65 inches or less)
The cut() function will be helpful.
heightClass <- cut(m111survey$height,
                   breaks = c(-Inf, 65, 70, Inf),
                   labels = c("Short", "Medium","Tall"),
                   right = TRUE)
str(heightClass)
## Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 2 3 1 2 ...
Setting right = TRUE indicates that the upper bound of each interval is included in the interval. Thus, a person with a height of 70 inches is classed as Medium, not Tall.
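The effect of the right argument is easiest to see on values that sit exactly on the breaks. The sketch below uses a few made-up heights and compares the two settings:

```r
heights <- c(64, 65, 70, 71)  # made-up values on or near the breaks

# right = TRUE: intervals are (-Inf, 65], (65, 70], (70, Inf)
closedRight <- cut(heights,
                   breaks = c(-Inf, 65, 70, Inf),
                   labels = c("Short", "Medium", "Tall"),
                   right = TRUE)
as.character(closedRight)
## [1] "Short"  "Short"  "Medium" "Tall"

# right = FALSE: intervals are [-Inf, 65), [65, 70), [70, Inf)
closedLeft <- cut(heights,
                  breaks = c(-Inf, 65, 70, Inf),
                  labels = c("Short", "Medium", "Tall"),
                  right = FALSE)
as.character(closedLeft)
## [1] "Short"  "Medium" "Tall"   "Tall"
```

Only the boundary values 65 and 70 change class between the two calls, so the choice matters whenever a category is described with "at least" or "at most".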
7.8.1 Getting Rid of Variables
We have added several variables to m111survey. In order to remove them (or any other variables we don’t want) we can assign them the value NULL.
names(m111survey)
## [1] "height" "ideal_ht" "sleep" "fastest"
## [5] "weight_feel" "love_first" "extra_life" "seat"
## [9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
## [13] "height_ft" "seat2" "seat3"
m111survey$height_ft <- NULL
m111survey$seat2 <- NULL
m111survey$seat3 <- NULL
names(m111survey) # the extra variables are gone
## [1] "height" "ideal_ht" "sleep" "fastest"
## [5] "weight_feel" "love_first" "extra_life" "seat"
## [9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
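When several variables need to go at once, another option is to keep only the columns whose names are not on a removal list. This is a sketch of that alternative on a made-up data frame; `toy` and `extras` are hypothetical names chosen for illustration:

```r
# toy is a hypothetical data frame used only for illustration
toy <- data.frame(height = c(60, 72),
                  height_ft = c(5, 6),
                  seat2 = c("Other", "Back"))

extras <- c("height_ft", "seat2")  # names of the columns to remove

# keep the columns whose names are NOT in extras;
# drop = FALSE keeps the result a data frame even if one column remains
toy <- toy[ , !(names(toy) %in% extras), drop = FALSE]

names(toy)
## [1] "height"
```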
7.8.2 Practice Exercises
1. Remove the variables hispanic and married from the mosaicData::CPS85 data frame.
2. Change the units of wage in mosaicData::CPS85 from dollars per hour to dollars per day. Assume an eight-hour working day.
3. For CPS85, create a new variable experGrp that has the following values:
- low for experience less than 10 years;
- medium for experience of at least 10 years but less than 25 years;
- high for experience at least 25 years.
4. Using the experGrp variable in the previous exercise, create the following tally of the experience groups of the employees:
## experGrp
##    low medium   high
##    179    217    138
5. You’ve made some changes to CPS85, but in fact you haven’t changed the original data frame in the mosaicData package; you’ve simply made your own copy, which should now be in your Global Environment. Since the Global Environment comes before any package on your search path, if you want to get to the original CPS85 you will have to refer to it as mosaicData::CPS85. Another option, though, is to remove the modified copy from your Global Environment. Go ahead and remove it now.
7.8.3 Solutions to Practice Exercises
1. Here’s one way to do it:
CPS85$hispanic <- NULL
CPS85$married <- NULL
2. Here’s one way to do it:
CPS85$wage <- CPS85$wage * 8
3. Here’s one way to do it:
CPS85$experGrp <- cut(CPS85$exper,
                      breaks = c(-Inf, 10, 25, Inf),
                      labels = c("low", "medium", "high"))
4. Use table(CPS85$experGrp).
5. Here’s what to do:
rm(CPS85)