7.8 New Variables from Old

Quite often you will want to transform one or more variables in a data frame. Transforming a variable means changing its values in a systematic way.

For example, you might want to measure height in feet rather than inches. Since there are 12 inches in a foot, you would divide each height by 12.
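A minimal sketch of the conversion, assuming heights are recorded in inches. The small `survey` data frame is a made-up stand-in for m111survey so that the block runs on its own.

```r
# Toy stand-in for m111survey (values are made up for illustration)
survey <- data.frame(height = c(60, 66, 72, 69))

# There are 12 inches in a foot, so divide each height by 12.
# Arithmetic on a vector is applied to every element at once.
survey$height / 12
```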

If you plan to use this new variable in your analysis later on, it might be a good idea to add it to the data frame as a new column, such as height_ft.
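One way to store the transformed values is to assign them to a new column with `$`. The sketch below again uses a made-up `survey` data frame in place of m111survey; the column name `height_ft` follows the variable listing later in this section.

```r
# Toy stand-in for m111survey (values are made up for illustration)
survey <- data.frame(height = c(60, 66, 72, 69))

# Assigning to a column name that does not exist yet creates the column
survey$height_ft <- survey$height / 12

names(survey)
```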

Another common need is to recode the values of a categorical variable. For example, you might want to divide people into two groups: those who prefer to sit in the back and those who don’t. This is a good time to use ifelse():
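A sketch of the two-group recode with `ifelse()`. The seat codes ("1_front", "2_middle", "3_back"), the new labels and the choice of `seat2` as the column name are assumptions made for illustration, and the data frame is a made-up stand-in for m111survey.

```r
# Toy stand-in for m111survey; the seat codes are assumed
survey <- data.frame(
  seat = factor(c("1_front", "2_middle", "3_back", "1_front", "3_back"))
)

# ifelse(test, yes, no) checks each element of the test in turn and
# picks the "yes" value where it is TRUE, the "no" value where it is FALSE
survey$seat2 <- ifelse(survey$seat == "3_back", "Back", "Not Back")

table(survey$seat2)
```

Note that the result is a character vector; wrap it in `factor()` if you need a factor.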

If you plan to re-code into a variable that involves more than two values, then you might want to look into the mapvalues() function from the plyr package (Wickham 2016):
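A sketch of a three-value recode with `plyr::mapvalues()`, which takes the vector along with `from` and `to` vectors of old and new values. The seat codes and the column name `seat3` are assumptions, and the data frame is a made-up stand-in for m111survey. So that the block runs even when plyr is not installed, it falls back on base R's `factor()` with `labels`, which gives the same result here.

```r
# Toy stand-in for m111survey; the seat codes are assumed
survey <- data.frame(
  seat = factor(c("1_front", "2_middle", "2_middle", "1_front", "3_back"))
)

old <- c("1_front", "2_middle", "3_back")
new <- c("Front", "Middle", "Back")

if (requireNamespace("plyr", quietly = TRUE)) {
  # For a factor, mapvalues() renames the levels and keeps their order
  survey$seat3 <- plyr::mapvalues(survey$seat, from = old, to = new)
} else {
  # Base-R fallback: relabel the levels directly
  survey$seat3 <- factor(survey$seat, levels = old, labels = new)
}

str(survey$seat3)
```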

##  Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...

The do-it-yourself approach is to write a loop. Remember switch()?
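A sketch of the loop approach, under the same assumed seat codes and with made-up data. `switch()` needs a character string to match against, so the factor is converted with `as.character()` first. The sketch also assumes there are no missing values.

```r
# Toy stand-in for m111survey; the seat codes are assumed
survey <- data.frame(
  seat = factor(c("1_front", "2_middle", "2_middle", "1_front", "3_back"))
)

seat <- as.character(survey$seat)   # switch() matches on strings, not factors
seat3 <- character(length(seat))    # empty character vector to fill in

for (i in seq_along(seat)) {
  # Names that begin with a digit must be quoted inside switch()
  seat3[i] <- switch(seat[i],
                     "1_front"  = "Front",
                     "2_middle" = "Middle",
                     "3_back"   = "Back")
}

str(seat3)
```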

##  chr [1:71] "Front" "Middle" "Middle" "Front" "Back" "Front" "Front" ...

The re-coding is done but the result is a character vector and not a factor. We have to make it a factor ourselves:
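The conversion can be sketched with `factor()`. Supplying `levels` explicitly sets the order Front, Middle, Back; without it, R would sort the levels alphabetically. The vector below is made up for illustration.

```r
# A character vector like the one produced by the loop (made-up values)
seat3 <- c("Front", "Middle", "Middle", "Front", "Back")

# Convert to a factor with the levels in a meaningful order
seat3 <- factor(seat3, levels = c("Front", "Middle", "Back"))

str(seat3)
```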

This seems like a lot of work!

Another common transformation involves turning a numerical variable into a factor. For example, we might need to classify people as:

  • Tall (height over 70 inches)
  • Medium (over 65 and up to 70 inches)
  • Short (65 inches or less)

The cut() function will be helpful.
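A sketch of how `cut()` might be called for this classification. The break points (with 0 as an assumed lower bound), the variable name `heightClass` and the made-up heights are all illustrative choices. With `right = TRUE`, each interval includes its upper break, so a height of exactly 65 lands in Short and exactly 70 lands in Medium.

```r
# Toy stand-in for m111survey (values chosen to sit on and around the breaks)
survey <- data.frame(height = c(64, 65, 67.5, 70, 72))

# Intervals are (0, 65], (65, 70] and (70, Inf)
heightClass <- cut(survey$height,
                   breaks = c(0, 65, 70, Inf),
                   labels = c("Short", "Medium", "Tall"),
                   right = TRUE)

data.frame(height = survey$height, heightClass)
```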

##  Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 2 3 1 2 ...

Setting right = TRUE (which is also the default for cut()) indicates that each interval includes its upper bound but not its lower bound. Thus, a person with a height of 70 inches is classed as Medium, not Tall.

7.8.1 Getting Rid of Variables

We have added several variables to m111survey. In order to remove them (or any other variables we don’t want) we can assign them the value NULL.
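A sketch of the removal on a made-up stand-in for m111survey: two variables are added and then deleted again by assigning `NULL` to them.

```r
# Toy stand-in for m111survey; the seat codes are assumed
survey <- data.frame(
  height = c(60, 66, 72),
  seat   = factor(c("1_front", "2_middle", "3_back"))
)

# Add two variables ...
survey$height_ft <- survey$height / 12
survey$seat2 <- ifelse(survey$seat == "3_back", "Back", "Not Back")
names(survey)

# ... and remove them again by assigning NULL
survey$height_ft <- NULL
survey$seat2 <- NULL
names(survey)
```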

##  [1] "height"          "ideal_ht"        "sleep"          
##  [4] "fastest"         "weight_feel"     "love_first"     
##  [7] "extra_life"      "seat"            "GPA"            
## [10] "enough_Sleep"    "sex"             "diff.ideal.act."
## [13] "height_ft"       "seat2"           "seat3"

After the removal, only the original variables remain:

##  [1] "height"          "ideal_ht"        "sleep"          
##  [4] "fastest"         "weight_feel"     "love_first"     
##  [7] "extra_life"      "seat"            "GPA"            
## [10] "enough_Sleep"    "sex"             "diff.ideal.act."

7.8.2 Practice Exercises

  1. Remove the variables hispanic and married from the mosaicData::CPS85 data frame.

  2. Change the units of wage in mosaicData::CPS85 from dollars per hour to dollars per day. Assume an eight-hour working day.

  3. For CPS85, create a new variable experGrp that has the following values:

    • low for experience less than 10 years;
    • medium for experience of at least 10 years but less than 25 years;
    • high for experience of at least 25 years.

  4. Using the experGrp variable from the previous exercise, create the following tally of the employees by experience group:

    ## experGrp
    ##    low medium   high 
    ##    179    217    138
  5. You’ve made some changes to CPS85, but in fact you haven’t changed the original data frame in the mosaicData package. You’ve simply made your own copy, which should now be in your Global Environment. Since the Global Environment comes before any package on your search path, the name CPS85 now finds your copy first; to get to the original you have to refer to it as mosaicData::CPS85. Another option, though, is to remove the modified copy from your Global Environment. Go ahead and remove it now.

7.8.3 Solutions to Practice Exercises

  1. Here’s one way to do it:

  2. Here’s one way to do it:

  3. Here’s one way to do it:

  4. Use table(CPS85$experGrp).

  5. Here’s what to do:
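The solutions above can be sketched together as one script. This is only one possible approach. It assumes the experience variable in CPS85 is named `exper`, and `experTally` is a name introduced here for illustration. If mosaicData is not installed, the script falls back on a small made-up data frame with the same column names so that it still runs.

```r
# Use the real data if mosaicData is available; otherwise use a
# made-up stand-in (values are for illustration only)
if (requireNamespace("mosaicData", quietly = TRUE)) {
  CPS85 <- mosaicData::CPS85
} else {
  CPS85 <- data.frame(
    wage     = c(9.0, 5.5, 12.0, 7.25),
    exper    = c(3, 10, 25, 40),
    hispanic = c("NH", "NH", "Hisp", "NH"),
    married  = c("Married", "Single", "Married", "Single")
  )
}

# 1. Remove hispanic and married by assigning NULL
CPS85$hispanic <- NULL
CPS85$married <- NULL

# 2. Dollars per hour -> dollars per eight-hour day
CPS85$wage <- CPS85$wage * 8

# 3. right = FALSE makes each interval include its lower break:
#    [-Inf, 10), [10, 25), [25, Inf)
CPS85$experGrp <- cut(CPS85$exper,
                      breaks = c(-Inf, 10, 25, Inf),
                      labels = c("low", "medium", "high"),
                      right = FALSE)

# 4. Tally the employees by experience group
experTally <- table(CPS85$experGrp)
experTally

# 5. Remove the modified copy from the current environment
rm(CPS85)
```

Using `right = FALSE` in step 3 matters: an employee with exactly 10 years of experience must count as medium, and one with exactly 25 years as high.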

References

Wickham, Hadley. 2016. Plyr: Tools for Splitting, Applying and Combining Data. https://CRAN.R-project.org/package=plyr.