8.4 Factor Variables in Plotting
It is worthwhile to reconsider factor variables (see 7.4.3) and to consider their role in the appearance of plots. We’ll do this by way of an example.
Consider the data frame firesetting in the tigerData package:
data("firesetting", package = "tigerData")
?tigerData::firesetting
As we see from the Help file, firesetting offers information about a large sample of high-school students. Some of them are known to be arsonists, while of course most are not. The data were gathered to determine which characteristics of a child might be risk factors for developing into a fire-setter. Among the variables of interest are:
- race, the ethnicity of the student, a factor with three levels: “white,” “black,” and “other”;
- school.attitude, a scaled score on a personality inventory in which high scores indicate a poor attitude toward school;
- fires, a factor variable with levels “0” (does not set fires) and “1” (sets fires).
We might study the relationship between these variables with some box-plots via the code below. The results appear as Figure 8.33.
ggplot(firesetting, aes(x = race, y = school.attitude)) +
  geom_boxplot(fill = "burlywood",
               outlier.alpha = 0) +
  geom_jitter(width = 0.20, size = 0.1) +
  facet_wrap(~ fires, nrow = 2) +
  labs(x = "Ethnicity",
       y = "School-Attitude Score")
We could improve the appearance of the plot with respect to two of the variables involved:
- fires: We should display values that are more meaningful to a human viewer than 0 and 1.
- race: Let’s change the names of the values somewhat, keeping their current order along the x-axis.
For fires, we simply need to map the current values onto others that we prefer. Currently the levels are:
levels(firesetting$fires)
## [1] "0" "1"
We can re-map as follows:
betterFires <- plyr::mapvalues(firesetting$fires, from = c("0", "1"),
                               to = c("no fires", "sets fires"))
firesetting$betterFires <- betterFires
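The same re-mapping can be sketched in base R on a toy factor. The vector fires below is made up for illustration and is not the firesetting data. Assigning to levels() is a base-R alternative to plyr::mapvalues(), shown here only for comparison.

```r
# Toy factor standing in for firesetting$fires (made-up values).
fires <- factor(c("0", "1", "0", "0", "1"))

# Copy the factor, then rename its levels in place.
# The new names are matched to the old levels by position.
betterFires <- fires
levels(betterFires) <- c("no fires", "sets fires")

levels(betterFires)
as.character(betterFires)
```

Because the names are matched by position, it is worth checking levels(fires) first to make sure the new names are listed in the same order.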
For race, we’d like to substitute
- “White” for “white”;
- “AfrAm” for “black”;
- “Other” for “other”;
- “Unknown” instead of NA.
We’d like the order along the horizontal axis to remain as it is.
We might begin by replacing the NA-values with “Unknown,” as follows:
tempRace <- firesetting$race
tempRace[is.na(tempRace)] <- "Unknown"
We get a warning (invalid factor level, NA generated), and the missing values stay NA! That’s because race is a factor with only three “possible values,” as given by its levels:
levels(tempRace)
## [1] "white" "black" "other"
Since “Unknown” is not considered a “possible value” for tempRace, any attempt to set that value in any elements of tempRace will be resisted.
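This resistance can be seen on a small scale. The sketch below uses a made-up factor rather than firesetting$race. R refuses the new value by issuing a warning and leaving the entry as NA, and suppressWarnings() is used only to keep the demonstration quiet.

```r
# Toy factor with a missing value (made-up data, not firesetting$race).
race <- factor(c("white", "black", NA, "other"),
               levels = c("white", "black", "other"))

# "Unknown" is not among the levels, so the assignment is rejected:
# R warns and the entry remains NA.
attempt <- race
suppressWarnings(attempt[is.na(attempt)] <- "Unknown")

levels(attempt)       # still only the three original levels
sum(is.na(attempt))   # the missing value is still missing
```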
We can overcome this resistance by coercing tempRace into a mere character-vector:
tempRace <- as.character(tempRace)
Now tempRace doesn’t come with a limit on its “possible values.” We try again:
tempRace[is.na(tempRace)] <- "Unknown"
This apparently worked. Let’s check:
unique(tempRace)
## [1] "white" "other" "Unknown" "black"
So far, so good. Now let’s re-map the other three values:
tempRace2 <- plyr::mapvalues(tempRace,
                             from = c("white", "black", "other"),
                             to = c("White", "AfrAm", "Other"))
Let’s check that it worked:
unique(tempRace2)
## [1] "White" "Other" "Unknown" "AfrAm"
Great. Now let’s add tempRace2 as a new variable into firesetting. In that data frame, it will be called betterRace. Here’s the code to do this:
firesetting$betterRace <- tempRace2
Let’s try the graph again. We will need to modify the code a bit, so that we’re using the new and improved variable betterRace, as well as the new variable betterFires. Figure 8.34 is the result.
ggplot(firesetting, aes(x = betterRace, y = school.attitude)) +
  geom_boxplot(fill = "burlywood",
               outlier.alpha = 0) +
  geom_jitter(width = 0.20, size = 0.1) +
  facet_wrap(~ betterFires, nrow = 2) +
  labs(x = "Ethnicity",
       y = "School-Attitude Score")
This looks better, but now the order along the x-axis isn’t as we desired. Recall that betterRace is only a character vector. Since it doesn’t have levels that specify a particular order, ggplot2 uses alphabetical order as the default ordering along the x-axis.
Accordingly we should convert betterRace to a factor variable, and be careful to set the levels in the order that we want:
tempRace <- firesetting$betterRace
betterRace2 <- factor(tempRace,
                      levels = c("White", "AfrAm", "Other", "Unknown"))
firesetting$evenBetterRace <- betterRace2
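The effect of the levels argument can be checked on a small, made-up character vector (an illustrative stand-in for betterRace, not the real data). Without levels, factor() sorts the distinct values, whereas with levels, the stated order is kept.

```r
# Made-up character vector standing in for firesetting$betterRace.
x <- c("White", "Other", "Unknown", "AfrAm", "White")

# With no levels supplied, factor() sorts the distinct values.
defaultOrder <- levels(factor(x))

# Supplying levels fixes the order explicitly.
chosenOrder <- levels(factor(x,
                             levels = c("White", "AfrAm", "Other", "Unknown")))

defaultOrder
chosenOrder
```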
Now try again. Note that in the code below we switch to evenBetterRace. See Figure 8.35 for the result.
ggplot(firesetting, aes(x = evenBetterRace, y = school.attitude)) +
  geom_boxplot(fill = "burlywood",
               outlier.alpha = 0) +
  geom_jitter(width = 0.20, size = 0.1) +
  facet_wrap(~ betterFires, nrow = 2) +
  labs(x = "Ethnicity",
       y = "School-Attitude Score")
This works!
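For comparison, here is a sketch of a shorter route that never leaves the world of factors. It is an alternative to the steps above, not the approach taken in this section, and it uses a made-up factor in place of firesetting$race. The idea is to declare “Unknown” as a level before assigning it, and then to rename all the levels by position.

```r
# Toy factor (made-up data) with levels in the order white, black, other.
race <- factor(c("white", "other", NA, "black", "white"),
               levels = c("white", "black", "other"))

# Step 1: declare "Unknown" as a legal level, appended after the others.
levels(race) <- c(levels(race), "Unknown")

# Step 2: the assignment to the missing entries is now accepted.
race[is.na(race)] <- "Unknown"

# Step 3: rename the levels by position; their order is preserved.
levels(race) <- c("White", "AfrAm", "Other", "Unknown")

levels(race)
as.character(race)
```

Since the result is still a factor with its levels in the desired order, it could be used on the x-axis without a separate call to factor().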
8.4.1 Practice Exercises
1. With mosaicData::KidsFeet, make the following graph. (Hint: Use plyr::mapvalues().)
2. With mosaicData::TenMileRace, make the following graph. (Hint: Use cut() to make age groups. The fill for the boxes is the ever-popular "burlywood".)
3. With mosaicData::Gestation, make the following graph. (Hint: Use !is.na() to select the rows of Gestation where smoke is not NA. Then use plyr::mapvalues() on smoke.)
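As a warm-up for the cut() hint, here is a minimal sketch on made-up ages (the vector age and the particular breaks are illustrative only). Note the right argument: by default cut() closes each interval on the right, so setting right = FALSE is one way to make a 20-year-old land in the “20s” group.

```r
# Made-up ages, including values that sit exactly on a break.
age <- c(19, 20, 29, 30, 45)

# right = FALSE makes each interval closed on the left, e.g. [20, 30),
# so a 20-year-old is counted in the "20s" group.
ageGroup <- cut(age,
                breaks = c(-Inf, 20, 30, 40, 50, Inf),
                labels = c("<20", "20s", "30s", "40s", "50+"),
                right = FALSE)

as.character(ageGroup)
levels(ageGroup)
```

The result is a factor whose levels follow the order of the labels, which is what keeps the groups in a sensible order on a plot axis.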
8.4.2 Solutions to Practice Exercises
1. Here’s the code:
data("KidsFeet", package = "mosaicData")
tempBiggerfoot <- KidsFeet$biggerfoot
biggerfoot2 <- plyr::mapvalues(tempBiggerfoot, from = c("L", "R"),
                               to = c("left foot", "right foot"))
KidsFeet$betterBiggerfoot <- biggerfoot2
tempDomhand <- KidsFeet$domhand
domhand2 <- plyr::mapvalues(tempDomhand, from = c("L", "R"),
                            to = c("left hand", "right hand"))
KidsFeet$betterDomhand <- domhand2
ggplot(KidsFeet, aes(x = betterBiggerfoot)) +
  geom_bar(aes(fill = betterDomhand), position = "dodge") +
  labs(x = "the bigger foot",
       title = "Foot-size and Handedness for 39 Children",
       subtitle = paste0("When your right foot is bigger, ",
                         "are you more likely to be right-handed?"))
2. Here’s the code:
data("TenMileRace", package = "mosaicData")
tempSex <- TenMileRace$sex
sex2 <- plyr::mapvalues(tempSex, from = c("F", "M"),
                        to = c("female", "male"))
TenMileRace$betterSex <- sex2
# right = FALSE closes each interval on the left, so that
# a 20-year-old falls in "20s" rather than in "<20"
ageGroup <- cut(TenMileRace$age,
                breaks = c(-Inf, 20, 30, 40, 50, 60, 70, Inf),
                labels = c("<20", "20s", "30s", "40s", "50s", "60s", "70+"),
                right = FALSE)
TenMileRace$ageGroup <- ageGroup
ggplot(TenMileRace, aes(x = ageGroup, y = time)) +
  geom_boxplot(fill = "burlywood") +
  facet_grid(betterSex ~ .) +
  labs(x = "Age Group",
       y = "net time to finish (sec)",
       title = "Ten-Mile Race Times")
3. Here’s the code:
data("Gestation", package = "mosaicData")
Gestation <- subset(Gestation, !is.na(smoke))
tempSmoke <- Gestation$smoke
smoke2 <- plyr::mapvalues(tempSmoke, from = 0:3,
                          to = c("never", "smokes now",
                                 "until curr. preg.", "once smoked"))
Gestation$betterSmoke <- smoke2
ggplot(Gestation, aes(x = factor(betterSmoke), y = wt)) +
  geom_boxplot(fill = "burlywood") +
  labs(x = "smoking status of mother",
       y = "birth weight (ounces)",
       title = "Birth Weights",
       subtitle = "Children of smoking mothers have lower birth-weights.")