8.4 Factor Variables in Plotting

It is worthwhile to revisit factor variables (see Section 7.4.3) and to consider their role in the appearance of plots. We’ll do this by way of an example.

Consider the data frame firesetting in the tigerData package:

As we see from the Help file, firesetting offers information about a large sample of high-school students. Some of them are known to be arsonists, while of course most are not. The purpose of gathering the data was to determine which characteristics of a child might be risk factors for developing into a fire-setter. Among the variables of interest are:

  • race, the ethnicity of the student, a factor with three levels: “white”, “black”, and “other”;
  • school.attitude, a scaled score on a personality inventory in which high scores indicate a poor attitude toward school;
  • fires, a factor variable with levels “0” (does not set fires) and “1” (sets fires).

We might study the relationship between these variables with some box-plots via the code below. The results appear as Figure 8.33.
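A plot of this kind can be sketched as follows. This is a minimal illustration, not necessarily the code behind the figure: it assumes the ggplot2 package is installed, and because tigerData may not be available, it builds a small simulated data frame (the hypothetical name `toy`) with the same three variable names.

```r
library(ggplot2)

# Hypothetical stand-in for tigerData::firesetting (simulated values only)
set.seed(1)
toy <- data.frame(
  race = factor(sample(c("white", "black", "other", NA), 60, replace = TRUE),
                levels = c("white", "black", "other")),
  fires = factor(sample(c("0", "1"), 60, replace = TRUE)),
  school.attitude = rnorm(60, mean = 50, sd = 10)
)

# Boxplots of school attitude by race, split by fire-setting status,
# with the individual values jittered on top of the boxes
p <- ggplot(toy, aes(x = race, y = school.attitude, fill = fires)) +
  geom_boxplot(outlier.shape = NA) +   # hide outliers; the points show them
  geom_point(shape = 21,
             position = position_jitterdodge(jitter.width = 0.2))
p
```

Note that the rows with a missing race get their own `NA` position on the x-axis, and that the legend shows the raw levels "0" and "1".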

Figure 8.33: Boxplots with jittered individual values. School attitudes are generally worse among fire-setters.

We could improve the appearance of the plot with respect to two of the variables involved:

  • fires: We should display values that are more meaningful to a human viewer than 0 and 1.
  • race: Let’s change the names of the values somewhat, keeping their current order along the x-axis.

For fires, we simply need to map the current values onto others that we prefer. Currently the levels are:

## [1] "0" "1"

We can re-map as follows:
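One base-R way to do such a re-mapping is sketched here. The vector `fires` is a hypothetical stand-in for `firesetting$fires`, and the new labels "No" and "Yes" are our own assumption; any pair of readable labels would do. (The hint in the exercises suggests `plyr::mapvalues()` as an alternative.)

```r
# Hypothetical stand-in for firesetting$fires: a factor with levels "0" and "1"
fires <- factor(c("0", "1", "0", "0", "1"))
levels(fires)

# Re-label the levels: "0" becomes "No" and "1" becomes "Yes"
# (these labels are an assumption chosen for illustration)
betterFires <- factor(fires, levels = c("0", "1"), labels = c("No", "Yes"))
levels(betterFires)

# Cross-tabulate old against new values to confirm the mapping
table(fires, betterFires)
```

Because `levels` and `labels` are matched position by position, the order of the two vectors determines which old value receives which new label.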

For race, we’d like to substitute

  • “White” for “white”;
  • “AfrAm” for “black”;
  • “Other” for “other”;
  • “Unknown” instead of NA.

We’d like the order along the horizontal axis to remain as it is.

We might begin by replacing the NA-values with “Unknown”, as follows:
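A minimal sketch of that attempt, using a short hypothetical stand-in for `firesetting$race` (the real data frame is much larger). With plain base-R assignment, R signals the problem as an "invalid factor level" warning and leaves the missing values as they were.

```r
# Hypothetical stand-in for firesetting$race, including one missing value
tempRace <- factor(c("white", "other", NA, "black"),
                   levels = c("white", "black", "other"))

# Try to overwrite the NA entries with a value that is not among the levels.
# Base R refuses and signals: invalid factor level, NA generated
tempRace[is.na(tempRace)] <- "Unknown"

# The missing value is still missing
sum(is.na(tempRace))
```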

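R objects! (With a plain base-R assignment the complaint takes the form of an “invalid factor level” warning, and the NA values stay NA.) That’s because race is a factor with only three “possible values”, as given by its levels: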

## [1] "white" "black" "other"

“Unknown” is not considered a “possible value” for tempRace, so any attempt to set that value in any element of tempRace will be resisted.

We can overcome this resistance by coercing tempRace into a mere character-vector:

Now tempRace doesn’t come with a limit on its “possible values”. We try again:
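The coercion and the second attempt can be sketched together as below, again on a short hypothetical stand-in for `firesetting$race`.

```r
# Hypothetical stand-in for firesetting$race
tempRace <- factor(c("white", "other", NA, "black"),
                   levels = c("white", "black", "other"))

# Drop the factor structure: the result is an ordinary character vector
tempRace <- as.character(tempRace)

# Now any string is allowed, so the replacement goes through
tempRace[is.na(tempRace)] <- "Unknown"

# Distinct values, in order of first appearance
unique(tempRace)
```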

This apparently worked. Let’s check:

## [1] "white"   "other"   "Unknown" "black"

So far, so good. Now let’s re-map the other three values:
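One way to re-map several values at once in base R is a named lookup vector, sketched below. The name `lookup` is hypothetical, and this is not necessarily the approach used for the figures; `plyr::mapvalues()` would do the same job.

```r
# Hypothetical stand-in for the character vector produced in the previous step
tempRace <- c("white", "other", "Unknown", "black")

# Names are the old values; elements are the replacements
lookup <- c(white = "White", black = "AfrAm",
            other = "Other", Unknown = "Unknown")

# Index the lookup vector by the old values, then drop the names
tempRace2 <- unname(lookup[tempRace])

# Distinct values, in order of first appearance
unique(tempRace2)
```

Every value that occurs in `tempRace` must appear among the names of `lookup`; a value with no matching name would come back as NA.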

Let’s check that it worked:

## [1] "White"   "Other"   "Unknown" "AfrAm"

Great. Now let’s add tempRace2 as a new variable in firesetting. In that data frame, it will be called betterRace. Here’s the code to do this:

Let’s try the graph again. We will need to modify the code a bit, so that we’re using the new and improved variable betterRace, as well as the new variable betterFires. Figure 8.34 is the result.

Figure 8.34: We have re-mapped the fires and race variables.

This looks better, but now the order along the x-axis isn’t what we wanted. Recall that betterRace is only a character vector. Since it doesn’t have levels that specify a particular order, ggplot2 falls back on alphabetical order along the x-axis.

Accordingly we should convert betterRace to a factor variable, and be careful to set the levels in the order that we want:
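The conversion can be sketched as follows, with a short hypothetical vector standing in for the `betterRace` column. The first call shows the alphabetical default; the second fixes the order explicitly.

```r
# Hypothetical stand-in for firesetting$betterRace (a character vector)
betterRace <- c("White", "Other", "Unknown", "AfrAm", "White")

# Without explicit levels, factor() sorts the distinct values alphabetically
levels(factor(betterRace))

# Supplying levels sets the order that plots will follow
evenBetterRace <- factor(betterRace,
                         levels = c("White", "AfrAm", "Other", "Unknown"))
levels(evenBetterRace)
```

Only the ordering of the levels changes; the values stored in the vector are the same as before.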

Now try again. Note that in the code below we switch to evenBetterRace. See Figure 8.35 for the result.

Figure 8.35: Now the order of the race-values is correct!

This works!

8.4.1 Practice Exercises

  1. With mosaicData::KidsFeet, make the following graph. (Hint: Use plyr::mapvalues().)

  2. With mosaicData::TenMileRace, make the following graph. (Hint: Use cut() to make age groups. The fill for the boxes is the ever-popular "burlywood".)

  3. With mosaicData::Gestation, make the following graph. (Hint: Use !is.na() to select the rows of Gestation where smoke is not NA. Then use plyr::mapvalues() on smoke.)

8.4.2 Solutions to Practice Exercises

  1. Here’s the code:

  2. Here’s the code:

  3. Here’s the code: