2.6 Subsetting with Logical Vectors

The subsetting we have seen up to now involves specifying the indices of the elements we would like to select from the original vector. It is also possible to say, for each element, whether or not it is to be included in our selection. This is accomplished by means of logical vectors.

Recall our heights vector:

heights
## Scarecrow    Tinman      Lion   Dorothy      Toto       Boq 
##        73        70        69        60        NA        46

Let’s say that we want the heights of Scarecrow, Tinman and Dorothy. We can use a logical vector to do this:

wanted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
heights[wanted]
## Scarecrow    Tinman   Dorothy 
##        73        70        60

The TRUE values at indices 1, 2 and 4 of wanted tell R that we want the elements of heights at indices 1, 2 and 4. The FALSE values say: “don’t include this element!”
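As a small sketch of the same idea, negating the logical vector with ! selects exactly the elements that were left out before. The heights vector is rebuilt here from the printout above so that the example runs on its own; the name others is introduced only for illustration.

```r
# heights, reconstructed from the printout above
heights <- c(Scarecrow = 73, Tinman = 70, Lion = 69,
             Dorothy = 60, Toto = NA, Boq = 46)
wanted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# !wanted flips every TRUE to FALSE and every FALSE to TRUE
others <- heights[!wanted]
names(others)
## [1] "Lion" "Toto" "Boq"
```

Every element of heights lands in exactly one of heights[wanted] and heights[!wanted].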

Logical subsetting becomes especially powerful when the logical vector is itself computed with comparison and Boolean operators.

For example, in order to select those persons whose heights exceed a certain amount, we might say something like this:

# heights of some people:
people <- c(55, 64, 67, 70, 63, 72)
tall <- (people >= 70)
tall
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE
people[tall]
## [1] 70 72

As you can see, the tall vector specifies which elements we would like to select from the people vector.

We need not define the tall vector along the way. It is quite common to see something like the following:

people[people >= 70]
## [1] 70 72

I like to pronounce the above as:

people, where people is at least 70

The word “where” in the above phrase corresponds to the subsetting operator.

Your subsetting logical vector need not have been constructed with the original vector in mind. Consider the following example:

age <- c(23, 21, 22, 25, 63)
height <- c(68, 67, 71, 70, 69)
age[height < 70]
## [1] 23 21 63

Here the selection is done from the age vector, using a logical vector that was constructed from height—another vector altogether. It concisely expresses the idea:

the ages of people whose height is less than 70

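Conditions can be combined with Boolean operators, so a selection can be as complex as you need. Consider the following, which selects the heights of people who are under 60 years old and who like Toto: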

age <- c(23, 21, 22, 25, 63)
height <- c(68, 67, 71, 70, 69)
likesToto <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
height[age < 60 & likesToto]
## [1] 68 67
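The other Boolean operators work the same way. The following sketch reuses the three vectors above and combines | (or) with ! (not); the condition itself is just an illustrative assumption, chosen to show the operators.

```r
age <- c(23, 21, 22, 25, 63)
height <- c(68, 67, 71, 70, 69)
likesToto <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# ages of people who are at least 70 inches tall OR who don't like Toto
age[height >= 70 | !likesToto]
## [1] 22 25
```

It helps to evaluate the condition on its own first (height >= 70 | !likesToto) and check that the resulting logical vector marks the elements you expect.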

2.6.1 Counting

Logical subsetting provides a convenient way to count the elements of a vector that possess a given property. For example, to find out how many elements of people are less than 70 we could say:

length(people[people < 70])
## [1] 4
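A quick sanity check, sketched below, is to count the elements that satisfy the condition and the elements that do not, and confirm that the two counts add up to the length of the vector. The name short is introduced only for this illustration.

```r
people <- c(55, 64, 67, 70, 63, 72)
short <- (people < 70)

length(people[short])    # elements less than 70
## [1] 4
length(people[!short])   # elements that are at least 70
## [1] 2

# with no NA values present, the two counts account for every element
length(people[short]) + length(people[!short]) == length(people)
## [1] TRUE
```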

2.6.2 Cautions about NA

You should be aware of the effect of NA-values on subsetting.

heights
## Scarecrow    Tinman      Lion   Dorothy      Toto       Boq 
##        73        70        69        60        NA        46
tall <- (heights > 65)
tall
## Scarecrow    Tinman      Lion   Dorothy      Toto       Boq 
##      TRUE      TRUE      TRUE     FALSE        NA     FALSE

Since Toto’s height was missing, R can’t say whether or not he was more than 65 inches tall. Hence it assigns NA to the Toto-element of the tall vector.

When we subset using this vector we get an odd result:

heights[tall]
## Scarecrow    Tinman      Lion      <NA> 
##        73        70        69        NA

Since R doesn’t know whether or not to select Toto, it records its indecision by including an NA in the result. That NA, however, is not the NA for Toto’s height in the vector heights, so it can’t inherit the “Toto” name. Since it has no name, R presents its name as <NA>.

If we try to count the number of tall persons, we get a misleading result:

length(heights[tall])
## [1] 4

We would have preferred something like:

“Three, with another one undecided.”

Counting is one of those situations in which we might wish to remove NA values at the start. If the vector is small we could remove them by hand, e.g.:

knownHeights <- heights[-5]  # remove Toto
tall <- (knownHeights > 65)
length(knownHeights[tall])
## [1] 3

For longer vectors the above approach won’t be practical. Instead we may use the is.na() function.

is.na(heights)
## Scarecrow    Tinman      Lion   Dorothy      Toto       Boq 
##     FALSE     FALSE     FALSE     FALSE      TRUE     FALSE

Then we may select those elements that are not NA:

knownHeights <- heights[!is.na(heights)]
knownHeights
## Scarecrow    Tinman      Lion   Dorothy       Boq 
##        73        70        69        60        46
length(knownHeights[knownHeights > 65])
## [1] 3
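The two steps can also be folded into a single condition. This is a sketch, with heights rebuilt from the printout above and the name tallKnown introduced for illustration. It relies on the fact that FALSE & NA evaluates to FALSE in R, so Toto's element of the combined condition is FALSE rather than NA.

```r
heights <- c(Scarecrow = 73, Tinman = 70, Lion = 69,
             Dorothy = 60, Toto = NA, Boq = 46)

# keep an element only if its height is known AND exceeds 65
tallKnown <- heights[!is.na(heights) & heights > 65]
names(tallKnown)
## [1] "Scarecrow" "Tinman"    "Lion"
length(tallKnown)
## [1] 3
```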

2.6.3 Which, Any, All

There are several functions on logical vectors that are worth keeping in your back pocket:

  • which()
  • any()
  • all()

2.6.3.1 which()

Applied to a logical vector, the which() function returns the indices of the elements that are TRUE:

boolVec <- c(TRUE,TRUE,FALSE,TRUE)
which(boolVec)
## [1] 1 2 4

Thus if we want to know the indices of heights where the heights are more than 65, then we write:

which(heights > 65)
## Scarecrow    Tinman      Lion 
##         1         2         3

(Recall that heights is a named vector. The logical vector heights > 65 inherited these names and passed them on to the result of which().)

Note also that Toto’s NA height was ignored by which().
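Because which() skips NA values, its result can serve as an index vector that avoids the stray NA seen earlier. The following is a sketch, with heights rebuilt from the printout above and the name tallIndices introduced for illustration.

```r
heights <- c(Scarecrow = 73, Tinman = 70, Lion = 69,
             Dorothy = 60, Toto = NA, Boq = 46)

tallIndices <- which(heights > 65)   # indices 1, 2 and 3; Toto is skipped

# subsetting by these indices yields no NA
names(heights[tallIndices])
## [1] "Scarecrow" "Tinman"    "Lion"

# counting the indices counts only those known to be tall
length(tallIndices)
## [1] 3
```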

2.6.3.2 any()

Is anyone more than 71 inches tall? any() will tell us:

heights
## Scarecrow    Tinman      Lion   Dorothy      Toto       Boq 
##        73        70        69        60        NA        46
any(heights > 71)
## [1] TRUE

Yes: the Scarecrow is more than 71 inches tall.

We can use any() along with the equality Boolean operator == to determine whether or not a given value appears in a given vector:

vec <- c("Dorothy", "Tin Man", "Scarecrow", "Glinda")
any(vec == "Tin Man")
## [1] TRUE
any(vec == "Wizard")
## [1] FALSE

The above question occurs so frequently that R provides the %in% operator as a shortcut:

"Tin Man" %in% vec
## [1] TRUE
"Wizard" %in% vec
## [1] FALSE
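The left-hand side of %in% may itself be a vector, in which case the result is a logical vector with one answer per element. That makes %in% convenient for subsetting as well. In the sketch below, the vector candidates is a made-up example.

```r
vec <- c("Dorothy", "Tin Man", "Scarecrow", "Glinda")
candidates <- c("Tin Man", "Wizard")   # hypothetical values to look up

# one TRUE/FALSE per element of candidates
candidates %in% vec
## [1]  TRUE FALSE

# keep the elements of vec that appear in a given set of names
vec[vec %in% c("Glinda", "Dorothy", "Wizard")]
## [1] "Dorothy" "Glinda"
```

Note that the result keeps the order of vec, not the order of the set being matched against.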

2.6.3.3 all()

Is everyone more than 71 inches tall?

all(heights > 71)
## [1] FALSE

2.6.3.4 NA-Caution

Is everyone more than 40 inches tall?

all(heights > 40)
## [1] NA

Everyone with a known height is taller than 40 inches, but because Toto’s height is NA, R can’t say whether all the heights are bigger than 40.
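If you are willing to ignore the missing values, both any() and all() accept an na.rm argument. The sketch below rebuilds heights from the printout above. It also shows that any() can return NA, which happens when no element is TRUE but at least one is NA.

```r
heights <- c(Scarecrow = 73, Tinman = 70, Lion = 69,
             Dorothy = 60, Toto = NA, Boq = 46)

# drop the NA before deciding: everyone with a known height exceeds 40
all(heights > 40, na.rm = TRUE)
## [1] TRUE

# nobody is known to exceed 80, but Toto might, so R can't say
any(heights > 80)
## [1] NA
any(heights > 80, na.rm = TRUE)
## [1] FALSE
```

Earlier, any(heights > 71) returned TRUE despite the NA because a single TRUE settles the question, and all(heights > 71) returned FALSE because a single FALSE does.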

2.6.4 Practice Exercises

Consider the following vectors:

person <- c("Abe", "Bettina", "Candace", "Devadatta", "Esmeralda")
numberKids <- c(2, 1, 0, 2, 3)
yearsEducation <- c(12, 16, 13, 14, 18)
hasPets <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)

Think of these vectors as providing information about six people.

  1. Write a command that produces the names of people who have more than 1 child.

  2. Write a command that produces the numbers of children of people who have a pet.

  3. Write a command that produces the years of education of people who have at least 13 years of education.

  4. Write a command that produces the names of people who have more than one child and fewer than 15 years of education.

  5. Write a command that produces the names of people who don’t have pets.

  6. Write a command that produces the number of people who have pets.

  7. Write a command that produces the number of people who don’t have pets.

  8. Write a command that says whether or not there is someone who has more than 15 years of education and at least one child, but doesn’t have any pets.

2.6.5 Solutions to the Practice Exercises

  1. person[numberKids > 1]

  2. numberKids[hasPets]

  3. yearsEducation[yearsEducation >= 13]

  4. person[numberKids > 1 & yearsEducation < 15]

  5. person[!hasPets]

  6. Here is one way. We’ll learn an easier way in the next section.

    length(person[hasPets])
    ## [1] 3

  7. Here is one way. We’ll learn an easier way in the next section.

    length(person[!hasPets])
    ## [1] 3

  8. any(yearsEducation > 15 & numberKids >= 1 & !hasPets)
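As a quick check, solutions 1, 3 and 4 can be run as written, since they involve only person, numberKids and yearsEducation. The following is a sketch that redefines those three vectors so that it runs on its own.

```r
person <- c("Abe", "Bettina", "Candace", "Devadatta", "Esmeralda")
numberKids <- c(2, 1, 0, 2, 3)
yearsEducation <- c(12, 16, 13, 14, 18)

# Solution 1: names of people with more than one child
person[numberKids > 1]
## [1] "Abe"       "Devadatta" "Esmeralda"

# Solution 3: years of education that are at least 13
yearsEducation[yearsEducation >= 13]
## [1] 16 13 14 18

# Solution 4: more than one child and fewer than 15 years of education
person[numberKids > 1 & yearsEducation < 15]
## [1] "Abe"       "Devadatta"
```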