6 Simulation


Figure 6.1: A monkey typing randomly manages to type out a line of Shakespeare. What is the chance that the monkey would produce this particular line? Source: http://www.gloryofkings.com/?p=189.

6.1 Probability and Random Variables

Your favorite college basketball team is in the NCAA Tournament. What is the chance that it will make the Final Four?

An insurance company provides flood insurance to homeowners. How much can the company expect to pay out in insurance claims in the coming year?

What is the chance of a large asteroid striking our planet in the next ten years? In the next ten million years?

All of the above questions involve random processes—some of them rather complex. It is the aim of this Chapter to apply our knowledge of vectors, functions and flow control to give approximate answers to questions involving chance processes like these.

Sometimes we will be interested in finding the probability that an event of interest occurs: for example, the event that your team makes the Final Four, or the event that the Earth is struck by a large asteroid sometime in the next ten years.

One popular definition of probability goes as follows:

If \(E\) is an event, then the probability that \(E\) occurs is the long-term proportion of times that the event occurs, if we could repeat the random process many, many times.

To be more precise, imagine that you can repeat the random process \(n\) times. (Imagine that your team gets into the NCAA Tournament \(n\) times, for example.) Each time, the event of interest will either occur or not occur. Count the number of times that the event occurs, and divide by \(n\): you now have the proportion of times that it occurred. Now imagine that \(n\) gets larger and larger. Our intuition says that the proportion of times that the event occurs will stabilize at some number between 0 and 1. This number is the probability that the event occurs.
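To see this stabilization in action, here is a minimal sketch using R's sample() function (which we will put to systematic use shortly) to flip a fair coin a million times:

flips <- sample(c("H", "T"), size = 1000000, replace = TRUE)
mean(flips == "H")   # the proportion of heads; lands very close to 0.5

Run the code a few times: the proportion of heads changes from run to run, but it always stays very near 0.5, the probability of heads.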

We will also concern ourselves with random variables. A random variable is simply a number whose value is determined by the outcome of a chance process. The amount that the insurance company will pay out in the next year is an example of a random variable: its value depends on a complex chance process—how many homeowners experience a flood, how damaging each flood is, etc.

When it comes to random variables, we are often interested in what it might turn out to be on average. That is, suppose we could repeat the random process many, many times—say \(n\) times, where \(n\) is some large number. (Suppose that we could observe the insurance company for many, many years, with each year being the same in terms of how many people are insured, how much their houses are worth, what the climate is like, and so on.) Each time the process is repeated, we get a value for the random variable. We end up with \(n\) values. Now add up these values and divide by \(n\). We have computed their average—the mean, as it is properly called. Now imagine that \(n\) could be larger and larger, without bound.

The expected value of the random variable is the number that this average converges to.

In other words, the expected value of a random variable is what we expect to get on average, in the long run, if we could repeat the random process that produces the value many, many times.
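Again, a quick sketch may help. The expected value of a single roll of a fair die is \((1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5\), and the average of many simulated rolls should come out close to that:

rolls <- sample(1:6, size = 1000000, replace = TRUE)
mean(rolls)   # the average of a million die rolls; lands very close to 3.5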

6.2 Monte Carlo Simulation

Our definitions of probability and expected value both involved a limiting notion, namely: what would happen if you could somehow repeat the random process more and more times, without a bound on the number of repetitions. Accordingly, even if we find that we are unable to compute a probability or an expected value exactly with mathematics, we can still attempt to estimate it by making the computer repeat the random experiment many times, keeping track of the result of the experiment each time. This technique is known as Monte Carlo simulation, after the famous Monte Carlo casino in the Principality of Monaco.

In this section we will employ Monte Carlo simulation to estimate probability and expected value in a couple of simple examples.

6.2.1 Estimating a Probability

Consider a box that holds ten tickets. Three of them are labeled with the letter “a”; the rest are labeled “b”:

tickets <- c(rep("a", 3), rep("b", 7))
tickets
##  [1] "a" "a" "a" "b" "b" "b" "b" "b" "b" "b"

We plan to draw one of these tickets at random from the box, with each ticket having the same chance to be the ticket selected. Three of the ten tickets are a’s, so our intuition says that the probability of selecting an “a” ticket is 3 out of 10, or 0.3. Let’s use Monte Carlo simulation to estimate the probability, and see if we get something close to 0.3.

In order to set up the simulation, we need a device for repeating the random process as many times as we would like. At each repetition, the outcome of the chance process should not depend on previous outcomes. We can accomplish this by using R's sample() function on the tickets variable. For example, if we would like to repeat the process of selecting a ticket twenty times, we could write the following code:

sims <- sample(tickets, size = 20, replace = TRUE)
sims
##  [1] "a" "b" "a" "b" "b" "b" "b" "a" "b" "a" "b" "a" "b" "a" "b" "b" "a" "b" "a" "a"

Notice that we set replace to TRUE. This was important, since it guarantees that at each selection there are three “a”’s and seven “b”’s in the box, keeping the chance of getting an “a”-ticket the same each time.
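To see why replacement matters, consider the alternative: with replace = FALSE, each selected ticket would stay out of the box, changing the chances on every subsequent draw—and R cannot draw more tickets than the box contains:

sample(tickets, size = 10, replace = FALSE)   # merely shuffles the ten tickets
# sample(tickets, size = 20, replace = FALSE) # error: can't draw 20 from a box of 10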

We can summarize the results in a table using R’s table() function:

table(sims)
## sims
##  a  b 
##  9 11

Of course we don’t care so much about the number of times we got an “a”-ticket. Instead we care about the proportion of times we did so. In order to convert the numbers in a table to proportions we can make the table an argument for the function prop.table():

prop.table(table(sims))
## sims
##    a    b 
## 0.45 0.55

Based on twenty repetitions of the random process, our estimate of the chance of an “a”-ticket is 0.45. That’s not so close to the true chance of 0.3. In order to obtain a more accurate estimate, we should increase the number of repetitions of the random process. Let’s try again, with ten thousand repetitions:

sims <- sample(tickets, size = 10000, replace = TRUE)
prop.table(table(sims))
## sims
##      a      b 
## 0.2952 0.7048

Our estimate is now a lot closer to the actual value of the probability.

6.2.2 Estimating an Expected Value

Imagine that you are about to play the following game: you will flip a fair coin twice.

  • If you get Tails both times, you lose a dollar.
  • If you get exactly one Head, nothing happens.
  • If you get Heads both times, you win two dollars.

Let \(W\) be the number of dollars you will win.

\(W\) is clearly a random variable: it’s a number whose value—either -1, 0, or 2—depends on the outcome of the random process of flipping the fair coin twice. What is the expected value of \(W\)?

If we think about the probabilities involved then we can come up with a candidate for the expected value. When you flip a fair coin twice, there are four equally likely outcomes:

  • Tails and then Tails (\(W = -1\))
  • Tails and then Heads (\(W = 0\))
  • Heads and then Tails (\(W = 0\))
  • Heads and then Heads (\(W = 2\))

Hence you have:

  • The chance that \(W=-1\) is 0.25.
  • The chance that \(W=0\) is 0.50.
  • The chance that \(W=2\) is 0.25.

Hence the expected value should be the weighted average:

\[0.25 \times (-1) + 0.50 \times 0 + 0.25 \times 2.\] This works out to 0.25, or 25 cents.

We would like to see if Monte Carlo simulation can render an estimate that is close to this value.

For a simple game like ours, where the random variable has only a few possible values, sample() is still a convenient tool for simulating plays of the game. We only need to provide a vector of possible values for \(W\) to sample from, and a vector of probabilities for obtaining each of those possible values. The following code illustrates how we might simulate playing the game ten times:

w <- c(-1, 0, 2)
pw <- c(0.25, 0.50, 0.25)
winnings <- sample(w, size = 10, replace = TRUE, prob = pw)
winnings
##  [1] -1  2  0  2  0  0  0  0 -1  2

The Monte Carlo estimate of the expected value of \(W\) is just the average of the winnings in our simulations:

mean(winnings)
## [1] 0.4

Of course, with such a small number of repetitions of the game we cannot expect the Monte Carlo estimate to be very accurate. Let’s try again, but with ten thousand repetitions:

winnings <- sample(w, size = 10000, replace = TRUE, prob = pw)
mean(winnings)
## [1] 0.2435

Yes, that’s much closer to our mathematically-computed value!

By the way, R can easily compute the weighted average itself:

sum(w*pw)
## [1] 0.25

6.2.3 The Law of Large Numbers

In our Monte Carlo simulations so far, we have seen that the more times we repeat the underlying random process, the closer our estimate is likely to be to the actual value, no matter whether we were estimating the probability of an event or an expected value for a random variable.

This is no accident: in fact, it is a consequence of a theorem in the subject of probability that is known as the Law of Large Numbers. We do not possess the mathematical machinery necessary to state the Law precisely—much less prove it—but we can take the rough statement given here as an assurance that the more repetitions we include in our simulation, the more accurate the resulting estimate is liable to be.

6.2.4 Practice Exercises

  1. Consider the following game: you roll a fair die. If you get one or two spots, then you lose a dollar. If you get three, four, or five spots then nothing happens. If you get six spots then you win three dollars. You would like to know how much money you can expect to win per play of the game, on average.

    Write a function called diceGameSim() that will help you answer the question. The function should take the following parameter:

    • reps: the number of times to repeat the simulation. The default value should be 10000.

    The function should return an estimate of the expected value of the game.

  2. You command a Klingon vessel, and you are shooting photon torpedoes at the Starship USS Enterprise. Every torpedo you fire has a strength \(s\) that is a random real number between 0 and 12. If the strength of the torpedo is less than or equal to 5, then the Enterprise’s shields will ward off any damage. However, if the strength \(s\) exceeds 5, then the damage inflicted on the Enterprise will be \(s-5\) units. What is the expected amount of damage done to the Enterprise, per shot? Write a function called photonTorpedoSim() that estimates the expected damage. The function should take the following parameter:

    • reps: the number of times to repeat the process of firing at the Enterprise. The default value should be 10000.

    The function should return the estimated expected value.

6.2.5 Solutions to the Practice Exercises

  1. Here’s the code:

    diceGameSim <- function(reps = 10000) {
      w <- c(-1, 0, 3)
      pw <- c(2/6, 3/6, 1/6)
      winnings <- sample(w, size = reps, 
                         replace = TRUE,
                         prob = pw)
      mean(winnings)
    }

    Let’s try it out:

    set.seed(3838)
    diceGameSim(reps = 100000)
    ## [1] 0.16548

    Our estimate, based on one hundred thousand repetitions of the game, is that you can expect to win about 16.55 cents per game. (In fact the exact expected value is \(1/6\) of a dollar.)
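    Just as we did for the two-coin game in Section 6.2.2, we can confirm the exact value by having R compute the weighted average of the winnings:

    w <- c(-1, 0, 3)
    pw <- c(2/6, 3/6, 1/6)
    sum(w * pw)
    ## [1] 0.1666667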

  2. Here’s the code, if you use the pmax() function:

    photonTorpedoSim <- function(reps = 10000) {
      strengths <- runif(reps, min = 0, max = 12)
      damage <- pmax(0, strengths - 5)
      mean(damage)
    }

    Let’s try it out:

    set.seed(3838)
    photonTorpedoSim(reps = 100000)
    ## [1] 2.043227

    You can expect to do about 2.04 points worth of damage per shot. (In fact the exact expected value is \(49/24 \approx 2.04167\).)
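    For the curious, the exact value comes from averaging the damage \(s - 5\) over the uniform distribution of \(s\) on \([0, 12]\)—a calculus step that the exercise itself does not require:

    \[E(\text{damage}) = \int_{5}^{12} (s - 5) \cdot \frac{1}{12} \, ds = \frac{1}{12} \cdot \frac{7^2}{2} = \frac{49}{24}.\]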

    Here’s the code if you used a loop instead:

    photonTorpedoSim <- function(reps = 10000) {
      damage <- numeric(reps)
      for (i in 1:reps) {
        strength <- runif(1, min = 0, max = 12)
        if (strength > 5) {
          damage[i] <- strength - 5
        } else {
          damage[i] <- 0
        }
      }
      mean(damage)
    }

6.3 Example: Chance of a Triangle

In the examples considered so far, intuition or a simple calculation suggests what the exact value of the probability or expected value should be. The true power of the Monte Carlo simulation method shines forth in situations where it is difficult or impossible to compute a value mathematically.

Let’s revisit an example from Section 4.2.4, where we considered whether or not three segments could be put together to form a triangle. We developed the following function to determine, from the lengths of each segment, whether or not a triangle is possible:

isTriangle <- function(x, y, z) {
  (x + y > z) & (x + z > y) & (y + z > x)
}

Recall also that this function applied to vectors of any length, so we can use it to decide about many triples of segments at once.

Let’s now inject some chance variation into the situation. Suppose that you have a stick that is one unit in length: one foot, one meter, one yard—whatever. You plan to break the stick at two randomly selected points. Breaking the stick will yield three pieces. You wonder: what is the chance that these three pieces can form a triangle?

In an advanced probability course you can show that the probability of a triangle is exactly \(1/4\). Here we would like to estimate the probability with Monte Carlo simulation.
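(If you are curious about where \(1/4\) comes from, the key observation is that three pieces totaling 1 unit form a triangle exactly when every piece is shorter than \(1/2\): for instance, the requirement \(x + y > z\) is the same as \(1 - z > z\), that is, \(z < 1/2\). It turns out that the pairs of break-points satisfying all three such conditions make up exactly one quarter of all possibilities.)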

In order to carry out the simulation, we will need to develop code that produces a pair of break-points, any number of times that we like—let’s think about ten times, to start.

This is not so difficult if we use the runif() function:

x <- runif(10)  # the first breaks
y <- runif(10)  # the second breaks

Next we must compute—for each pair of breaks—the lengths of the three segments that are produced. If we knew that the x break was always less than the y break, this would be easy:

  • the leftmost piece would be x units long;
  • the middle piece would be y minus x units long;
  • the rightmost piece would be 1 minus y units long.

The problem is that a given element of the vector y can easily be less than the corresponding element of the vector x. When that happens, the pieces won’t be as described in the bulleted items above.

We can solve this problem with the pmin() and pmax() functions that were introduced in Section 2.7.1.5:

a <- pmin(x, y)
b <- pmax(x, y)
side1 <- a      # leftmost
side2 <- b - a  # middle
side3 <- 1 - b  # rightmost

It is useful to write a helper-function that starts from the breaks, computes the sides, and then decides whether they form a triangle:

makesTriangle <- function(x, y) {
  a <- pmin(x, y)
  b <- pmax(x, y)
  side1 <- a
  side2 <- b - a
  side3 <- 1 - b
  isTriangle(x = side1, y = side2, z = side3)
}

Let’s try it out on our ten repetitions:

triangle <- makesTriangle(x, y)

Table 6.1 shows the results. Based on our ten repetitions, we would estimate the probability of a triangle as \(4/10\).

Table 6.1: Results of ten repetitions of breaking a unit length at two random points to form three segments.
        x         y     side1     side2     side3  triangle
0.6469028 0.7644140 0.6469028 0.1175112 0.2355860     FALSE
0.3942258 0.7438358 0.3942258 0.3496100 0.2561642      TRUE
0.6185018 0.8261657 0.6185018 0.2076639 0.1738343     FALSE
0.4768911 0.4227291 0.4227291 0.0541621 0.5231089     FALSE
0.1360972 0.4092877 0.1360972 0.2731905 0.5907123     FALSE
0.0673844 0.5396926 0.0673844 0.4723082 0.4603074      TRUE
0.1291526 0.9607224 0.1291526 0.8315698 0.0392776     FALSE
0.3931179 0.6535573 0.3931179 0.2604394 0.3464427      TRUE
0.0025827 0.5467153 0.0025827 0.5441326 0.4532847     FALSE
0.6202060 0.2660636 0.2660636 0.3541424 0.3797940      TRUE

Interestingly, R can compute the proportion of times that a triangle was formed, directly from the logical vector triangle. Look at the following code:

sum(triangle)
## [1] 4
mean(triangle)
## [1] 0.4

When we ask R to add the elements of a logical vector, it coerces the Boolean values to numbers: TRUE becomes 1 and FALSE becomes 0. Summing the numbers is then equivalent to counting how many times TRUE appeared in triangle. Taking the mean is equivalent to computing the proportion of TRUEs in triangle.
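Here is the coercion at work on a small vector:

bools <- c(TRUE, FALSE, TRUE, TRUE)
sum(bools)    # 3, the number of TRUEs
mean(bools)   # 0.75, the proportion of TRUEs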

We would like, of course, to simulate breaking the unit length many times, and we would like to be able to choose easily—without a lot of fuss in coding—the number of repetitions. It makes sense, therefore, to write a simulation function that will do our work for us:

triangleSim <- function(reps = 10000 ) {
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

Now that we think about it, it might also be nice to have the option to view a table of the results, along with the estimate of the probability:

triangleSim <- function(reps = 10000, table = FALSE ) {
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  if ( table ) {
    cat("Here is a table of the results:\n")
    print(table(triangle))
    cat("\n")
  }
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

Let’s give our function a try, using the default of ten thousand repetitions and asking for a table of results:

triangleSim(table = TRUE)
## Here is a table of the results:
## triangle
## FALSE  TRUE 
##  7466  2534 
## 
## The proportion of triangles was 0.2534.

As we would expect, our estimate of the probability of a triangle is pretty close to \(1/4\), the value that is known to experts in probability.

6.3.1 Setting a Seed

There are further possibilities for refining the simulation function. We know that it can be a good idea to set a seed for R’s random-number generator. When you do this you still get random-looking results, but they will be the same results no matter how often you call the simulation function. In that way others who have access to your function can “reproduce” the results you got, thus assuring themselves that you weren’t making anything up. Let’s add a seed parameter to our simulation function:

triangleSim <- function(reps = 10000, table = FALSE, seed ) {
  set.seed(seed)
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(triangle))
    cat("\n")
  }
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

This is all very well, but as the function now stands you must supply a seed with every call; if you want the function to produce different results each time, you have to invent a new seed yourself. It would be nice to have the option to call the function without providing a seed at all.

This is not difficult to accomplish. We’ll set the default value for seed to be NULL. Then when the function begins we’ll check the value of seed. If it is not NULL, then the user has provided a seed, and we’ll use that. Otherwise, we’ll let R grab pseudo-random numbers however it likes.

triangleSim <- function(reps = 10000, table = FALSE, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(triangle))
    cat("\n")
  }
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

6.3.2 Program Development

Think about how we developed the triangleSim() function.

  1. We began by writing simple code that got us a simulation for a few repetitions.
  2. When we were sure how to make a simulation work, we encapsulated our code into a function that we could call to perform the simulation for us.
  3. We thought about features that we would like the function to have—user chooses number of repetitions, user chooses whether to see a table, etc.—and implemented these features one by one. This made the function more general, i.e., useful in a wider range of settings.

The above method of program development is called encapsulation and generalization. For small projects of the sort that we have on hand, it’s a good way to go about writing a computer program.

6.3.3 Number of Repetitions

The Law of Large Numbers says that the more times we repeat the simulation, the better our estimate of the desired probability—or expected value—is liable to be. Let’s use the Stick-Splitting Problem to investigate the effect of the choice of repetitions on the accuracy of our estimate of the probability of getting a triangle.

We will need to run the triangleSim() function several times, with a different choice for reps each time, keeping track of the estimates we obtain. In the process we won’t need tables or other output to the console. Let’s rewrite triangleSim() for maximum flexibility, making console output optional and using the invisible return discussed in Section 4.2.3.

triangleSim <- function(reps = 10000, table = FALSE, seed = NULL,
                        verbose = TRUE) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(triangle))
    cat("\n")
  }
  if ( verbose ) {
    cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
  }
  invisible(mean(triangle))
}

Next, we make a vector of repetitions, and a vector to hold our estimates:

reps <- c(100, 1000, 10000, 100000, 1000000)
estimates <- numeric(length(reps))

Next we loop through reps, performing a complete simulation for each repetition-number and storing our results:

for ( i in 1:length(reps)) {
  estimates[i] <- triangleSim(reps = reps[i], seed = 3030,
                             verbose = FALSE)
}

Let’s have a look at our estimates:

names(estimates) <- reps
estimates
##      100     1000    10000    1e+05    1e+06 
## 0.200000 0.245000 0.254700 0.251570 0.249779

The estimates are indeed getting closer to the theoretical value of \(1/4\).

6.3.4 Practice Exercises

  1. Rewrite the diceGameSim() function from the previous section so that, just as we did with the triangleSim() function in this section:

    • it allows the user to set a seed;
    • it returns the estimated expected value invisibly, and has a verbose option to cat the estimate out to the Console.
  2. Rewrite the photonTorpedoSim() function from the previous section so that, just as we did with the triangleSim() function in this section:

    • it allows the user to set a seed;
    • it returns the estimated expected value invisibly, and has a verbose option to cat the estimate out to the Console.
  3. Extend the photonTorpedoSim() just a bit more, by adding two new parameters:

    • shield: the level at which the Enterprise’s shields are set. (If the strength of the torpedo is less than shield, then there is zero damage.) The default value of shield should be 5.
    • upper: the upper bound on the strength of a torpedo-shot. (That is, a torpedo shot will be a random real number from 0 to upper.) The default value of upper should be 10.

    The program should stop with a helpful message if the user sets upper or shield to a negative number.

  4. Abe and Bo engage in a duel. They take turns shooting at each other until one person manages to hit the other, thus winning the duel. Abe has a 40 percent chance of hitting with each shot he takes, whereas Bo has a 60 percent chance of hitting on each of his shots. Since Abe is the weaker shooter, Bo generously permits Abe to go first. We would like to estimate Abe’s chance of winning the duel. Write a function called dual() to estimate this probability. The function should have two parameters:

    • reps: the number of times to simulate a duel. The default should be 10,000.
    • seed: a seed that the user can set. The default value should be NULL.

6.3.5 Solutions to Practice Exercises

  1. Here’s the code:

    diceGameSim <- function(reps, seed = NULL, verbose = TRUE) {
      if (!is.null(seed) ) {
        set.seed(seed)
      }
      w <- c(-1, 0, 3)
      pw <- c(2/6, 3/6, 1/6)
      winnings <- sample(w, size = reps, 
                         replace = TRUE,
                         prob = pw)
      if ( verbose ) {
        cat("The expected amount to win is about ",
            mean(winnings), ".\n", sep = "")
      }
      invisible(mean(winnings))
    }

    Let’s try it out:

    diceGameSim(reps = 100000, seed = 5252)
    ## The expected amount to win is about 0.17503.
  2. Here’s the code, if you use the pmax() function:

    photonTorpedoSim <- function(reps, seed = NULL, verbose = TRUE) {
      if (!is.null(seed) ) {
        set.seed(seed)
      }
      strengths <- runif(reps, min = 0, max = 12)
      damage <- pmax(0, strengths - 5)
      if ( verbose ) {
        cat("The expected damage is about ",
            mean(damage), " units.\n", sep = "")
      }
      invisible(mean(damage))
    }

    Let’s try it out:

    photonTorpedoSim(reps = 100000, seed = 4040)
    ## The expected damage is about 2.034732 units.
  3. Here’s the code:

    photonTorpedoSim <- function(reps, seed = NULL, verbose = TRUE,
                                 shield = 5, upper = 10) {
      if (!is.null(seed) ) {
        set.seed(seed)
      }
      if ( upper < 0 ) {
        return(cat("Upper bound on torpedo strength cannot be a negative number!"))
      }
      if ( shield < 0 ) {
        return(cat("Shield strength cannot be set to a negative number!"))
      }
      strengths <- runif(reps, min = 0, max = upper)
      damage <- pmax(0, strengths - shield)
      if ( verbose ) {
        cat("The expected damage is about ",
            mean(damage), " units.\n", sep = "")
      }
      invisible(mean(damage))
    }

    Let’s try it out:

    photonTorpedoSim(reps = 100000, seed = 4040,
                     upper = 15, shield = 12)
    ## The expected damage is about 0.2991273 units.
  4. Here’s the function:

    dual <- function(reps = 10000, seed = NULL) {
      if (!is.null(seed)) {
        set.seed(seed)
      }
      # set up logical vector to store results of each simulation:
      abeWins <- logical(reps)
      for (i in 1:reps) {
        dualing <- TRUE
        shotNumber <- 1
        while (dualing) {
          # if i is an odd number, then Abe is shooting:
          abeToShoot <- shotNumber %% 2 == 1
          # probability of hitting depends on who is shooting:
          hitProb <- ifelse(abeToShoot, 0.4, 0.6)
          # shooter takes a shot:
          resultOfShot <- sample(c("hit", "miss"),
                                 size = 1, 
                                 prob = c(hitProb, 1 - hitProb))
          if (resultOfShot == "hit") {
            abeWins[i] <- ifelse(abeToShoot, TRUE, FALSE)
            dualing <- FALSE
          }
          shotNumber <- shotNumber + 1
        }
      }
      # count up how many times abe won, divide by number of plays,
      # and return:
      sum(abeWins)/reps
    }

    Let’s try it out:

    dual(reps = 100000, seed = 3939)
    ## [1] 0.52441

    Looks like Abe has a slight advantage!

6.4 Example: Will They Connect?

Anna and Raj make a date for coffee tomorrow at the local Coffee Shop. After making the date, both of them forget the exact time they agreed to meet: they can only remember that it was to be sometime between 10 and 11am. Each person, independently of the other, randomly picks a time between 10 and 11 to arrive. If Anna arrives and Raj is not there, she will wait up to ten minutes for him, but will leave if he does not show within that time period. Raj is similarly disposed: he will wait ten minutes—but no more—for Anna. What is the chance that they meet?

The key to designing a simulation procedure is to realize that Anna and Raj will connect if and only if the difference between their arrival times is less than 10 minutes. It doesn’t matter who arrived first, so long as the difference is less than 10. This means that we want to check on the absolute value of Anna’s arrival time minus Raj’s arrival time. The following code implements this idea:

meetupSim <- function(reps = 10000, table = FALSE, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  anna <- runif(reps, 0, 60)
  raj <- runif(reps, 0, 60)
  connect <- (abs(anna - raj) < 10)
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(connect))
    cat("\n")
  }
  cat("The proportion of times they met was ", mean(connect), ".\n", sep = "")
}

Let’s try it out:

meetupSim(reps = 100000, table = TRUE, seed = 3939)
## Here is a table of the results:
## 
## connect
## FALSE  TRUE 
## 69781 30219 
## 
## The proportion of times they met was 0.30219.
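By the way, this is one of the situations where the exact answer can be worked out: the pair of arrival times is a random point in a 60-by-60 square, and Anna and Raj meet exactly when that point lands in the band where the two times differ by less than 10 minutes. A little geometry shows that this band takes up

\[\frac{60^2 - 50^2}{60^2} = \frac{1100}{3600} = \frac{11}{36} \approx 0.3056\]

of the square, which agrees well with our simulation estimate.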

6.4.1 Vectorization vs. Looping

Whenever you perform a simulation you have the option to use a for-loop, running through the loop once for each repetition of the random process in question. For example, you could rewrite the meetup-simulation as follows:

meetupSim2 <- function(reps = 10000, table = FALSE, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  # create an empty logical vector to hold the results:
  connect <- logical(reps)
  # loop through:
  for ( i in 1:reps ) {
    # get one arrival time for anna:
    anna <- runif(1, 0, 60)
    # and one for raj:
    raj <- runif(1, 0, 60)
    # compute result and record in connect:
    connect[i] <- (abs(anna - raj) < 10)
  }
  #the rest is the same as in meetupSim:
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(connect))
    cat("\n")
  }
  cat("The proportion of times they met was ", mean(connect), ".\n", sep = "")
}

Be aware, though, that when you use a loop in place of vectorization your routine is liable to run more slowly. We can measure the difference with the R-function system.time(). As an illustration, let’s compare the running times for the vectorized and the looping versions of the meetup simulation.

First we get the time for ten thousand repetitions of the meetup, using the original vectorized function:

system.time(meetupSim(reps = 10000, seed = 4040))
## The proportion of times they met was 0.3079.
##   user  system elapsed 
##  0.002   0.001   0.003 

Our concern is with the total elapsed time: a mere three-thousandths of a second!

Next we get the time for the same number of repetitions, using the looping implementation:

system.time(meetupSim2(reps = 10000, seed = 4040))
## The proportion of times they met was 0.2993.
##    user  system elapsed 
##  0.056   0.007   0.065 

Here the elapsed time is 0.065 seconds. This still seems pretty fast, but it’s more than 20 times as long as for the vectorized simulation. For more complex simulations such a dramatic slowdown could pose serious practical problems. (Note that the two estimates differ even though we used the same seed: the looping version draws its random numbers in a different order, so it consumes the random-number stream differently.)

When performance is an issue, prefer vectorization to looping as much as possible.

6.5 Example: the Appeals Court Paradox

The following example is discussed in (J. Nahin 2008).

An appeals court generally consists of an odd number of justices, so the court will not be hampered with tie-votes and will thus always be able to render a decision for any case brought before it. Let’s imagine an appeals court that has five members. Let us further imagine that for each case before the court, each of the justices, independently of the others, makes a decision for either one or the other of the two opposing parties. Assume also for the sake of simplicity that there is always one side that is “in the right.” It follows that each judge is making a decision that is either correct or incorrect. The judges then report their decisions to one another, and the decision of the court is determined by majority vote. To be precise: whichever side gets three or more votes is the side that wins.

Although all of the members of the court are pretty sharp, they differ somewhat in their legal abilities:

  • In any case that comes before her, Judge A has a 95% chance to decide correctly.
  • Judge B has a 94% chance to judge correctly.
  • Judges C and D each have a 90% chance to judge correctly.
  • Judge E is the weak link on the court, with a mere 80% chance to judge correctly.

We are interested in estimating the probability that the majority opinion of the court will be correct.

Before we write a simulation in full, we should consider how we are going to simulate something that has a specified percentage chance of happening. For example, how would we simulate decisions made by Judge A?

One approach is to use runif() with a cutoff:

number <- runif(1) # random number between 0 and 1
number < 0.95
## [1] FALSE

In repeated trials, a random real number between 0 and 1 will turn out to be less than 0.95 about 95% of the time.
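A quick simulation bears this out:

numbers <- runif(100000)
mean(numbers < 0.95)   # the proportion below the cutoff; very close to 0.95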

Another—and more convenient—way is to use the function rbinom(). This function simulates the results of flipping a coin a number of times and counting how many heads one gets. The general form of a call to rbinom() is as follows:

rbinom(n, size, prob)

In this call:

  • size is how many times you plan to flip the coin;
  • prob is the chance on any flip that the coin will turn up heads;
  • n is how many times you plan to repeat the process of flipping the coin size times (counting up the number of heads each time).

In order to simulate flipping a fair coin 100 times, you could ask for:

rbinom(n = 1, size = 100, prob = 0.5)
## [1] 52

It seems that we got 52 heads!

If you want to go through the above process twenty times instead of just doing it once, you could ask for:

rbinom(n = 20, size = 100, prob = 0.5)
##  [1] 52 49 45 57 53 60 46 53 50 54 59 47 42 45 41 56 47 57 56 49

Nothing in rbinom() says that it addresses only coin-flipping. In general it serves as a model for any situation in which:

  • there is a fixed number of trials (e.g., flip of a coin, taking a free-throw shot, deciding about a court case);
  • each trial has two possible outcomes, often termed Success or Failure (e.g., coins lands heads or tails, you make the shot or you don’t, you decide the case correctly or you don’t);
  • the chance of the Success outcome is the same, on any trial (e.g., the coin always has the same chance of coming up heads, for any case Judge A has a 95% chance of being right);
  • the outcome of any trial is independent of the outcome of all other trials (e.g., making the first free throw does not increase or decrease one’s chances of making the next free throw, etc.)
  • you are counting up the number of successes in your trials.

In such a situation the count of the number of successes is called a binomial random variable. The size parameter in rbinom() gives the fixed number of trials. The prob parameter specifies the chance of success on each trial.

In this example we would like to simulate each judge deciding about each case. Suppose, for example, that there are twenty cases. We would like to know, for each judge and each case, whether or not the judge was correct.

Suppose we were to watch Judge A making 20 decisions, recording each time whether she was correct or not. In order to simulate the results with R, we could ask for:

rbinom(n = 20, size = 1, prob = 0.95)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1

In the above results, each 1 stands for a correct decision and the 0 stands for an incorrect decision.

Note that the following call would be incorrect:

rbinom(n = 1, size = 20, prob = 0.95)

This would give us a count of the number of times Judge A was correct, but it would not tell us which of the cases she judged correctly and which she judged incorrectly.

Assuming that the court hears 20 cases, we would like to simulate the decision of each judge in each case. We could store the results in variables, as follows:

a <- rbinom(n = 20, size = 1, prob = 0.95)  # Judge A
b <- rbinom(n = 20, size = 1, prob = 0.94)  # Judge B
c <- rbinom(n = 20, size = 1, prob = 0.90)  # Judge C
d <- rbinom(n = 20, size = 1, prob = 0.90)  # Judge D
e <- rbinom(n = 20, size = 1, prob = 0.80)  # Judge E

Now we come to the key idea in coding our simulation: in order to determine the total number of correct votes in each case, all we need to do is add the five vectors we created above:

correctVotes <- a + b + c + d + e

The court will decide correctly when the number of votes for the correct option is at least 3. The following logical vector records this:

courtCorrect <- (correctVotes >= 3)

Table 6.2 shows some possible results:

Table 6.2: Results of 20 simulated appeals court decisions.
A  B  C  D  E  Number Correct  Court Correct
1  1  1  0  0               3           TRUE
1  1  1  0  1               4           TRUE
1  1  1  1  1               5           TRUE
1  1  1  1  1               5           TRUE
1  1  1  1  1               5           TRUE
1  1  1  0  1               4           TRUE
1  1  1  1  1               5           TRUE
1  1  1  1  0               4           TRUE
1  1  1  1  1               5           TRUE
1  1  1  1  1               5           TRUE
1  1  1  0  1               4           TRUE
1  1  0  1  0               3           TRUE
1  1  1  1  1               5           TRUE
1  1  1  0  1               4           TRUE
0  1  1  1  1               4           TRUE
1  1  1  1  1               5           TRUE
1  1  1  1  1               5           TRUE
1  1  1  1  1               5           TRUE
1  1  1  1  0               4           TRUE
1  1  0  1  1               4           TRUE

Note that all of the court’s decisions were correct, even though in a couple of cases the correct decisions were barely in the majority.

We now have the basic idea of the simulation. In order to estimate the probability of a correct decision, we simply need to recast our idea in the form of a function that permits us to simulate a very large number of imaginary court cases and that will report the results to us. Here is the code for such a function.

courtSim <- function(reps = 10000,
                     seed = NULL,
                     table = FALSE,
                     probs = c(0.95, 0.94, 0.90, 0.90, 0.80)) {
  
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  
  # get the probabilities
  aProb <- probs[1]
  bProb <- probs[2]
  cProb <- probs[3]
  dProb <- probs[4]
  eProb <- probs[5]
  
  # simulate decisions of each judge:                 
  a <- rbinom(n = reps, size = 1, prob = aProb)
  b <- rbinom(n = reps, size = 1, prob = bProb)
  c <- rbinom(n = reps, size = 1, prob = cProb)
  d <- rbinom(n = reps, size = 1, prob = dProb)
  e <- rbinom(n = reps, size = 1, prob = eProb)
  
  # count the number of correct votes in each case:
  correctVotes <- a + b + c + d + e
  
  # determine whether court decided correctly, in each case:
  courtCorrect <- (correctVotes >= 3)
  
  # record results
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(courtCorrect))
    cat("\n")
  }
  cat("The proportion of times the court was correct was ", 
      mean(courtCorrect), ".\n", sep = "")
}

Let’s now estimate the probability of the court rendering a correct verdict, using one hundred thousand simulated cases:

courtSim(reps = 100000, seed = 3838, table = TRUE)
## Here is a table of the results:
## 
## courtCorrect
## FALSE  TRUE 
##   787 99213 
## 
## The proportion of times the court was correct was 0.99213.

The court seems to be doing quite well! But of course this is not very surprising: after all, most of the judges make the correct decision almost all the time. It is interesting, though, that the chance of the full court rendering a correct verdict is higher than any individual judge’s chance of being correct. There appears to be a benefit to the voting procedure.

But things get even more interesting if we imagine that the judges recognize, after some time, that Judge E simply isn’t as sharp as Judge A, and that they pressure him into voting whichever way Judge A votes. “Surely,” they reason, “since Judge A is so good this new policy will increase our chance to hand down a correct verdict!”

We will simulate to see if the judges’ reasoning is correct. All we need to do is to count Judge A’s vote twice in the sum, and not count Judge E’s vote at all:

correctVotes <- 2*a + b + c + d

Here is the new simulation-function:

courtSim2 <- function(reps = 10000,
                     seed = NULL,
                     table = FALSE,
                     probs = c(0.95, 0.94, 0.90, 0.90, 0.80)) {
  
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  
  # get the probabilities
  aProb <- probs[1]
  bProb <- probs[2]
  cProb <- probs[3]
  dProb <- probs[4]
  eProb <- probs[5]
  
  # simulate decisions (this time, no need for Judge E):
  a <- rbinom(n = reps, size = 1, prob = aProb)
  b <- rbinom(n = reps, size = 1, prob = bProb)
  c <- rbinom(n = reps, size = 1, prob = cProb)
  d <- rbinom(n = reps, size = 1, prob = dProb)
  
  # count the number of correct votes in each case:
  correctVotes <- 2*a + b + c + d
  
  # determine whether court decided correctly, in each case:
  courtCorrect <- (correctVotes >= 3)
  
  # record results
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(courtCorrect))
    cat("\n")
  }
  cat("The proportion of times the court was correct was ", 
      mean(courtCorrect), ".\n", sep = "")
}

Let’s try it out:

courtSim2(reps = 100000, seed = 3838, table = TRUE)
## Here is a table of the results:
## 
## courtCorrect
## FALSE  TRUE 
##  1221 98779 
## 
## The proportion of times the court was correct was 0.98779.

Hey! That’s a lower chance of success than before! The difference is small but significant: compelling the weakest judge to vote with the strongest judge actually decreased the court’s chance of rendering a correct verdict overall. This circumstance is sometimes called the Appeals Court Paradox, but it occurs in many other practical situations. More often than you might think, the benefit of independent voting can outweigh the advantage associated with relying solely on a small number of “experts.”
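If you would like to reassure yourself that the difference is real and not an accident of simulation error, note that the exact probabilities are within reach of brute force: there are only \(2^5 = 32\) possible patterns of correct and incorrect votes, so we can compute the probability of each pattern and add up the patterns in which the court decides correctly. Here is one way to do it (a sketch; the variable names are ours, chosen just for this check):

probs <- c(0.95, 0.94, 0.90, 0.90, 0.80)
# all 32 patterns of correct (1) and incorrect (0) votes by the five judges:
patterns <- expand.grid(rep(list(c(0, 1)), 5))
# the probability of each pattern (the judges vote independently):
patternProb <- apply(patterns, 1, function(v) prod(ifelse(v == 1, probs, 1 - probs)))
# exact chance that at least three of the five votes are correct:
sum(patternProb[rowSums(patterns) >= 3])
## [1] 0.992374
# exact chance when Judge A's vote counts twice and Judge E's not at all:
weightedVotes <- 2 * patterns[[1]] + patterns[[2]] + patterns[[3]] + patterns[[4]]
sum(patternProb[weightedVotes >= 3])
## [1] 0.9875

Both of our simulation estimates came out quite close to these exact values.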

6.5.1 Practice Exercises

  1. Use rbinom() to simulate taking twenty free throws, where you have a 70% chance to make each shot. Your simulation should produce a vector of 20 ones and zeros, where 1 stands for making the shot and 0 stands for missing it.

  2. Use rbinom() to simulate taking twenty free throws, where you have a 70% chance to make each shot, and counting up how many shots you make. Your simulation should produce a vector of length one that represents the number of shots you made.

  3. Every morning for 100 mornings in a row you take twenty free throws. You have a 70% chance to make each shot, and you always count up how many shots you make. Use rbinom() to simulate the counts. Your simulation should produce a vector of length 100 where the first element represents the number of shots you made on day 1, the second element represents the number of shots you made on day 2, etc.

  4. You have a 70% chance to make each free throw shot you take. Your friend Lester has a 60% chance to make each shot. Both of you will take 20 shots. You are interested in knowing the probability that you will make more shots than Lester does. Write a function called freeThrowSim() that will help you estimate this probability. The function should take two parameters:

    • seed, so that the user can set a seed;
    • reps, the number of times to simulate a competition between you and Lester.

    The function should return the estimated probability.

6.5.2 Solutions to the Practice Exercises

  1. Try this:

    rbinom(20, size = 1, prob = 0.70)
    ##  [1] 0 0 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 1 1 1
  2. Try this:

    rbinom(1, size = 20, prob = 0.70)
    ## [1] 14
  3. Try this:

    rbinom(100, size = 20, prob = 0.70)
  4. Here’s the function if you decide to use a loop:

    freeThrowSim <- function(reps, seed = NULL) {
      # set the seed if the user provides one:
      if ( !is.null(seed) ) {
        set.seed(seed)
      }
      youWin <- logical(reps)
      for ( i in 1:reps ) {
        youMake <- rbinom(1, size = 20, prob = 0.70)
        lesterMakes <- rbinom(1, size = 20, prob = 0.60)
        youWin[i] <- (youMake > lesterMakes)
      }
      mean(youWin)
    }

    Here’s the function if you vectorize:

    freeThrowSim <- function(reps, seed = NULL) {
      # set the seed if the user provides one:
      if ( !is.null(seed) ) {
        set.seed(seed)
      }
      youMake <- rbinom(reps, size = 20, prob = 0.70)
      lesterMakes <- rbinom(reps, size = 20, prob = 0.60)
      youWin <- (youMake > lesterMakes)
      mean(youWin)
    }

    Let’s try it:

    freeThrowSim(reps = 100000, seed = 3535)
    ## [1] 0.69132

    You have nearly a 70% chance to make more shots than Lester.

6.6 Example: How Many Numbers Needed?

Let us pause for a review.

So far our examples have involved estimating a probability: the chance for a particular event to occur. We have accomplished this with Monte Carlo simulation:

  • We repeated the chance process a very large number of times.
  • On each repetition, we recorded whether or not the event occurred.
  • After we finished all of the repetitions, we computed the proportion of those repetitions in which the event did occur.
  • This proportion served as our estimate of the probability for the event to occur.
  • We know that the more repetitions we arrange for, the closer our estimate is liable to be to the actual probability for the event to occur.

Let us now switch gears and take up the task of estimating the expected value of a random variable.

Recall that a random variable is a number whose value is the result of some chance process. The expected value of the random variable is the average—the mean—of its values over many, many repetitions of that chance process.

Let us consider the following chance process:

  • Pick a real number \(x_1\) between 0 and 1 at random.
  • Pick another real number \(x_2\) at random (also between 0 and 1). Add it to \(x_1\).
  • If the sum \(x_1 + x_2\) exceeds 1, then stop.
  • Otherwise, pick a third real number \(x_3\), and add it to the sum of the previous two.
  • If the sum \(x_1+x_2+x_3\) exceeds 1, then stop.
  • Otherwise, pick a fourth real number …

The idea is to keep on going in this way until the sum of the numbers you have picked exceeds 1.

The random variable we are interested in is the number of numbers we have to pick to make the sum of those numbers exceed 1. Let us call this random variable \(X\). As you can see, this number could be 2, or 3, or 4, or even more depending on how small the random numbers you pick happen to be.

In order to fully grasp what is going on, it will be helpful to go through the process of computing \(X\). The following function will do this for us:

numberNeeded <- function(verbose = FALSE) {
  mySum <- 0
  count <- 0
  while( mySum < 1 ) {
    number <- runif(1)
    mySum <- mySum + number
    count <- count + 1
    if ( verbose ) {
      cat("Just now picked ", number, ".\n", sep = "")
      cat(count, " number(s) picked so far.\n", sep = "")
      cat("Their sum is ", mySum, ".\n\n", sep = "")
    }
  }
  count
}

Call the function a few times, like this:

numberNeeded(verbose = TRUE)
## Just now picked 0.2869393.
## 1 number(s) picked so far.
## Their sum is 0.2869393.
## 
## Just now picked 0.7092682.
## 2 number(s) picked so far.
## Their sum is 0.9962075.
## 
## Just now picked 0.7524099.
## 3 number(s) picked so far.
## Their sum is 1.748617.
## [1] 3

As you can see, it’s quite common to need 2 or 3 numbers to make the sum exceed 1, but sometimes you need 4, 5, or even more numbers.

Since we are interested in the expected value of \(X\), we should repeat the process a large number of times and compute the mean of the \(X\)-values that we find. The following code repeats the process 1000 times:

needed <- numeric(1000)
for (i in 1:1000 ) {
  needed[i] <- numberNeeded()
}
cat("The expected number needed is about ", 
    mean(needed), ".\n", sep = "")
## The expected number needed is about 2.697.

Of course the expected value indicates what \(X\) will be on average. By itself it doesn’t tell us about the variability in \(X\)—how much \(X\) tends to “hop around” from one repetition to another. In order to get a sense of the variability of \(X\), we can make a table of the results of our 1000 repetitions:

table(needed)
## needed
##   2   3   4   5   6   7 
## 528 303 123  37   8   1

Sure enough, we usually got 2 or 3, but higher values are certainly possible.

For the most part we are less interested in the number of times we got each possible value than we are in the proportion of times we got each possible value. For this all we need is prop.table():

prop.table(table(needed))
## needed
##     2     3     4     5     6     7 
## 0.528 0.303 0.123 0.037 0.008 0.001

The proportions serve as estimates of the probability of getting each of the possible values of \(X\). Such a table of proportions gives us a sense of the distribution of \(X\)—the chances for \(X\) to assume each of its possible values.

Now that we have the basic idea of how to accomplish the simulation, it remains to package up our code into a nice function that will enable a user to set the number of repetitions, set a seed, and choose reporting options. Along the way we will generalize a bit, allowing the user to set a target sum other than 1. We’ll also return our estimate of the expected number needed invisibly, so that the function can be used in the context of larger-scale investigations (see the exercises from this Chapter).

numberNeededSim <- function(target = 1, reps = 1000, 
                            seed = NULL, report = FALSE) {
  
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  
  # define helper function
  numberNeeded <- function(target) {
    mySum <- 0
    count <- 0
    while( mySum < target ) {
      number <- runif(1)
      mySum <- mySum + number
      count <- count + 1
    }
    count
  }
  
  #perform simulation
  needed <- numeric(reps)
  for (i in 1:reps ) {
    needed[i] <- numberNeeded(target)
  }
  
  # report results (if desired)
  if ( report ) {
    print(prop.table(table(needed)))
    cat("\n")
    cat("The expected number needed is about ",
        mean(needed), ".\n", sep = "")
  }
  # return estimate of the expected number needed:
  invisible(mean(needed))
}

Let’s test the function with ten thousand repetitions:

numberNeededSim(target = 1, reps = 10000, seed = 4848, report = TRUE)
## needed
##      2      3      4      5      6      7      8 
## 0.4937 0.3350 0.1313 0.0317 0.0071 0.0010 0.0002 
## 
## The expected number needed is about 2.7273.

Does the estimate of the expected number of numbers look a bit familiar? Mathematicians have proven that the exact expected value—what you would get for an average if you could do more and more repetitions, without limit—is the famous number \(e\):

\[\mathrm{e} \approx 2.71828 \ldots\]

In the exercises you will be asked to investigate the expected value when the target is a number other than 1.
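Before moving on, note that R can produce \(e\) directly via its built-in exponential function, which makes for an easy comparison with our estimate:

exp(1)
## [1] 2.718282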

6.7 Example: Dental Floss

If you are like me, you don’t like to run out of an item that you commonly use, so you always try to have a spare source for the item on hand. Take dental floss, for example. It’s difficult to know when the little box is almost out of floss. Hence I try to keep two boxes on hand in my bathroom drawer: the one I’m using, and a spare. The problem is that I can never tell which one is “the one I’m using” and which one is the spare. In fact on any given day I’m equally likely to pull a length of floss out of either one of the boxes.

Let’s make the following assumptions:

  • A box of floss starts out with 150 feet of floss.
  • I start with two fresh boxes.
  • Every time I floss, each box has a 50% chance of being the box I pull from.
  • When I pull out floss, the length I choose to pull is a random number between 1 and 2 feet. If the box holds less than the amount I choose to pull, then:
    • If the remaining length is more than 1 foot, I pull it all out and use it. (Next time I’ll have to use the other box.)
    • If the remaining length is less than 1 foot, I also pull it all out, but I don’t use it because it’s not long enough. (I’ll have to move on to the other box.)

There are several interesting questions we could ask concerning this situation. In this example, let’s focus on finding the expected value of \(X\), the amount of floss left in the other box when I discover that one of the boxes is exhausted.

I would also like to get some idea of the distribution of \(X\)—the chance for it to assume each of its possible values. Unlike the previous “Number of Numbers” example, however, \(X\) is what we call a continuous random variable: it can assume any of the values in a range of real numbers. After all, the amount left in the “other” box could be anything from 0 to 150 feet: it could be 3.5633 feet or 5.21887 feet, or any real number at all, so long as it’s between 0 and 150. Although any particular value—3.5633 feet, for instance—is possible, the chance of \(X\) being exactly 3.5633 is vanishingly small, and the same goes for every other possible value of \(X\). Hence the distribution of \(X\) cannot be described well by a table of the values that we actually get in a simulation: indeed, we are highly unlikely to get any value more than once in the course of our simulation.

One way to visualize the distribution of a continuous random variable is to make a density plot of a large, representative sample of its values. Let’s practice making a density plot for a random variable that R provides for us: a so-called normal random variable. Try this code:

x <- rnorm(100, mean = 70, sd = 3)

We have just asked R to produce 100 numbers chosen randomly from a normal distribution. The mean parameter sets what the values should be “on average,” and the sd parameter indicates how much they should typically differ from that average value. The bigger the sd, the bigger the difference between the mean and the value actually obtained is liable to be.

Let’s get a look at the first ten values in the vector x:

x[1:10]
##  [1] 71.13092 70.90465 66.70593 66.60878 61.61040 72.16172 72.81736 69.31187 75.27739
## [10] 70.35210

Sure enough, the numbers are typically around the mean of 70, give or take 3 units or so.

We can use the ggplot2 package to produce a density curve from all 100 values in the vector x. (The resulting plot appears as Figure 6.2.)

library(ggplot2)
p <- ggplot(mapping = aes(x = x)) + geom_density() + geom_rug() +
  labs(x = "x-value")
p

Figure 6.2: Density plot of 100 random values from a normal distribution with mean 70 and standard deviation 3.

The geom_density() part of the plotting-command produces the curve. The geom_rug() part produces tick-marks along the horizontal axis, each of which is located at a particular value in the vector x. You can see that the curve is higher where values are crowded together, and lower where the values are spread apart. The scale on the y-axis has been chosen carefully so that the area under the density curve between any two numbers \(a\) and \(b\) on the axis gives an estimate of the chance that the normal random variable would produce a value between \(a\) and \(b\). In this way the density curve gives us a visual sense of the distribution of the random variable: it seems rather likely to be between 67 and 73 (most of the area under the curve occurs there) and rather unlikely to be greater than 76, and so on.
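As a point of comparison, R’s pnorm() function computes the exact area under the theoretical normal curve. For a normal random variable with mean 70 and sd 3, the chance of landing between 67 and 73—that is, within one sd of the mean—is about 0.68:

pnorm(73, mean = 70, sd = 3) - pnorm(67, mean = 70, sd = 3)
## [1] 0.6826895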

We are now ready to simulate the dental-floss scenario. Since we have worked through the encapsulation-and-generalization development process a number of times by now, we will simply show a final-form implementation of a simulation function:

flossSim <- function(len = 150, min = 1, 
                     max = 2, reps = 10000, 
                     graph = FALSE, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  
  # helper function to pick which box gets used:
  whichOne <- function(x) {
    if (x < 0.5 ) {
      return("a")
    } else return("b")
  }
  
  # vector to contain our results:
  leftover <- numeric(reps)
  
  # now the simulation.  Each time we loop through, we
  # start with two fresh boxes and use them until one of
  # them runs out.
  for (i in 1:reps ) {
    # set the initial length of the two fresh boxes:
    a <- b <- len
    
    # at the beginning, both can be used:
    bothOK <- TRUE
    
    # start flossing
    while ( bothOK ) {
      # determine which box is picked
      # < 0.5 is a; >= 0.5 is b;
      boxPicked <- whichOne(runif(1))
      
      # if we choose box a, attempt to use floss from it
      if ( boxPicked == "a" ) {
        if ( a < 1 ) {
          leftover[i] <- b
          bothOK <- FALSE
        } else {
          useAmount <- min(runif(1, min, max), a)
          a <- a - useAmount
          if (abs(a) < 10^(-4)) {
            leftover[i] <- b
            bothOK <- FALSE
          }
        }
      }
      
      # use floss from b (if we choose it) 
      if ( boxPicked == "b" ) {
        if ( b < 1 ) {
          leftover[i] <- a
          bothOK <- FALSE
        } else {
          useAmount <- min(runif(1, min, max), b)
          b <- b - useAmount
          if (abs(b) < 10^(-4)) {
            leftover[i] <- a
            bothOK <- FALSE
          }
        }
      }
        
    }  # end while  loop
  }    # end for loop
  
  # report results:
  if ( graph ) {
    plotTitle <- paste0("Dental-Floss Simulation with ", reps, " Repetitions")
    p <- ggplot(mapping = aes(leftover)) + geom_density() +
      labs(x = "Amount left in spare (ft)",
          title = plotTitle)
    # add a rug of individual points, provided there are not too many:
    if ( reps <= 100 ) {
      p <- p + geom_rug()
    }
  }
  print(p)
  cat("The average amount left in the spare is:  ", 
      mean(leftover), ".", sep = "")
}      # end flossSim

Let’s give it a try with one thousand repetitions.

flossSim(reps = 1000, graph = TRUE, seed = 3535)

Figure 6.3: Density plot of the results of a dental-floss simulation.

## The average amount left in the spare is:  16.55715.

A density plot of the results appears as Figure 6.3. The expected value of the amount left in the remaining box is about 16.6 feet, but you can see from the plot that there is quite a bit of variability: it’s not all that unlikely to have 40 or more feet left!

6.8 Example: The Drunken Turtle

Let us return to the scenario of the Drunken Turtle that was introduced in Section 5.4.5. At each step the turtle would turn by a random angle—any angle between 0 and 360 degrees with equal likelihood—and then move forward by a fixed number of units. We wondered whether the turtle would simply wander away from its starting point or perhaps eventually loop back close to its starting point.

In this section we will construct a simulation that sheds light on our question. We will imagine that the turtle starts at the origin on the \(xy\)-plane and takes a large but fixed number of steps—one thousand steps, say—each of a fixed length of one unit, but each of which is preceded by a drunken turn through a random angle. We will concern ourselves with the random variable \(X\), defined as follows:

\(X\) = the number of times that the turtle returns to within \(1/2\) of a unit of the origin, during its first 1000 steps.

(Apparently we consider a “close return” to be anything within half a unit of the origin.)

In particular, we would like to know the distribution of \(X\): what is the chance that the turtle does not make a close return during the first thousand steps? What’s the chance of making only one close return? Two close returns? And so on … . We would also like to estimate the expected value of \(X\): the average number of close returns in the first 1000 steps.

We will next consider how to code a simulation. Obviously we will need a way to record the turtle’s position at each step. In order to accomplish this we will have to apply the unit-circle trigonometry you learned in high school.

Instead of measuring angles in degrees, we will work in radians: R’s trigonometric functions expect their angles in radians. Recall that 360 degrees is \(2 \pi\) radians; hence the way to determine the angle of a turn by the drunken turtle is with the following call to runif():

runif(1, min = 0, max = 2*pi)
## [1] 4.146176
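
By the way, if you ever need to convert an angle from degrees to radians yourself, just multiply by \(2\pi/360\). For example, 90 degrees works out to \(\pi/2\) radians:

90 * 2*pi/360
## [1] 1.570796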

If we want a thousand drunken turn-angles, then we can get them all in a vector, as follows:

runif(1000, min = 0, max = 2*pi)

So much for the turning angles. But in order to decide at each step whether the turtle has made a close return, we have to know the coordinates of the turtle after each step. This is where unit-circle trigonometry comes in handy. Recall that if a right triangle has a hypotenuse of length 1 (the length of a turtle step) and if one of the angles of the triangle is \(\alpha\), then the side adjacent to \(\alpha\) is \(\cos(\alpha)\) and the side opposite is \(\sin(\alpha)\). The situation is illustrated in Figure 6.4.


Figure 6.4: Sides of a right triangle with hypotenuse 1, in terms of the angle shown.

From this principle it follows that if the turtle turns a random angle of \(a_1\) radians for its first step, then after it takes the one-unit step it will be at the point \((\cos(a_1), \sin(a_1))\).

For the second step, the turtle begins by turning another random angle, say \(a_2\) radians. It then takes another one-unit step. This second step forms the hypotenuse of a second right triangle (see Figure 6.5). The horizontal and vertical sides of this right triangle must be \(\cos(a_2)\) and \(\sin(a_2)\) respectively. When the turtle completes the second step, it has added these values to its previous \(x\) and \(y\) coordinates, so that it is now at the point:

\[(\cos(a_1) + \cos(a_2), \sin(a_1) + \sin(a_2)).\]

We now see the idea: in order to record the position of the turtle at the end of each step, we simply need to:

  • take the cosine of all the turning-angles;
  • take the sine of all the turning-angles;
  • sum the cosines to get the \(x\)-coordinate;
  • sum the sines to get the \(y\)-coordinate.

Figure 6.5: The first two steps by the Drunken Turtle. Turtle turns by \(a_1\) radians for Step One, then by \(a_2\) radians for Step Two.
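
In general, then, after \(n\) steps the turtle will be located at the point

\[\left( \sum_{k=1}^{n} \cos(a_k),\ \sum_{k=1}^{n} \sin(a_k) \right),\]

where \(a_1, a_2, \ldots, a_n\) are the successive random turning angles.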

Let’s put this idea into practice, for just the first ten steps. First, get the turning angles, the sines and the cosines:

angle <- runif(10, 0, 2*pi)
xSteps <- cos(angle)
ySteps <- sin(angle)

Next, we need to add the xSteps together. If we compute a single sum, we get only the \(x\)-coordinate after the tenth step:

sum(xSteps)
## [1] -1.370258

(Apparently some of the cosines were negative, meaning that some of the turning angles were between 90 and 270 degrees.)

We need the \(x\)-coordinates after all of the previous steps, as well. This can be accomplished by R’s cumsum() function. You may think of cumsum as short for “cumulative sum.” Given a vector vec of numbers, cumsum(vec) returns a vector of the same length, where for every index i, cumsum(vec)[i] is sum(vec[1:i]). This is made clear in the following example:

vec <- c(3, 2, -1, 4, -2)
cumsum(vec)
## [1] 3 5 4 8 6

Look at the table below.

vec   cumulative sum         cumsum(vec)
  3   3                                3
  2   \(3+2\)                          5
 -1   \(3+2+(-1)\)                     4
  4   \(3+2+(-1)+4\)                   8
 -2   \(3+2+(-1)+4+(-2)\)              6

See how it works?

Returning to xSteps, we see that we can get the \(x\)-coordinates of the first ten steps like this:

x <- cumsum(xSteps)
x
##  [1] -0.6034165 -1.3905805 -2.1259607 -3.1154380 -2.4593130 -1.5476110 -0.8591924
##  [8] -1.6420445 -0.6421761 -1.3702583

Similarly we can get the first ten \(y\)-coordinates:

y <- cumsum(ySteps)

Using the distance formula, we can compute the distance of each point from the origin as follows:

dist <- sqrt(x^2 + y^2)

The following logical vector will then record, at each each step, whether or not the turtle has made a close return:

closeReturn <- (dist < 0.5)

The results are shown in Table 6.3.

Table 6.3: Record of the turtle’s first ten steps. It has not yet made a close return.
    angle     xSteps     ySteps          x          y     dist closeReturn
4.0646104 -0.6034165 -0.7974262 -0.6034165 -0.7974262 1.000000       FALSE
2.4769935 -0.7871641  0.6167437 -1.3905805 -0.1806826 1.402270       FALSE
3.8861615 -0.7353801 -0.6776548 -2.1259607 -0.8583374 2.292695       FALSE
2.9963954 -0.9894774  0.1446876 -3.1154380 -0.7136497 3.196131       FALSE
0.8551238  0.6561251  0.7546522 -2.4593130  0.0410024 2.459655       FALSE
0.4233886  0.9117020  0.4108522 -1.5476110  0.4518546 1.612226       FALSE
0.8114898  0.6884186  0.7253136 -0.8591924  1.1771682 1.457373       FALSE
2.4700328 -0.7828521  0.6222079 -1.6420445  1.7993761 2.435993       FALSE
0.0162276  0.9998683  0.0162269 -0.6421761  1.8156030 1.925826       FALSE
3.8968689 -0.7280822 -0.6854899 -1.3702583  1.1301131 1.776165       FALSE
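
By the way, since R treats TRUE as 1 and FALSE as 0 in arithmetic, summing the logical vector counts the number of close returns made so far:

sum(closeReturn)
## [1] 0

We will use this trick in the simulation function below to record the number of close returns in each repetition.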

It remains only to package our code into a simulation function. We will allow the user to specify the fixed initial number of steps (it does not have to be 1000), and also to determine what distance counts as a “close return.”

drunkenSim <- function(steps = 1000, reps = 10000, close = 0.5, 
                       seed = NULL, table = FALSE) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  
  # set up returns vector to store the number of 
  # close returns in each repetition:
  returns <- numeric(reps)
  
  for (i in 1:reps) {
    # random turning angles for all the steps:
    angle <- runif(steps, 0, 2*pi)
    xSteps <- cos(angle)
    ySteps <- sin(angle)
    
    # coordinates after each step:
    x <- cumsum(xSteps)
    y <- cumsum(ySteps)
    
    # distance from the origin after each step (note
    # that we use the close parameter, not a hard-coded 0.5):
    dist <- sqrt(x^2 + y^2)
    closeReturn <- (dist < close)
    returns[i] <- sum(closeReturn)
  }
  
  if ( table ) {
    cat("Here is a table of the number of close returns:\n\n")
    tab <- prop.table(table(returns))
    print(tab)
    cat("\n")
  }
  cat("The average number of close returns was:  ", 
      mean(returns), ".", sep = "")
}
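
In the reporting section above, the call prop.table(table(returns)) converts the counts produced by table() into proportions. Here is the idiom at work on a small made-up vector:

madeUp <- c(0, 0, 1, 2, 0)
prop.table(table(madeUp))
## madeUp
##   0   1   2 
## 0.6 0.2 0.2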

Let’s try it out:

drunkenSim(steps = 1000, reps = 10000, 
           close = 0.5, seed = 3535, table = TRUE)
## Here is a table of the number of close returns:
## 
## returns
##      0      1      2      3      4      5      6      7      8      9     10     11 
## 0.3881 0.2290 0.1517 0.0908 0.0572 0.0335 0.0227 0.0107 0.0064 0.0048 0.0018 0.0012 
##     12     13     14     15     16     17 
## 0.0008 0.0007 0.0003 0.0001 0.0001 0.0001 
## 
## The average number of close returns was:  1.5655.

Notice that 38.81% of the time the turtle did not make a close return at all during the first thousand steps. On the other hand, it is possible, though not at all likely, for the turtle to make quite a few close returns.

It is interesting to look specifically at the distances from the origin. The following function draws a graph of the distance from the origin for the specified number of turtle steps:

drunkenSim2 <- function(steps = 1000, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  
  angle <- runif(steps, 0, 2*pi)
  xSteps <- cos(angle)
  ySteps <- sin(angle)
  x <- cumsum(xSteps)
  y <- cumsum(ySteps)
  dist <- sqrt(x^2 + y^2)
  plotTitle <- paste0("First ", steps, " Steps of the Drunken Turtle")
  p <- ggplot(mapping = aes(x = 1:steps, y = dist)) + geom_line() +
    labs(x = "Step", y = "Distance From Origin",
         title = plotTitle)
  print(p)
}

Let’s try it out! The results appear in Figure 6.6.

drunkenSim2(steps = 1000, seed = 2525)

Figure 6.6: Line plot of the Turtle’s distance from the origin, as a function of how many steps the turtle has taken.

It appears that although the turtle might have returned close to the origin a couple of times, for the most part it seems to be wandering further and further away. In the mathematical field known as stochastic processes, it has been shown that if the turtle is allowed to go on forever then it will certainly make a close return infinitely many times. These returns, though, will tend to be spaced further and further apart as time goes on, and our graph suggests that this is so.

Glossary

Probability of an Event

The long-run proportion of times that the event will occur if the underlying random process is repeated many, many times.

Random Variable

A number whose value depends on the outcome of a chance process.

Expected Value of a Random Variable

The long-run average of the values of a random variable, where the underlying chance process is repeated many, many times.

Continuous Random Variable

A random variable that can assume any value in a range of real numbers.

Distribution of a Random Variable

A statement of the probabilities for a random variable to assume its various possible values.

Encapsulation and Generalization

A method for developing a computer program according to which the programmer first designs a basic procedure, then encapsulates it in one or more functions, and then progressively generalizes these functions until the program possesses all desired features.

Exercises

  1. Consider the following two games.

    • In Game A you flip a fair coin. If the coin comes up Heads you get two dollars, whereas if it comes up Tails you get one dollar.
    • In Game B you roll a fair die. If the six-spot comes up, you win twenty-five dollars. If you get 2, 3, 4, or 5, nothing happens. If the one-spot comes up, you lose fifteen dollars.

    In this exercise you need to answer two questions:

    1. If you could choose to play either Game A or Game B just once, which game would you prefer to play, and why?
    2. Suppose that you can choose either to play Game A ten thousand times, or Game B ten thousand times. Which choice would you prefer, and why?

    Answers to the first question will vary from person to person, depending on circumstances and personal taste. On the other hand, astute consideration of expected values leads most people to answer the second question in the same way. Therefore, before you answer the question please write two simulation functions: one called GameA that provides an estimate of the expected value of the winnings from a play of Game A, and another called GameB that estimates the expected value of the winnings from a play of Game B. The functions should have two parameters:

    • a reps parameter to permit the user to choose the number of repetitions of the game;
    • a seed parameter to permit the user to choose a starting seed.

    Use both functions with fifty thousand reps, in order to obtain estimates for the expected amount of winnings in a single play of each game. Use a simple starting seed for each function. Report the seeds you used, along with the estimated expected values for each game. Use your results to decide which game is the better choice to play ten thousand times.

    Note: Please keep in mind that the fifty-thousand figure is just the number of times you repeat your simulations in order to estimate the expected value of a single play of one of the games. It’s not how many times you actually get to play either game—that’s ten thousand.

  2. Reconsider the meeting of Anna and Raj. Suppose that they still plan to meet during the same one-hour period, but that Anna is willing to wait \(a\) minutes and Raj is willing to wait \(r\) minutes, where \(a\) and \(r\) could differ. Modify the meetupSim() function so that it accepts two additional parameters:

    • a: the time Anna is willing to wait;
    • r: the time Raj is willing to wait.

    Apply the function with ten thousand repetitions (and a simple seed that you report) in order to estimate the probability of a meeting, assuming that Anna is willing to wait 13 minutes and Raj is willing to wait 7 minutes.

    Hint: Suppose that Anna arrives at time \(anna\) and Raj arrives at time \(raj\). We want a simple mathematical expression that gives the conditions for Anna and Raj to connect. In the example in the text, this was:

    \[\vert\ anna - raj \ \vert < 10\]

    because both Anna and Raj were willing to wait the same amount of time: ten minutes. Now we have to think a bit more carefully when Anna waits \(a\) minutes and Raj waits \(r\) minutes, where \(a\) and \(r\) may differ. You might start by observing that we must have

    \[anna - r \le raj,\]

    for if not then Raj would arrive and leave before Anna gets there. Also, we would have to have

    \[raj \le anna + a,\]

    for if not then Anna would arrive and leave before Raj gets there. But if we have both of those conditions:

    \[anna - r \le raj \text{ and } raj \le anna + a\]

    then they must connect, since neither person will arrive too much before the other.
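
    In R, assuming that anna and raj hold the simulated arrival times, the two conditions combine into a single logical expression along these lines (a sketch only; adapt the names to whatever your meetupSim() code uses):

    connect <- (anna - r <= raj) & (raj <= anna + a)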

  3. Review the function dual() in the Practice Exercises of this Chapter (see Section 6.3.4). Modify the function as follows:

    • Add a parameter abe that gives the probability for Abe to hit when he shoots, and a parameter bo that gives the chance for Bo to hit when he shoots. The probabilities should be expressed as numbers between 0 and 1. (Thus, a 30% chance of hitting is 0.30.)
    • Make the function estimate the expected value of the number of shots taken in the duel, rather than the chance for Abe to win.

    When you have your function working, use it to estimate the expected number of shots when:

    • Abe’s chance of hitting is 50% and Bo’s chance of hitting is 50%;
    • Abe’s chance of hitting is 20% and Bo’s chance of hitting is 25%.

    In each simulation, use one hundred thousand repetitions and a seed that you choose.

  4. A candy-maker makes a particular type of candy that always has 20 candies in each bag. When you buy a bag of this candy, you take a few candies from it every day until they are all gone. The amount you take each day is a random whole number from 1 to 10. A possible scenario is:

    • Day 1: take 4 candies. (There are 16 left.)
    • Day 2: take 8 candies. (There are 8 left.)
    • Day 3: take 3 candies. (There are 5 left.)
    • Day 4: take 6 candies. (There were only 5 left, so you take all the remaining candies.)

    In the scenario above, it took you four days to finish the bag.

    Write a function called daysNeededSim() that will estimate the expected number of days required to empty a bag. It should take two parameters:

    • reps: the number of times to simulate the process of emptying one bag. The default value should be 10,000.
    • seed: a random seed that the user can set. The default value should be NULL.

    Apply the function with 50,000 repetitions and a seed that you report, in order to estimate the expected number of days required to empty the bag.

    Hint: This scenario is quite similar to the numberNeededSim() problem from this Chapter. Begin by reviewing it.

  5. (*) Refer back to the numberNeededSim() function. Write a program that computes the estimated expected number needed, for the following targets:

    targets <- seq(0.05, 0.95, by = 0.05)

    Each estimate should be based on ten thousand repetitions. Start with a simple seed that you report. Report the estimated expected values for all of the required targets. Compare these estimates with exp(targets) (the number \(e\) raised to the power of each of the targets.) Formulate a conjecture about the value of the expected number of numbers needed, when the target is a real number between 0 and 1.

    Hint: Recall that the numberNeededSim() function returns its estimate of the expected number needed. Hence you could use it inside a for-loop that iterates through the elements of the targets vector above, something like this:

    targets <- seq(0.05, 0.95, by = 0.05)
    n <- length(targets)
    results <- numeric(n)
    set.seed(3030) # or some other simple seed that you like
    for ( i in 1:n ) {
      # use numberNeededSim() with:
      #    * 10000 reps
      #    * seed left at the default NULL (you provided one already)
      #    * target set to targets[i]
      #    * report left at FALSE (no need to have R talk to you at each step)
      # make sure to store the estimate in results[i]
    }
    # after the loop, you can compare results with exp(targets)

  6. (*) A pipe-smoker has two boxes of matches in his pocket. Both boxes contain 40 matches. Every time he lights his pipe, the smoker reaches into his pocket and randomly picks one of the boxes, pulling a match from the box. Eventually one of the boxes runs out. What is the expected value of the number of matches remaining in the other box at that time? Write a simulation function to estimate the answer. Your function should have at least a reps and a seed parameter. Apply the function with ten thousand repetitions—and with a simple seed that you report—in order to estimate the expected value.
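
    Hint: This scenario closely resembles the dental-floss example developed in this Chapter. Begin by reviewing the flossSim() function.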

  7. (*) A gambler starts with 10 dollars, and repeatedly plays a game. If she wins the game she gets one dollar and if she loses the game she loses one dollar. Her chance of winning each game is a fixed number that is less than 0.50. Write a function to simulate the gambler repeatedly playing the game until her money runs out. The function should keep track of how much money she has left after each play, so that it can produce a line graph (similar to the one in the section on the Drunken Turtle) of the money left after each play. Write the function so that it has

    • a start parameter for the initial amount of money the gambler has;
    • a seed parameter;
    • a p parameter for the chance of winning, so that the user can enter any chance less than or equal to 0.5. (The function should stop the user if the user enters more than 0.5.)

    Apply the function with a simple seed that you report, a starting amount of ten dollars, and a chance of winning equal to 0.45.
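
    Hint: to make the function stop the user, you can check p at the top of the function body and call R’s stop() function, something like this:

    if ( p > 0.5 ) {
      stop("p must be at most 0.5")
    }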

  8. (*) An absent-minded professor walks to and from campus every day. Prior to every commute—either from home to office or the reverse—she checks the weather, taking an umbrella with her if it is raining. The problem is that she never takes an umbrella when the weather is dry, so even though she owns quite a few umbrellas the time comes eventually when it is raining but all of her umbrellas are at her destination, so that she will have to put up with a wet commute.

    Suppose that there are \(x\) umbrellas at her house and \(y\) umbrellas at her office, and that she begins at home. Suppose further that the chance of rain on any commute is \(p\), a number between 0 and 1.

    Write a simulation function to estimate the expected number of dry commutes that she will enjoy before her first wet commute. The function should have five parameters:

    • x: the initial number of umbrellas at home (default 1);
    • y: the initial number of umbrellas at the office (default 1);
    • p: the probability of rain at any commute (default 0.5);
    • seed: a seed for the randomizers (default NULL);
    • reps: the number of times to simulate commuting until wet.

    Apply the function with x and y set at three, for the values 0.1, 0.5 and 0.9 for p. Use 10,000 repetitions for each simulation, with seeds that you report. Which value of p results in the smallest expected number of dry commutes?