6 Simulation
6.1 Probability and Random Variables
Your favorite college basketball team is in the NCAA Tournament. What is the chance that it will make the Final Four?
An insurance company provides flood insurance to homeowners. How much can the company expect to pay out in insurance claims in the coming year?
What is the chance of a large asteroid striking our planet in the next ten years? in the next ten million years?
All of the above questions involve random processes—some of them rather complex. It is the aim of this Chapter to apply our knowledge of vectors, functions and flow control to give approximate answers to questions involving chance processes like these.
Sometimes we will be interested in finding the probability that an event of interest occurs: for example, the event that your team makes the final Four, or the event that the Earth is struck by a large asteroid sometime in the next ten years.
One popular definition of probability goes as follows:
If \(E\) is an event, then the probability that \(E\) occurs is the long-term proportion of times that the event occurs, if we could repeat the random process many, many times.
To be more precise, imagine that you can repeat the random process \(n\) times. (Imagine that your team gets into the NCAA Tournament \(n\) times, for example.) Each time, the event of interest will either occur or not occur. Count the number of times that the event occurs. Then divide by \(n\). You now have the proportion of times that it occurred. Now imagine that \(n\) gets larger and larger. Our intuition says that the proportion of times that the event occurs will stabilize at some number between 0 and 1. This number is the probability that the event occurs. occurs.
We will also concern ourselves with random variables. A random variable is simply a number whose value is determined by the outcome of a chance process. The amount that the insurance company will pay out in the next year is an example of a random variable: its value depends on a complex chance process—how many homeowners experience a flood, how damaging each flood is, etc.
When it comes to random variables, we are often interested in what it might turn out to be on average. That is, suppose we could repeat the random process many, many times—say \(n\) times, where \(n\) is some large number. (Suppose that we could observe the insurance company for many, many years, with each year being the same in terms of how many people are insured, how much their houses are worth, what the climate is like, and so on.) Each time the process is repeated, we get a value for the random variable. We end up with \(n\) values. Now add up these values and divide by \(n\). We have computed their average—the mean, as it is properly called. Now imagine that \(n\) could be larger and larger, without bound.
The expected value of the random variable is the number that this average converges to.
In other words, the expected value of a random variable is what we expect to get on average, in the long run, if we could repeat the random process that produces the value many, many times.
6.2 Estimating the Probability of an Event
Our definitions of probability and expected value both involved a limiting notion, namely: what would happen if you could somehow repeat the random process more and more times, without a bound on the number of repetitions. Accordingly, even if we find that we are unable to compute a probability or an expected value exactly with mathematics, we can still attempt to estimate it by making the computer repeat the random experiment many times, keeping track of the result of the experiment each time. This technique is known as Monte Carlo simulation, after the famous Monte Carlo casino in the Principality of Monaco.
In this section we will employ Monte Carlo simulation to estimate probability and expected value in a couple of simple examples.
6.2.1 Example: Picking From a Box
Consider a box that holds ten tickets. Three of them are labeled with the letter “a”; the rest are labeled “b”:
<- c(rep("a", 3), rep("b", 7))
tickets tickets
[1] "a" "a" "a" "b" "b" "b" "b" "b" "b" "b"
We plan to draw one of these tickets at random from the box, with each ticket having the same chance to be the ticket selected. Three of the ten tickets are a’s, so our intuition says that the probability of selecting an “a” ticket is 3 out of 10, or 0.3. Let’s use Monte Carlo simulation to estimate the probability, and see if we get something close to 0.3.
The first step in building the simulation is create a model, within R, of the scenario. In other words, we need to write some code to make the box of tickets.
One way to do this is as follows:
<- c(rep("a", 3), rep("b", 7))
tickets tickets
[1] "a" "a" "a" "b" "b" "b" "b" "b" "b" "b"
The vector tickets
represents the box, and its elements—the characters "a"
and "b"
—represent the tickets themselves.
Next, we need some R-code that represents picking a ticket from the box at random. The sample()
function can help us:
sample(tickets, size = 1)
[1] "a"
As our next step, we need to write some R-code that represents repeating the action of drawing from the box many times. Along the way, we must keep a record of what the samples were, so that at the end we can figure out what proportion of the times we got the "a"
ticket.
From the our previous study of flow-control, we know how to use a for
-loop to store results (see Section 4.3.1.2). We will apply that knowledge here, sampling from the box a large number of times—say, one thousand times. Each time, we will record whether or not we got an "a"
-ticket.
set.seed(5050)
<- logical(length = 1000)
results for (i in 1:1000) {
<- sample(tickets, size = 1)
selection <- selection == "a"
got_a <- got_a
results[i] }
We now have a logical vector of length 1000. In this vector, TRUE
means that we picked "a"
, FALSE
means that we picked "b"
.
Finally, we compute the proportion of times that we got the "a"
, because this will be our estimate of the probability of getting an a-ticket form the box.
To do this, we can count the number of of TRUE
s in the results
vector, and divide by 1000. Recall that to count, we can use the sum()
function on results
(see Section 2.5.1):
sum(results)
[1] 321
We need only divide by 1000 in order to get our estimated probability:
sum(results) / 1000
[1] 0.321
The mean()
function will do it all at once for us:
mean(results)
[1] 0.321
Putting it all together, here is the work we did to estimate the probability of getting an a-ticket:
## set up the box:
<- c(rep("a", 3), rep("b", 7))
tickets ## set up vector to store results:
<- logical(length = 1000)
results ## get many results with loop:
for (i in 1:1000) {
## pick from the box once:
<- sample(tickets, size = 1)
selection ## figure out whether the event occurred:
<- selection == "a"
got_a ## store in results vector:
<- got_a
results[i]
}## compute estimated probability of a:
mean(results)
That’s a Monte-Carlo simulation!
Of course our simulated picks are random, so estimate we got is subject to chance variation. Sampling from the box a much larger number of times would reduce the effect of chance variation, yielding an estimate that probably comes closer to the true probability of drawing an a-ticket.
So it makes sense to let the number of picks vary. Accordingly, we modify our code so that we can easily set the number of simulations to perform:
## set up how many simulations to perform.
## try 10,000 this time
<- 10000
reps <- c(rep("a", 3), rep("b", 7))
tickets <- logical(length = reps)
results for (i in 1:reps) {
<- sample(tickets, size = 1)
selection <- selection == "a"
results[i]
}mean(results)
[1] 0.2932
The larger the value of reps
, the closer our estimate is liable to be to 0.3, the true probability of drawing an a-ticket.
It makes a lot of sense to encapsulate our work into a function, so that we can easily repeat the Monte-Carlo simulation:
<- function(reps) {
ticket_sim <- c(rep("a", 3), rep("b", 7))
tickets <- logical(length = reps)
results for (i in 1:reps) {
<- sample(tickets, size = 1)
selection <- selection == "a"
results[i]
}mean(results)
}
Estimating a Probability With a Loop
In general, when we use a loop to perform a Monte-Carlo simulation to estimate a probability, our code will follow this template:
<- logical(length = reps_desired)
results for (i in 1:reps_desired)
## some code to make the random process happen once
## more code to find out whether the event occurred
<- whether_event_occurred
results[i] }
mean(results)
6.2.2 Loops Are Not Always Needed
When we used the sample()
function, we set the size
parameter to 1, to get the random process of picking from the box to happen once.
It is perfectly possible, however, to simulate the random process many times, simply by setting size
to the number of repetitions that we want:
<- sample(tickets, size = 1000, replace = TRUE) selections
It was important to mention the parameter replace
, as it determines whether or not R will replace the the element it picks, each times it makes its a selection. The default value of replace
is FALSE
so if we keep replace
at its default value then we cannot sample more than the number of elements in the vector we are sampling from. Look at this:
## pick 5 without replacement
sample(tickets, size = 5)
[1] "b" "b" "b" "a" "b"
## pick 10 w/o replacement, randomly scrambling the tickets:
sample(tickets, size = 10)
[1] "b" "b" "b" "b" "a" "a" "b" "b" "b" "a"
## trying to pick 20 tickets w/o replacement (oops):
sample(tickets, size = 20)
Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
(If any of the above is not familiar, then review Section 5.4.1!)
We now have all of our picks at once. We can then use the vectorization abilities of the ifelese()
function to find out all at once whether or not we got an "a"
each time:
<- ifelse(
got_a == "a",
selections TRUE,
FALSE
)
Then we can compute our estimate of the probability of drawing an a-ticket:
mean(got_a)
[1] 0.288
We can encapsulate our work into a function:
<- function(reps) {
ticket_sim_2 <- c(rep("a", 3), rep("b", 7))
tickets <- sample(tickets, size = reps, replace = TRUE)
selections <- selections == "a"
got_a mean(got_a)
}
Which functions is better to use? On the one hand it doesn’t matter, because both functions deliver perfectly good simulations.
On the other hand R, the second function runs noticeably faster, for a large number of repetitions. We can check this with the function
system.time({
<- ticket_sim(reps = 100000)
estimate_1 })
user system elapsed
0.965 0.021 1.011
The simulation too 1.011 seconds on my Mac Book Pro computer.
Here is the estimate:
estimate_1
[1] 0.29938
Now we try with the loop-less simulation:
system.time({
<- ticket_sim_2(reps = 100000)
estimate_2 })
user system elapsed
0.024 0.003 0.029
The simulation completed in 0.029 seconds, with an estimate of:
estimate_2
[1] 0.29779
Both of the estimates are pretty good, as we might expect with one hundred thousand repetitions.
Both of the simulations also finished pretty quickly, but the simulation with the loop took \(1.011 / 0.029 \approx 37\) times as long to complete. If we for some reason we needed to simulate with an extremely large number of repetitions, or if for some reason it takes a significant amount of time to simulate just once, then a loop-less solution is clearly preferable.
So bear the following in mind:
When performance is an issue, prefer vectorization to looping as much as possible.
6.2.3 Example: Hat Toss
Suppose that a group of \(n\) people get together. Each person has a hat.
- They toss their hats into the air.
- As the hats land, each person scrambles around to find a hat.
What is the probability that at least one person will pick up her own hat?
Let’s try to work out a specific case, say, where there are 10 people.
<- 10 n
We can model the situation like this:
<- 1:10
people <- 1:10 hats
The idea is the the i-th hat belongs to the i-th person, for each \(i\), from 1 to \(n\).
We can simulate the random toss-and-pickup process by making sample()
randomly scramble the hats:
<- sample(hats, size = n, replace = FALSE) picked_up_hats
Comparing people, with the scrambled hats, we can see whether anyone got their own hat:
people
[1] 1 2 3 4 5 6 7 8 9 10
picked_up_hats
[1] 1 9 10 5 4 3 2 7 8 6
Looks like person 1 got her own hat!
We can make R compute whether or not at least one person got her own hat:
any(people == picked_up_hats)
[1] TRUE
Let’s make a simulation-function to repeat this procedure many times and return an estimate of the probability that at least one person picks up her own hat:
<- function(n, reps) {
hat_toss_sim <- 1:n
people <- 1:n
hats <- logical(reps)
results for (i in 1:reps) {
<-
picked_up_hats sample(
size = n,
hats, replace = FALSE)
<-
got_a_match any(people == picked_up_hats)
<- got_a_match
results[i]
}mean(results)
}
Let’s try the simulation, for \(n = 10\) people:
hat_toss_sim(n = 10, reps = 10000)
[1] 0.6287
The following fact is quite interesting:
As the number of people increases, the probability that at least one person picks up her own hat converges to:
\[1- 1/\mathrm{e} \approx 0.63.\]
e is an irrational number, \(e \approx 2.718\), that shows up a lot in mathematics.
6.2.4 Practice Exercises.
These are just drills on sample()
.
Write R-code to pick one letter at random, from the vector
letters
.Write R-code to randomly pick seven different letters from
letters
.Write R-code to randomly pick seven letters from
letters
. Each time you pick, you should have the same chance of getting any of the 26 letters.Write R-code to randomly pick one thousand letters from
letters
. Each time you pick, you should have the same chance of getting any of the 26 letters.Write R-code to scramble the 26 letters in the vector
letters
in a random order.A box contains twenty five-dollar bills and fifteen ten-dollar bills. Write R-code to simulate picking a bill from the box at random, one hundred times. (Each time you pick a bill, you return it to the box.)
6.2.5 Solutions to the Practice Excercises
sample(letters, size = 1)
Here we use the fact that the default-value of
replace
isFALSE
:
sample(letters, size = 7)
[1] "g" "w" "t" "s" "o" "c" "d"
sample(letters, size = 7, replace = TRUE)
sample(letters, size = 1000, replace = TRUE)
sample(letters, size = length(letters))
I can think of a couple of solutions. In the first solution, you set up a box and then sample from it:
<- c(rep(5, times = 20), rep(10, times = 15))
box sample(box, size = 100, replace = TRUE)
[1] 10 5 5 5 10 5 5 5 10 5 10 10 10 5 5 5 10 5 10 5 5 5 5 10 10
[26] 5 5 5 5 5 10 5 10 5 5 10 5 5 5 5 5 5 10 5 5 5 10 10 5 5
[51] 10 10 10 10 5 10 5 10 10 10 5 5 5 10 5 5 10 10 5 5 10 5 5 10 5
[76] 5 10 5 10 5 5 10 5 5 10 5 5 5 10 5 10 10 5 5 10 10 10 5 5 5
Here is another approach, where you use the proportions of bills in the box:
sample(
c(5, 10),
size = 100,
prob = c(20 / 35, 15 / 35),
replace = TRUE
)
[1] 5 5 5 5 5 5 10 5 10 5 10 10 5 10 5 5 10 10 5 10 5 5 5 5 5
[26] 5 5 5 10 5 5 10 10 5 10 5 5 10 5 10 5 5 5 10 10 5 10 5 5 10
[51] 10 10 10 10 5 10 10 5 5 5 10 5 10 10 5 10 10 5 5 5 5 10 10 10 10
[76] 10 5 10 10 5 10 5 5 10 10 5 10 10 10 10 10 10 5 10 5 5 10 5 5 10
- Try this:
sample(5:10, size = 4, replace = TRUE)
[1] 6 5 7 7
6.3 Estimating the Expected Value of a Random Variable
We can use the Monte-Carlo idea to estimate expected values of random variables, too.
6.3.1 Example: a Coin-Toss Game
Imagine that you are about to play a game, where you will flip a fair coin twice.
- If you get Tails both times, you lose a dollar.
- If you get exactly one Head, nothing happens.
- If you get Heads both times, you win two dollars.
Let \(W\) be the number of dollars you will win.
\(W\) is a clearly a random variable: it’s a number whose value—either -1, 0 or 2—depends on the outcome of the random process of flipping the fair coin twice. What is the expected value of \(W\)?
We would like to code a Monte-Carlo procedure for estimating the expected value. In order to do this, we have to write some code that represents playing the game once.
In order to write the code for playing the game once, we must think about the probabilities involved. When you flip a fair coin twice, there are four equally likely outcomes:
- Tails and then Tails (\(W=-1\))
- Tails and then Heads (\(W=0\))
- Heads and then Tails (\(W=0\))
- Heads and then Heads (\(W=2\))
Therefore:
- The chance that \(W=-1\) is 0.25.
- The chance that \(W=0\) is 0.50.
- The chance that \(W=2\) is 0.25.
The above statement of the chances for \(W\) to take on each of its possible values is called the distribution of \(W\).
Let’s record this information in R:
<- c(-1, 0, 2)
possible_winnings <- c(0.25, 0.50, 0.25) probabilities
Together, these two vectors represent the rules of the game, in R code!
Next we consider how to write the code that represents playing the game once. It will please you to learn that once again this can be done with the sample()
function:
sample(possible_winnings, size = 1, prob = probabilities)
[1] 2
Here we have used the parameter prob
to specify that when R picks from possible_winnings
it must:
- give the -1 a 0.25 chance to be picked,
- give the 0 a 0.50 chance to be picked, and
- give the 2 a 0.25 chance to be picked.
Now we can write a program to estimate the expected value of \(W\). Following, Remark 4.1, we the program will use a loop to produce and store large number of simulations.
First, we decide how many simulations we want:
<- 1000 reps
Next, we set up a results vector of the required length:
<- numeric(length = reps) winnings
(The results-vector is numeric this time, because we want to record winnings, which are numbers.)
Then we write a for
-loop to populate the results vector with simulated results:
for (i in 1:reps) {
<- sample(
winnings[i]
possible_winnings, size = 1,
prob = probabilities
) }
Now the expected value of a random variable is the average of the values you get if you “try” the random variable many, many times. Hence it makes sense that to estimate the expected value of \(W\) we can just compute the mean of the winnings:
mean(winnings)
[1] 0.239
This number is our estimate of the amount of money you would win per play of the game, if you played the game many times.
Of course, to get a better estimate we should employ more repetitions. Let’s prepare to do this by first encapsulating our work into a function:
<- function(reps) {
coin_game_sim ## set up the game:
<- c(-1, 0, 2)
possible_winnings <- c(0.25, 0.50, 0.25)
probabilities ## set up the results vector:
<- numeric(length = reps)
winnings ## one-by-one, get the results:
for (i in 1:reps) {
<- sample(
winnings[i]
possible_winnings,size = 1,
prob = probabilities
)
}## return the estimate of expected value:
mean(winnings)
}
The above illustrates a fool-proof general way to estimate the expected value of a random-variable:
Estimating an Expected Value With a Loop
In general, when we use a loop to perform a Monte-Carlo simulation to estimate an expected value, our code will follow this template:
<- numeric(length = reps_desired)
results for (i in 1:reps_desired {
## some code to make the random process happen once
## more code to find out the value that the random variable took on
<- value_that_the_random_variable_took_on
results[i]
}mean(results)
As in the tickets example above, it’s possible to write our simulation with vectorization instead of loops. We do this by making the sample()
function take all of its samples at once.
<- function(reps) {
coin_game_sim_2 <- c(-1, 0, 2)
possible_winnings <- c(0.25, 0.50, 0.25)
probabilities ## get the winnings all in a flash:
<- sample(
winnings
possible_winnings,size = reps,
prob = probabilities,
replace = TRUE ## <-- very important!
)mean(winnings)
}
Let’s try it out, with 100,000 repetitions. Let’s also take care to set the random seed (see Section @ref(setting-seed)).
set.seed(2323)
coin_game_sim_2(reps = 100000)
[1] 0.24825
This estimate is much more likely to be very close to the true expected value. It looks like we can expect to win about 25 cents per play of the game, on average.
6.3.2 The Law of Large Numbers
In our Monte Carlo simulations so far, we have seen that the more times we repeat the underlying random process, the closer our estimate is likely to be to the actual value, no matter whether we were estimating the probability of an event or an expected value for a random variable.
This is no accident: in fact, it is a consequence of a theorem in the subject of probability that is known as the Law of Large Numbers. We do not possess the mathematical machinery necessary to state the Law precisely—much less to prove it—but we can take the rough statement given here as an assurance that the more repetitions we include in our simulation, the more accurate the resulting estimate is liable to be. (Read Section @ref(morelln) if you are curious to know more.)
6.3.3 Extra Feature: A table()
of the Simulations
R has a function called table()
that will tally the number of occurrences of each different element of a vector. Here is an example:
<- c(
some_people "Bettina", "Sita", "Raj", "Ram",
"Ram", "Sita", "Raj", "Bettina",
"Karl", "Sita", "Sita", "Raj"
)table(some_people)
some_people
Bettina Karl Raj Ram Sita
2 1 3 2 4
R has another function called prop.table()
. When you give it a table, it will compute the proportion of times each element occurs:
<- table(some_people)
my_table prop.table(my_table)
some_people
Bettina Karl Raj Ram Sita
0.16666667 0.08333333 0.25000000 0.16666667 0.33333333
If you like, you can round off to fewer decimal places with the round()
function:
round(prop.table(my_table), digits = 3)
some_people
Bettina Karl Raj Ram Sita
0.167 0.083 0.250 0.167 0.333
It can be illuminating to examine a proportions table for the simulations of a random variable such as \(W\), the winnings in our dice game. Accordingly, we modify our simulation function to provide the user with an option for a table:
<- function(reps, table = FALSE) {
coin_game_sim_3 <- c(-1, 0, 2)
possible_winnings <- c(0.25, 0.50, 0.25)
probabilities <- sample(
winnings
possible_winnings,size = reps,
prob = probabilities,
replace = TRUE ## <-- very important!
)if (table) {
cat("Here is a table of the simulated winnings:\n\n")
<- prop.table(table(winnings))
prop_tab print(round(prop_tab, digits = 3))
cat("\n")
}mean(winnings)
}
With the default value of table
we will get just the estimated expected value. But let’s over-ride the default, setting table
to TRUE
:
coin_game_sim_3(reps = 10000, table = TRUE)
Here is a table of the simulated winnings:
winnings
-1 0 2
0.249 0.503 0.248
[1] 0.2473
Notice that the distribution of simulated winnings is very close to the actual distribution of \(W\). By the Law of Large numbers, this is to be expected when we simulate with a large number of repetitions.
6.3.4 Practice Exercises.
Make sure to review sample()
, and also runif()
(see Section 5.4.1 and Section 5.4.3).
Write some R-code to generate four whole numbers at random. The possible numbers should be 5, 6, 7, 8, 9, and 10.
Write some R-code to pick 20 real numbers at random. the numbers should be equally likely to be anywhere from -5 to 5.
Write some R-code to pick 10 real numbers at random. the numbers should be equally likely to be anywhere from 4 to 6.
Now let’s program some more complex simulations:
Consider the following game: you roll a fair die. If you get one or two spots, then you lose a dollar. If you get three, four, or five spots then nothing happens. If you get six spots then you win three dollars. You would like to know how much money you can expect to win per play of the game, on average.
Write a function called
diceGameSim()
that will help you answer the question. The function should take the following parameter:reps
: the number of times to repeat the simulation. The default value should be 10000.
The function should return an estimate of the expected value of the game.
You command a Klingon vessel, and you are shooting photon torpedoes at the Starship USS Enterprise. Every torpedo you fire has a strength \(s\) that is a random real number between 0 and 12. If the strength of the torpedo is less than or equal to 5, then the Enterprise’s shields will ward off any damage. However, if the strength \(s\) exceeds 5, then the damage inflicted on the Enterprise will be \(12-s\) units. What is the expected amount of damage done to the Enterprise, per shot? Write a function called
photonTorpedoSim()
that estimates the expected damage. The function should take the following parameter:reps
: the number of times to repeat the process of firing at the Enterprise. The default value should be 10000.
The function should return the estimated expected value. Note: the torpedo strength is a random real numer from 0 to 12, NOT a random whole number from 0 to 12. You should use
runif()
, notsample()
, to simulate the torpedo strengths.
6.3.5 Solutions to the Practice Exercises
- Try:
sample(5:10, size = 4, replace = TRUE)
[1] 10 5 5 9
- Since there are infinitely many real numbers from -5 to 5, you cannot use
sample()
. You’ll have to userunif()
:
runif(20, min = -5, max = 5)
[1] -3.1113267 4.7812829 4.2167896 3.8535674 3.7988220 -1.4105410
[7] -3.8463985 -0.2599296 0.1279526 1.0630907 -1.5068249 -0.7303148
[13] 4.1769733 4.8493451 -0.2195449 1.9953097 4.1153311 1.8031662
[19] 2.8356723 4.1563114
- Try this:
runif(10, min = 4, max = 6)
[1] 4.011778 4.221673 5.071291 4.310030 5.839662 5.687888 5.704572 4.979183
[9] 4.414390 4.689362
- Here’s the code:
<- function(reps) {
diceGameSim <- c(-1, 0, 3)
outcomes <- c(2 / 6, 3 / 6, 1 / 6)
probs <- sample(
winnings
outcomes,size = reps,
replace = TRUE,
prob = probs
)mean(winnings)
}
Let’s try it out:
set.seed(3838)
diceGameSim(reps = 100000)
[1] 0.16548
Our estimate, based on one hundred thousand repetitions of the game, is that you can expect to win about 16.55 cents per game. (In fact the exact expected value is \(1/6\) of a dollar.)
- Here’s the code, if you use the
pmax()
function:
<- function(reps) {
photonTorpedoSim <- runif(reps, min = 0, max = 12)
strengths <- pmax(0, strengths - 5)
damage mean(damage)
}
Let’s try it out:
set.seed(3838)
photonTorpedoSim(reps = 100000)
[1] 2.043227
You can expect to do about 2.04 points worth of damage per shot. (In fact the exact expected value is \(49/24 \approx 2.04167\).)
Here’s the code if you use ifelse()
:
<- function(reps) {
photonTorpedoSim <- runif(reps, min = 0, max = 12)
strengths <- ifelse(strengths < 5, 0, strengths - 5)
damage mean(damage)
}
Here’s the code if you use a loop:
<- function(reps) {
photonTorpedoSim <- numeric(reps)
damage for (i in 1:reps) {
<- runif(1, min = 0, max = 12)
strength if (strength > 5) {
<- strength - 5
damage[i] else {
} <- 0
damage[i]
}
}mean(damage)
}
6.4 Example: Chance of a Triangle
Let’s revisit an example from Section 4.2.3, where we considered whether or not three segments could be put together to form a triangle. We developed the following function to determine, from the lengths of each segment, whether or not a triangle is possible:
<- function(x, y, z) {
isTriangle + y > z) & (x +z > y) & (y + z > x)
(x }
Recall also that this function applies to vectors of any length, so we can use it to decide about many triples of segments at once.
Let’s now inject some chance variation into the situation. Suppose that you have a stick that is one unit in length: one foot, one meter, one yard—whatever. You plan to break the stick at two randomly selected points. Breaking the stick will yield three pieces. You wonder: what is the chance that these three pieces can form a triangle?
In an advanced probability course you can show that the probability of a triangle is exactly \(1/4\). Here we would like to estimate the probability with Monte Carlo simulation.
In order to carry out the simulation, we will need to develop code that produces a pair of break-points, any number of times that we like—let’s think about ten times, to start.
This is not so difficult if we use the runif()
function. The default values for min
and max
are 0 and 1 respectively, so we can get the required numbers like this:
<- runif(10) # the first breaks
x <- runif(10) # the second breaks y
Next we must compute—for each pair of breaks—the lengths of the three segments that are produced. If we knew that the x
break was always less than the y
break, this would be easy:
- the leftmost piece would be
x
units long; - the middle piece would be
y
minusx
units long; - the rightmost piece would be 1 minus
y
units long.
The problem is that a given element of the vector y
can easily be less than than corresponding element of the vector x
. When that happens, the pieces won’t be as described in the bullet-ed items above.
We can solve this problem with the pmin()
and pmax()
functions that were introduced in Section 2.6.4:
<- pmin(x, y)
a <- pmax(x, y)
b <- a # leftmost
side1 <- b - a # middle
side2 <- 1 - b # rightmost side3
It is useful to write a helper-function that starts from the breaks, computes the sides, and then decides whether they form a triangle:
<- function(x, y) {
makesTriangle <- pmin(x, y)
a <- pmax(x, y)
b <- a
side1 <- b - a
side2 <- 1 - b
side3 isTriangle(x = side1, y = side2, z = side3)
}
Let’s try it our for our repetitions
Table 6.1 shows the results. Based on our ten repetitions, we would estimate the probability of a triangle as \(4/10\).
x | y | side1 | side2 | side3 | triangle |
---|---|---|---|---|---|
0.6469028 | 0.7644140 | 0.6469028 | 0.1175112 | 0.2355860 | FALSE |
0.3942258 | 0.7438358 | 0.3942258 | 0.3496100 | 0.2561642 | TRUE |
0.6185018 | 0.8261657 | 0.6185018 | 0.2076639 | 0.1738343 | FALSE |
0.4768911 | 0.4227291 | 0.4227291 | 0.0541621 | 0.5231089 | FALSE |
0.1360972 | 0.4092877 | 0.1360972 | 0.2731905 | 0.5907123 | FALSE |
0.0673844 | 0.5396926 | 0.0673844 | 0.4723082 | 0.4603074 | TRUE |
0.1291526 | 0.9607224 | 0.1291526 | 0.8315698 | 0.0392776 | FALSE |
0.3931179 | 0.6535573 | 0.3931179 | 0.2604394 | 0.3464427 | TRUE |
0.0025827 | 0.5467153 | 0.0025827 | 0.5441326 | 0.4532847 | FALSE |
0.6202060 | 0.2660636 | 0.2660636 | 0.3541424 | 0.3797940 | TRUE |
We can compute the proportion of times that a triangle was formed by taking the mean of the logical vector triangle
:
mean(triangle)
[1] 0.4
We would like, of course, to simulate breaking the unit length many times, and we would like to be able to choose easily—without a lot of fuss in coding—the number of repetitions. So once again it makes sense to write a simulation function that will do our work for us:
<- function(reps = 10000) {
triangleSim <- runif(reps)
cut1 <- runif(reps)
cut2 <- makesTriangle(cut1, cut2)
triangle cat(
"The chance of being able to form a triangle\n",
"is approximately ", mean(triangle), ".\n", sep = ""
) }
Now that we think about it, it might also be nice to have the option to view a table of the results, along with the estimate of the probability:
<- function(reps = 10000, table = FALSE ) {
triangleSim <- runif(reps)
cut1 <- runif(reps)
cut2 <- makesTriangle(cut1, cut2)
triangle if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(triangle))
cat("\n")
}cat(
"The chance of being able to form a triangle\n",
"is approximately ", mean(triangle), ".\n", sep = ""
) }
Let’s give our function a try, using the default of ten thousand repetitions and asking for a table of results:
triangleSim(table = TRUE)
Here is a table of the results:
triangle
FALSE TRUE
7466 2534
The chance of being able to form a triangle
is approximately 0.2534.
Our estimate of the probability of a triangle is pretty close to \(1/4\).
6.4.1 Extra Feature: Option to Set a Seed
There are further possibilities for refining the simulation function. We know that it can be a good idea to set a seed for R’s random-number generator. When you do this you still get random-looking results, but they will be the same results no matter how often you call the simulation function. In that way others who have access to your function can “reproduce” the results you got, thus assuring themselves that you weren’t making anything up. Let’s add a seed
parameter to our simulation function:
<- function(reps = 10000, table = FALSE, seed ) {
triangleSim set.seed(seed)
<- runif(reps)
cut1 <- runif(reps)
cut2 <- makesTriangle(cut1, cut2)
triangle if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(triangle))
cat("\n")
}cat(
"The chance of being able to form a triangle\n",
"is approximately ", mean(triangle), ".\n", sep = ""
) }
This is all very well, but if for some reason you desire to have the function produce different results with every call, you have to come up with the seed yourself. It would be nice to have the option to call your function without having to provide a seed.
This is not difficult to accomplish. We’ll set the default value for seed
to be NULL
. Then when the function begins we’ll check the value of seed
. If it is not NULL
, then the user has provided a seed, and we’ll use that. Otherwise, we’ll let R generate pseudo-random numbers however it likes.
<- function(reps = 10000, table = FALSE, seed = NULL) {
triangleSim if ( !is.null(seed) ) {
set.seed(seed)
}<- runif(reps)
cut1 <- runif(reps)
cut2 <- makesTriangle(cut1, cut2)
triangle if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(triangle))
cat("\n")
}cat(
"The chance of being able to form a triangle\n",
"is approximately ", mean(triangle), ".\n", sep = ""
) }
6.4.2 Program Development
Think about how we developed the triangleSim()
function.
- We began by writing simple code that got us a simulation for a few repetitions.
- When we were sure how to make a simulation work, we encapsulated our code into a function that we could call to perform the simulation for us.
- We thought about features that we would like the function to have—user chooses number of repetitions, user chooses whether to see a table, etc.—and implemented these features one by one. This made the function more general, i.e., useful in a wider range of settings.
The above method of program development is called encapsulation and generalization. For small projects of the sort that we have on hand, it’s a good way to go about writing a computer program.
6.4.3 Practice Exercises
Rewrite the
diceGameSim()
function from Section 6.3.4 so that:- it allows the user to set a seed;
- it has a
table
option to print a proportions table to the Console.
Rewrite the
photonTorpedoSim()
function from the previous section so that it allows the user to set a seed.Extend the
photonTorpedoSim()
just a bit more, by adding two new parameters:shield
: the level at which the Enterpise’s shields are set. (If the strength of the torpedo is less thanshield
, then there is zero damage.) The default value ofshield
should be 5.upper
: the upper bound on the strength of a torpedo-shot. (That is, a tropedo shot will be a random real number from 0 toupper
.) The default value ofupper
should be 10.
The program should stop with a helpful message if the user sets
upper
orshield
to a negative number.`
6.4.4 Solutions to Practice Exercises
- Here’s the code:
<- function(reps, seed = NULL, table = TRUE) {
diceGameSim if (!is.null(seed)) {
set.seed(seed)
}<- c(-1, 0, 3)
outcomes <- c(2 / 6, 3 / 6, 1 / 6)
probs <- sample(
winnings
outcomes,size = reps,
replace = TRUE,
prob = probs
)if (table) {
cat("Here is a table of the results:\n\n")
<- table(winnings)
tab print(round(prop.table(tab), digits = 3))
cat("\n")
}mean(winnings)
}
Let’s try it out:
diceGameSim(reps = 100000, seed = 5252)
Here is a table of the results:
winnings
-1 0 3
0.331 0.501 0.169
[1] 0.17503
- Here’s the code, if you use the
pmax()
function:
<- function(reps, seed = NULL) {
photonTorpedoSim if (!is.null(seed)) {
set.seed(seed)
}<- runif(reps, min = 0, max = 12)
strengths <- pmax(0, strengths - 5)
damage mean(damage)
}
Let’s try it out:
photonTorpedoSim(reps = 100000, seed = 4040)
[1] 2.034732
- Here’s the code:
<- function(reps, seed = NULL, shield = 5, upper = 10) {
photonTorpedoSim if (!is.null(seed)) {
set.seed(seed)
}if (upper < 0) {
return(cat("Upper bound on torpedo strength cannot be a negative number!"))
}if (shield < 0) {
return(cat("Shield strength cannot be set to a negative number!"))
}<- runif(reps, min = 0, max = upper)
strengths <- pmax(0, strengths - shield)
damage mean(damage)
}
Let’s try it out:
photonTorpedoSim(
reps = 100000,
seed = 4040,
upper = 15,
shield = 12
)
[1] 0.2991273
6.5 Example: Will They Connect?
Anna and Raj make a date for coffee tomorrow at the local Coffee Shop. After making the date, both of them forget the exact time they agreed to meet: they can only remember that it was to be sometime between 10 and 11am. Each person, independently of the other, randomly picks a time between 10 and 11 to arrive. If Anna arrives and Raj is not there, she will wait up to ten minutes for him, but will leave if he does not show within that time period. Raj is similarly disposed: he will wait ten minutes—but no more—for Anna. What is the chance that they meet?
The key to designing a simulation procedure is to realize that Anna and Raj will connect if and only if the difference between their arrival times is less than 10 minutes. It doesn’t matter who arrived first, so long as the difference is less than 10. This means that we want to check on the absolute value of Anna’s arrival time minus Raj’s arrival time. The two will meet provide that:
\[\vert\, \text{Anna's tme} - \text{Raj's time}\, \vert < 10.\]
The following code implements this idea:
Let’s try it out:
meetupSim(reps = 100000, table = TRUE, seed = 3939)
Here is a table of the results:
connect
FALSE TRUE
69781 30219
The proportion of times they met was 0.30219.
6.5.1 Practice Exercises
- Just for practice, rewrite a function called
meetupSim2()
that performs the same task asmeetupSim()
, but uses afor
-loop rather than vectorization.
6.5.2 Solutions to the Practice Exercises
- This will do:
<- function(reps = 10000, table = FALSE, seed = NULL) {
meetupSim2 if (!is.null(seed)) {
set.seed(seed)
}#create an empty vector to hold the results:
<- numeric(reps)
connect # loop through:
for (i in 1:reps) {
# get one arrival time for anna:
<- runif(1, 0, 60)
anna # and one for raj:
<- runif(1, 0, 60)
raj #compute result and record in connect:
<- (abs(anna - raj) < 10)
connect[i]
}#the rest is the same as in meetupSim:
if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(connect))
cat("\n")
}cat("The proportion of times they met was ", mean(connect), ".\n", sep = "")
}
6.6 Example: the Appeals Court Paradox
The following example is discussed in (J.Nahin 2008).
An appeals court generally consists of an odd number of justices, so the court will not be hampered with tie-votes and will thus always be able to render a decision for any case brought before it. Let’s imagine an appeals court that has five members. Let us further imagine that for each case before the court, each of the justices, independently of the others, makes a decision for either one or the other of the two opposing parties. Assume also for the sake of simplicity that there is always one side that is “in the right”. It follows that each judge is making a decision that is either correct or incorrect. The judges then report their decisions to one another, and the decision of the court is determined by majority vote. To be precise: whichever side gets three or more votes is the side that wins.
Although all of the members of the court are pretty sharp, they differ somewhat in their legal abilities:
- In any case that comes before her, Judge A has a 95% chance to decide correctly.
- Judge B has a 94% chance to judge correctly.
- Judges C and D each have a 90% chance to judge correctly.
- Judge E is the weak link on the court, with a mere 80% chance to judge correctly.
We are interested in estimating the probability that the majority opinion of the court will be correct.
Before we write a simulation in full, we should consider how we are going to simulate something that has a specified percentage chance of happening. For example, how would we simulate decisions made by Judge A?
We don’t need to simulate the actual decision in each case, we just need to simulate whether the decision was right or wrong.
Suppose we were to watch Judge A making 20 decisions, recording each time whether she was correct or not. In order to simulate the results with R, we could ask for:
sample(c(0, 1), size = 20, prob = c(0.05, 0.95), replace = TRUE)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
In the above results, 1 stands for a correct decision and the 0 stands for an incorrect decision. We have set the prob
parameter so that there is a 95% chance of 1, on each case.
This would give us a count of the number of times Judge A was correct, but it would not tell us which of the cases she judged correctly and which she judged incorrectly.
Assuming that the court hears 20 cases, we would like to simulate the decision of each judge in each case. We could store the results in variables, as follows:
<- sample(
a c(0, 1), size = 20, prob = c(0.05, 0.95),
replace = TRUE
# Judge A
) <- sample(
b c(0, 1), size = 20, prob = c(0.06, 0.94),
replace = TRUE
# Judge B
) <- sample(
c c(0, 1), size = 20, prob = c(0.10, 0.90),
replace = TRUE
# Judge C
) <- sample(
d c(0, 1), size = 20, prob = c(0.10, 0.90),
replace = TRUE
# Judge D
) <- sample(
e c(0, 1), size = 20, prob = c(0.20, 0.80),
replace = TRUE
# Judge E )
Now we come to the key idea in coding our simulation: in order to determine the total number of correct votes in each case, all we need to do is add the five vectors we created above:
<- a + b + c + d + e correctVotes
The court will decide correctly when the number of votes for the correct option is at least 3. The following logical vector records this:
<- (correctVotes >= 3) courtCorrect
Table 6.2 shows some possible results:
A | B | C | D | E | Number Correct | Court Correct |
---|---|---|---|---|---|---|
1 | 1 | 1 | 0 | 0 | 3 | TRUE |
1 | 1 | 1 | 0 | 1 | 4 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 0 | 1 | 4 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 1 | 0 | 4 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 0 | 1 | 4 | TRUE |
1 | 1 | 0 | 1 | 0 | 3 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 0 | 1 | 4 | TRUE |
0 | 1 | 1 | 1 | 1 | 4 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 1 | 1 | 5 | TRUE |
1 | 1 | 1 | 1 | 0 | 4 | TRUE |
1 | 1 | 0 | 1 | 1 | 4 | TRUE |
Note that all of the court’s decisions were correct, even though in a couple of cases the correct decisions were barely in the majority.
We now have the basic idea of the simulation. In order to estimate the probability of a correct decision, we simply need to recast our idea in the form of a function that permits us to simulate a very large number of imaginary court cases and that will report the results to us. Here is the code for such a function.
<- function(reps = 10000,
courtSim seed = NULL,
table = FALSE,
probs = c(0.95, 0.94, 0.90, 0.90, 0.80)) {
if (!is.null(seed)) {
set.seed(seed)
}
# get the probabilities
<- probs[1]
aProb <- probs[2]
bProb <- probs[3]
cProb <- probs[4]
dProb <- probs[5]
eProb
# simulate decisions of each judge:
<- sample(
a c(0, 1),
size = reps, prob = c(1 - aProb, aProb),
replace = TRUE
# Judge A
) <- sample(
b c(0, 1),
size = reps, prob = c(1 - bProb, bProb),
replace = TRUE
# Judge B
) <- sample(
c c(0, 1),
size = reps, prob = c(1 - cProb, cProb),
replace = TRUE
# Judge C
) <- sample(
d c(0, 1),
size = reps, prob = c(1 - dProb, dProb),
replace = TRUE
# Judge D
) <- sample(
e c(0, 1),
size = reps, prob = c(1 - eProb, eProb),
replace = TRUE
# Judge E
)
# count the number of correct votes in each case:
<- a + b + c + d + e
correctVotes
# determine whether court decided correctly, in each case:
<- (correctVotes >= 3)
courtCorrect
# record results
if (table) {
cat("Here is a table of the results:\n\n")
print(table(courtCorrect))
cat("\n")
}cat(
"The proportion of times the court was correct was ",
mean(courtCorrect), ".\n",
sep = ""
) }
Let’s now estimate the probability of the court rendering a correct verdict, using one hundred thousand simulated cases:
courtSim(reps = 100000, seed = 3838, table = TRUE)
Here is a table of the results:
courtCorrect
FALSE TRUE
787 99213
The proportion of times the court was correct was 0.99213.
The court seems to be doing quite well! But of course this is not very surprising: after all, most of the judges make the correct decision almost all the time. It is interesting, though, that the chance of the full court rendering a correct verdict is higher than the chance of any individual judge to be correct. There appears to be a benefit to the voting procedure.
But things get even more interesting if we imagine that the judges recognize, after some time, that Judge E simply isn’t as sharp as Judge A, and that they pressure him into voting whichever way Judge A votes. “Surely”, they reason, “since Judge A is so good this new policy will increase our chance to hand down a correct verdict!”
We will simulate to see if the judges’ reasoning is correct. All we need to do is to count Judge A’s vote twice in the sum, and not count Judge E’s vote at all:
<- 2*a + b + c + d correctVotes
Here is the new simulation-function:
<- function(reps = 10000,
courtSim2 seed = NULL,
table = FALSE,
probs = c(0.95, 0.94, 0.90, 0.90)) {
if ( !is.null(seed) ) {
set.seed(seed)
}
# get the probabilities
<- probs[1]
aProb <- probs[2]
bProb <- probs[3]
cProb <- probs[4]
dProb
# simulate decisions:
<- sample(
a c(0, 1),
size = reps, prob = c(1 - aProb, aProb),
replace = TRUE
# Judge A
) <- sample(
b c(0, 1),
size = reps, prob = c(1 - bProb, bProb),
replace = TRUE
# Judge B
) <- sample(
c c(0, 1),
size = reps, prob = c(1 - cProb, cProb),
replace = TRUE
# Judge C
) <- sample(
d c(0, 1),
size = reps, prob = c(1 - dProb, dProb),
replace = TRUE
# Judge D
)
# count the number of correct votes in each case:
<- 2*a + b + c + d
correctVotes
# determine whether court decided correctly, in each case:
<- (correctVotes >= 3)
courtCorrect
# record results
if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(courtCorrect))
cat("\n")
}cat("The proportion of times the court was correct was ",
mean(courtCorrect), ".\n", sep = "")
}
Let’s try it out:
courtSim2(reps = 100000, seed = 3838, table = TRUE)
Here is a table of the results:
courtCorrect
FALSE TRUE
1221 98779
The proportion of times the court was correct was 0.98779.
Hey! That’s a lower chance of success than before! The difference is small but significant: compelling the weakest judge to vote with the strongest judge actually decreased the court’s chance of rendering a correct verdict overall. This circumstance is sometimes called the Appeals Court Paradox, but it occurs in many other practical situations. More often than you might think, the benefit of independent voting can outweigh the advantage associated with relying solely on a small number of “experts.”
6.7 Example: How Many Numbers Needed?
Let us pause for a review.
In the last few Sections we have dealt with estimation of a probability: the chance for a particular event to occur. We have accomplished this with Monte Carlo simulation:
- We repeated the chance process a very large number of times
- On each repetition, we recorded whether or not the event occurred.
- After we finished all of the repetitions, we computed the proportion of those repetitions in which the event did occur.
- This proportion served as our estimate of the probability for the event to occur.
- We know that the more repetitions we arrange for, the closer our estimate is liable to be to the actual probability for the event to occur.
In this Section we will return to the task of estimating the expected value of a random variable. Recall that a random variable is a number whose value is the result of some chance process. The expected value of the random variable is the average—the mean—of its values over many, many repetitions of that chance process.
Here’s the scenario. Consider the following chance process:
- Pick a real number \(x_1\) between 0 and 1 at random.
- Pick another real number \(x_2\) at random (also between 0 and 1). Add it to \(x_1\).
- If the sum \(x_1 + x_2\) exceeds 1, then stop.
- Otherwise, pick a third real number \(x_3\), and add it to the sum of the previous two.
- If the sum \(x_1+x_2+x_3\) exceeds 1, then stop.
- Otherwise, pick a fourth real number …
The idea is to keep on going in this way until the sum of the numbers you have picked exceeds 1.
The random variable we are interested in is the number of numbers we have to pick to make the sum of those numbers exceed 1. Let us call this random variable \(X\). As you can see, this number could be 2, or 3, or 4, or even more depending on how small the random numbers you pick happen to be.
In order to fully grasp what is going on, it will be helpful to go through the process of computing \(X\). The following function will do this for us:
<- function(verbose = FALSE) {
numberNeeded <- 0
mySum <- 0
count while( mySum < 1 ) {
<- runif(1)
number <- mySum + number
mySum <- count + 1
count if ( verbose ) {
cat("Just now picked ", number, ".\n", sep = "")
cat(count, " number(s) picked so far.\n", sep = "")
cat("Their sum is ", mySum, ".\n\n", sep = "")
}
}
count }
Call the function a few times, like this:
numberNeeded(verbose = TRUE)
Just now picked 0.2869393.
1 number(s) picked so far.
Their sum is 0.2869393.
Just now picked 0.7092682.
2 number(s) picked so far.
Their sum is 0.9962075.
Just now picked 0.7524099.
3 number(s) picked so far.
Their sum is 1.748617.
[1] 3
As you can see, it’s quite common to need 2 or 3 numbers to make the sum exceed 1, but sometimes, you need 4, 5 or even more numbers.
Since we are interested in the expected value of \(X\), we should repeat the process a large number of times and compute the mean of the \(X\)-values that we find. The following code repeats the process 1000 times:
<- numeric(1000)
needed for (i in 1:1000 ) {
<- numberNeeded()
needed[i]
}cat("The expected number needed is about ",
mean(needed), ".\n", sep = "")
The expected number needed is about 2.697.
Of course the expected value indicates what \(X\) will be on average. By itself it doesn’t tell us about the variability in \(X\)—how much \(X\) tends to “hop around” from one repetition to another. In order to get a sense of the variability of \(X\), we can make a table of the results of our 1000 repetitions:
table(needed)
needed
2 3 4 5 6 7
528 303 123 37 8 1
Sure enough, we usually got 2 or 3, but higher values are certainly possible.
For the most part we are less interested in the number of times we got each possible value than we are in the proportion of times we got each possible value. For this all we need is prop.table()
:
prop.table(table(needed))
needed
2 3 4 5 6 7
0.528 0.303 0.123 0.037 0.008 0.001
The proportions serve as estimates of the probability of getting each of the possible values of \(X\). Such a table of proportions gives us a sense of the distribution of \(X\)—the chances for \(X\) to assume each of its possible values.1
Now that we have the basic idea of how to accomplish the simulation, it remains to package up our code into a nice function that will enable a user to set the number of repetitions, set a seed, and choose reporting options. Along the way we will generalize a bit, allowing the user to set a target sum other than 1. We’ll also return our estimate of the expected number needed invisibly, so that the function can be used in the context of larger-scale investigations (see the exercises from this Chapter).
Let’s test the function with ten thousand repetitions:
numberNeededSim(target = 1, reps = 10000, seed = 4848, report = TRUE)
needed
2 3 4 5 6 7 8
0.4937 0.3350 0.1313 0.0317 0.0071 0.0010 0.0002
The expected number needed is about 2.7273.
[1] 2.7273
Does the estimate of the expected number of numbers look a bit familiar? Mathematicians have proven that the exact expected value—what you would get for an average if you could do more and more repetitions, without limit—is the famous number \(e\) (see Note 6.1).
In the exercises you will be asked to investigate the expected value when the target is a number other than 1.
6.7.1 Practice Exercises
Scarecrow is walking through a meadow filled with many, many flowers. 15% of them are blue. He picks flowers until he has got two blue ones. Let \(X\) be the number of flowers that he must pick in order to get two blue flowers. clearly, \(X\) is a random variable. What is its expected value? Write a function called
pick_sim()
to estimate the expected value. It should take two parameters:reps
: the number of times to simulate picking until he gets two blues. The default should be 10,000.seed
: a seed that the user can set. The default value should beNULL
.
The function should return its estimate of the expected value of \(X\).
Abe and Bo engage in a duel. They take turns shooting at each other until one person manages to hit the other, thus winning the dual. Abe has a 40 percent chance of hitting with each shot he takes, whereas Bo has a 60 percent chance of hitting on each of his shots. Since Abe is the weaker shooter, Bo generously permits Abe to go first. We would like to estimate Abe’s chance of winning the duel. Write a function called
dual()
to estimate this probability. The function should have two parameters:reps
: the number of times to simulate a dual. The default should be 10,0000.seed
: a seed that the user can set. The default value should beNULL
.
6.7.2 Solutions to the Practice Exercises
- Here is the function:
<- function(reps = 10000, seed = NULL) {
pick_sim if (!is.null(seed)) {
set.seed(seed)
}## set up numerical vector to hold results:
<- numeric(length = reps)
results for (i in 1:reps) {
## set up the walk:
<- 0
blue_count <- 0
pick_count <- TRUE
picking while (picking) {
<- sample(c("blue", "not"), size = 1, prob = c(2/5, 3/5))
flower <- pick_count + 1
pick_count if (flower == "blue") {
<- blue_count + 1
blue_count if (blue_count == 2) picking <- FALSE
}
}<- pick_count
results[i]
}mean(results)
}
Let’s try it:
pick_sim(seed = 2020)
[1] 5.0416
In a probability course one can show that the expected value is exactly 5.
- Here’s the function:
<- function(reps = 10000, seed = NULL) {
dual if (!is.null(seed)) {
set.seed(seed)
}# set up logical vector to store results of each simulation:
<- logical(reps)
abeWins for (i in 1:reps) {
<- TRUE
dualing <- 1
shotNumber while (dualing) {
# if i is an odd number, then Abe is shooting:
<- shotNumber %% 2 == 1
abeToShoot # probability of hitting depends on who is shooting:
<- ifelse(abeToShoot, 0.4, 0.6)
hitProb # shooter takes a shot:
<- sample(c("hit", "miss"),
resultOfShot size = 1,
prob = c(hitProb, 1 - hitProb))
if (resultOfShot == "hit") {
<- ifelse(abeToShoot, TRUE, FALSE)
abeWins[i] <- FALSE
dualing
}<- shotNumber + 1
shotNumber
}
}# count up how many times abe won, divide by number of plays,
# and return:
sum(abeWins)/reps
}
Let’s try it out:
dual(reps = 100000, seed = 3939)
[1] 0.52441
Looks like Abe has a slight advantage!
6.8 More in Depth
6.8.1 Expected Value for Discrete Random Variables
Let’s revisit the Coin Toss Game (Section @ref(a-coin-toss-game)).
Recall the rules: you flip a fair coin twice.
- If you get Tails both times, you lose a dollar.
- If you get exactly one Head, nothing happens.
- If you get Heads both times, you win two dollars.
Let \(W\) be the number of dollars you will win. Thinking about the possibilities for the outcome of flipping a fair coin twice, we came up with the distribution of \(W\):
- The chance that \(W=-1\) is 0.25.
- The chance that \(W=0\) is 0.50.
- The chance that \(W=2\) is 0.25.
We wrote simulations to estimate the expected value of \(W\).
However, when we know the distribution of a random random variable, then we can calculate its expected value exactly—no simulations required!
All we have to do is to compute the weighted average of the possible values of the random variable. In this case it is:
\[0.25 \times -1 + 0.50 \times 0 + 0.25 \times 2.\]
Each possible value of the random variable is multiplied by the probability of getting that value, and the products are then added up. For \(W\), this works out to 0.25, or 25 cents.
You could have R compute it for you:
<- c(-1, 0, 2)
outcomes <- c(0.25, 0.50, 0.25)
probs <- sum(probs * outcomes)
expected_value expected_value
[1] 0.25
6.8.2 More on the Law of Large Numbers.
The Law of Large Numbers says that the more times you repeat the simulation, the better our estimate of the desired probability—or expected value—is liable to be. Let’s use the Stick-Splitting Problem (Section 6.4) to investigate the effect of the choice of repetitions on the accuracy of our estimate of the probability of getting a triangle.
We will need to run the triangleSim()
function several times, with a different choice for reps
each time, keeping track of the estimates we obtain. In the process we won’t need tables or other output to the console. Let’s rewrite triangleSim()
for maximum flexibility, making console output optional and using the invisible return discussed in Section @ref(invisible-returns).2
<- function(reps = 10000, table = FALSE, seed = NULL,
triangleSim verbose = TRUE) {
if ( !is.null(seed) ) {
set.seed(seed)
}<- runif(reps)
cut1 <- runif(reps)
cut2 <- makesTriangle(cut1, cut2)
triangle if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(triangle))
cat("\n")
}if ( verbose ) {
cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}invisible(mean(triangle))
}
Next we make a vector of repetitions, and a vector to hold our estimates:
<- c(100, 1000, 10000, 100000, 1000000)
reps <- numeric(length(reps)) estimates
Next we loop through reps
, performing a complete simulation for each repetition-number and storing our results:
for ( i in 1:length(reps)) {
<- triangleSim(reps = reps[i], seed = 3030,
estimates[i] verbose = FALSE)
}
Let’s have a look at our estimates:
names(estimates) <- reps
estimates
100 1000 10000 1e+05 1e+06
0.200000 0.245000 0.254700 0.251570 0.249779
The estimates are indeed getting closer to the theoretical value of \(1/4\).3
6.8.3 Simulating Binomial Random Variables
Suppose that a weighted coin has a 60% chance to lands heads on each flip. You flip it 30 times and count up the number of Heads you get. How might such a process be simulated in R?
One approach makes use of sample()
. First, get the outcome of each flip, coding it as 1 the coin lands Heads, 0 if it lands Tails:
set.seed(2020)
<- sample(
outcomes c(0, 1), size = 30, prob = c(0.40, 0.60),
replace = TRUE
)
Then sum the 0-1 outcomes:
sum(outcomes)
[1] 18
That’s how many Heads you got.
Of course there is nothing wrong about doing it all with a single expression (it’s just a little bit hard to follow):
sum(
sample(
c(0, 1),
size = 30,
prob = c(0.4, 0.6),
replace = TRUE
) )
[1] 18
Another—and more convenient—approach is to use the function rbinom()
. This function also simulates the results of flipping a coin a number of times and counting how many heads one gets. The general form of a call to rbinom()
is as follows:
rbinom(n, size, prob)
In this call:
size
is how many times you plan to flip the coin;prob
is the chance on any flip that the coin will turn up heads;n
is how many times you plan to repeat the process of flipping the coinsize
times (counting up the number of heads each time).
In order to simulate flipping our coin thirty times, you would ask for:
rbinom(n = 1, size = 30, prob = 0.6)
[1] 17
This time we got 17 Heads.
If you want to go through the above process twenty times instead of just doing it once, you could ask for:
rbinom(n = 20, size = 30, prob = 0.5)
[1] 16 14 16 15 12 11 12 14 7 16 17 17 18 14 14 15 20 16 15 13
Nothing in rbinom()
says that it addresses only coin-flipping. In general it serves as a model for any situation in which:
- there is a fixed number of trials (e.g., flip of a coin, taking a free-throw shot, deciding about a court case);
- each trial has two possible outcomes, often termed Success or Failure (e.g., coins lands heads or tails, you make the shot or you don’t, you decide the case correctly or you don’t);
- the chance of the Success outcome is the same, on any trial (e.g., the coin always has the same chance of coming up heads, for any case Judge A has a 95% chance of being right);
- the outcome of any trial is independent of the outcome of all other trials (e.g., making the first free throw does not increase or decrease one’s chances of making the next free throw, etc.)
- you are counting up the number of successes in your trials.
In such a situation the count of the number of successes is called a binomial random variable.4 The size
parameter in rbinom()
gives the fixed number of trials. The prob
parameter specifies the chance of success on each trial.
6.8.4 R’s Pipe Operator |>
Earlier we simulated a binomial random variable with the following expression:
sum(
sample(
c(0, 1),
size = 30,
prob = c(0.4, 0.6),
replace = TRUE
) )
The expression is a bit difficult to understand because it has to be read “inside-out”:
- We begin inside, concatenating 0 and 1 (
c(0, 1)
). - Then we sample 30 times from from the vector we formed (
sample()
). - Then we add our samples (
sum()
).
In R, you can use the “pipe” operator |>
to write it top-down, like this:
c(0, 1) |>
sample(size = 30, prob = c(0.4, 0.6), replace = TRUE) |>
sum()
[1] 16
The reason this works is that the pipe |>
takes the value of the expression before it and uses it as the first argument of the expression after it.
Another example:
"Hello" |> rep(4) ## same as rep("hello", times = 4)
[1] "Hello" "Hello" "Hello" "Hello"
If you want to pipe the value of the left-hand side into someplace other than the first argument, then you can use the underscore _
as a placeholder to show where it should go:
4 |> rep("Hello", times = _) ## same as rep("hello", times = 4)
[1] "Hello" "Hello" "Hello" "Hello"
But this doesn’t work:
"Dorothy" |> cat("Hello, ", _, "!\n", sep = "")
Error in cat("Hello, ", "_", "!\n", sep = ""): pipe placeholder can only be used as a named argument (<input>:1:14)
6.8.5 Practice Exercises
Use
rbinom()
to simulate taking twenty free throws, where you have a 70% chance to make each shot. Your simulation should produce a vector of 20 ones and zeros, where 1 stands for making the shot and 0 stands for missing it.Use
rbinom()
to simulate taking twenty free throws, where you have a 70% chance to make each shot, and counting up how many shots you make. Your simulation should produce a vector of length one that represents the number of shots you made.Every morning for 100 mornings in a row you take twenty free throws. You have a 70% chance to make each shot, and you always count up how many shots you make. Use
rbinom()
to simulate the counts. Your simulation should produce a vector of length 100 where the first element represents the number of shots you made on day 1, the second element represents the number of shots you made on day 2, etc.You have a 70% chance to make each free throw shot you take. Your friend Lester has a 60% chance to make each shot. Both of you will take 20 shots. You are interested in knowing the probability that you will make more shots than Lester does. Write a function called
freeThrowSim()
that will help you estimate this probability. The function should take two parameters:seed
, so that the user can set a seed;reps
, the number of times to simulate a competition between you and Lester.
The function should return the estimated probability.
For the scenario of Anna and Raj (Section 6.5), enhance the function
meetupSim2()
so that it has averbose
parameter. Whenverbose
isTRUE
, the function should behave as it currently does, but also return the estimated probability invisibly. (See Section 4.5.3.) The default value ofverbose
should beTRUE
. Typical examples of use would be as follows:
<- meetupSim2(reps = 1000, verbose = TRUE, seed = 2020) res
Here is a table of the results:
connect
FALSE TRUE
739 261
The proportion of times they met was 0.261.
cat("The function returned ", res, ".\n", sep = "")
The function returned 0.261.
meetupSim2(reps = 1000, seed = 2020)
[1] 0.261
6.8.6 Solutions to the Practice Exercises
- Try this:
rbinom(20, size = 1, prob = 0.70)
[1] 1 0 1 0 1 1 1 0 1 1 1 1 0 0 0 1 1 0 1 1
- Try this:
rbinom(1, size = 20, prob = 0.70)
[1] 15
- Try this:
rbinom(100, size = 20, prob = 0.70)
- Here’s one way to write the function, if you decide to use a loop:
<- function(reps, seed, shots,
freeThrowSim you = 0.70, lester = 0.60) {
<- logical(reps)
youWin for ( i in 1:reps ) {
<- rbinom(1, size = shots, prob = you)
youMake <- rbinom(1, size = shots, prob = lester)
lesterMakes <- (youMake > lesterMakes)
youWin[i]
}mean(youWin)
}
Here’s a way to write it, if you vectorize:
<- function(reps, seed, shots,
freeThrowSim you = 0.70, lester = 0.60) {
<- rbinom(reps, size = shots, prob = you)
youMake <- rbinom(reps, size = shots, prob = lester)
lesterMakes <- (youMake > lesterMakes)
youWin mean(youWin)
}
Let’s try it:
freeThrowSim(reps = 100000, shots = 20, seed = 3535)
[1] 0.69637
You have nearly a 70% chance to make more shots than Lester.
- Here is such a function:
<- function(reps = 10000, verbose = FALSE, seed = NULL) {
meetupSim2 if (!is.null(seed)) {
set.seed(seed)
}<- runif(reps, 0, 60)
anna <- runif(reps, 0, 60)
raj <- (abs(anna - raj) < 10)
connect <- mean(connect)
estimate if (verbose) {
cat("Here is a table of the results:\n\n")
print(table(connect))
cat("\n")
cat(
"The proportion of times they met was ",
".\n",
estimate, sep = ""
)invisible(estimate)
else {
}
estimate
} }
6.9 The Main Ideas of This Chapter
- The probability of an event vs. the expected value of a random variable.
- To estimate the probability that an event occurs, use looping or some other means to generate a logical vector of simulated results (call it
results
if you like) whereTRUE
means the event occurred andFALSE
means that it did not occur. Thenmean(results)
computes the proportion of times that the event occurred, so it will be your estimate of the probability that the event occurred. - To estimate the expected value of a random variable, use looping or some other means to generate a logical vector of results (call it
results
if you like) where each element is a simulated value of the random variable. Thenmean(results)
will be your estimate of the expected value of the random variable. - In R, vectorization usually works a lot faster than looping. Use it when you can.
- This chapter is mostly about flow-control—especially looping—as applied to simulation of random processes, so pay special attention to examples where
for
-loops andwhile
-loops were used. - Pay special attention to how we created and refined functions through the encapsulation and generalization process, especially:
- options to add tables and other information beyond the basic results;
- options for handling random seeds, implemented with:
if (!is.null(seed)) { set.seed(seed) }
- This Chapter provides lots of practice with two familiar randomizing functions:
sample()
runif()
Links to Class Slides
Quarto Presentations that I sometimes use in class:
Glossary
- Probability of an Event
-
The long-run proportion of times that the event will occur if the underlying random process is repeated many, many times.
- Random Variable
-
A number whose value depends on the outcome of a chance process.
- Expected Value of a Random Variable
-
The long-run average of the values of a random variable, where the underlying chance process is repeated many, many times.
- Continuous Random Variable
-
A random variable that can assume any value in a range of real numbers.
- Distribution of a Random Variable
-
A statement of the probabilities for a random variable to assume its various possible values.
- Encapsulation and Generalization
-
A method for developing a computer program according to which the programmer first designs a basic procedure, then encapsulates it in one or more functions, and then progressively generalizes these functions until the program possesses all desired features.
Exercises
Exercise 1
Consider the following two games.
- In Game A you flip a fair coin. If the coin comes up Heads you get two dollars, whereas if it comes up Tails you get one dollar.
- In Game B you roll a fair die. If the six-spot comes up, you win twenty-five dollars. If you get 2, 3, 4, or 5, nothing happens. If the one-spot comes up, you lose fifteen dollars.
In this exercise you need to answer two questions:
- If you could choose to play either Game A or Game B just once, which game would you prefer to play, and why?
- Suppose that you can choose either to play Game A ten thousand times, or Game B ten thousand times. Which choice would you prefer, and why?
Answers to the first question will vary from person to person, depending on circumstances and personal taste. On the other hand, astute consideration of expected values leads most people to answer the second question in the same way. Therefore, before you answer the question please write two simulation functions: one called GameA
that provides an estimate of the expected value of the winnings from a play of Game A, and another called GameB
that estimates the expected value of the winnings from a play of Game B. The functions should have two parameters:
- a
reps
parameter to permit the user to choose the number of repetitions of the game; - a
seed
parameter to permit the user to choose a starting seed.
Use both functions with fifty thousand reps, in order to obtain estimates for the expected amount of winnings in a single play of each game. Use a simple starting seed for each function. Report the seeds you used, along with the estimated expected values for each game. Then use your results to decide which game is the better choice to play ten thousand times.
Note: Please keep in mind that the fifty-thousand figure is just the number of times you repeat your simulations in order to estimate the expected value of a single play of one of the games. It’s not how many times you actually get to play either game—that’s ten thousand, in the second question.
Exercise 2
Reconsider the meeting of Anna and Raj (Section 6.5). Suppose that they still plan to meet during the same one-hour period, but that Anna is willing to wait \(a\) minutes and Raj is willing to wait \(r\) minutes, where \(a\) and \(r\) could differ. Modify the meetupSim()
function (see Listing 6.1) so that it accepts two additional parameters: * a
: the time Anna is willing to wait; * r
: the time Raj is willing to wait;
Apply the function with ten thousand repetitions (and a simple seed that you report) in order to estimate the probability of a meeting, assuming that Anna is willing to wait 13 minutes and Raj is willing to wait 7 minutes.
Hint: Suppose that Anna arrvies at time \(anna\) and Raj arrives at times \(raj\). We want a simple mathematical expression that gives the conditions for Anna and Raj to connect. In the example in the text, this was:
\[\vert\ anna - raj \ \vert < 10\]
because both Anna and Raj were willing to wait the same amount of time: ten minutes. Now we have to think a bit more carefully when Anna waits \(a\) minutes and Raj waits \(r\) minutes, where \(a\) and \(r\) may differ. You might start by observing that we must have
\[anna \le raj + r,\]
for if not then Raj would arrive and leave before Anna gets there. Also, we would have to have
\[raj \le anna + a,\]
for if not then Anna would arrive and leave before Raj gets there. But if we have both of those conditions:
\[anna - \le raj + r \text{ and } raj \le anna + a\]
then they must connect, since neither person will arrive too much before the other.
Exercise 3
Review the function dual()
in the Practice Exercises of this Chapter (see sec-practicenumbersneeded). Modify the function as follows:
- Add a parameter
abe
that gives the probability for Abe to hit when he shoots, and a parameterbo
that gives the chance for Bo to hit when he shoots. the probabilities should be expressed as numbers between 0 and 1. (Thus, a 30% chance of hitting is 0.30.) - Make the function estimate the expected value of the number of shots taken in a dual, rather than the chance for Abe to win.
When you have your function working, use it to estimate the expected number of shots when:
- Abe’s chance of hitting is 50 percent and Bo’s chance if 50%;
- Abe’s chance of hitting is 20% and Bo’s chance of hitting is 25%.
In each simulation, use one hundred thousand repetitions and a seed that you choose.
Exercise 4
A candy-maker makes a particular type of candy that always has 20 candies in each bag. When you buy a bag of this candy, you take a few candies from it every day until they are all gone. The amount you take each day is a random whole number from 1 to 10. A possible scenario is:
- Day 1: take 4 candies. (There are 16 left.)
- Day 2: take 8 candies. (There are 8 left.)
- Day 3: take 3 candies. (There are 5 left.)
- Day 4: take 6 candies. (There were only 5 left, so you take all the remaining candies.)
In the scenario above, it took you four days to finish the bag.
Write a function called daysNeededSim()
that will estimate the expected number of days required to empty a bag. It should take two parameters:
reps
: the number of times to simulate the process of emptying one bag. The default value should be 10,000.seed
: a random seed that the user can set. The default value should beNULL
.
Apply the function with 50,000 repetitions and a seed that you report, in order to estimate the expected number of days required to empty the bag.
Hint: This scenario is quite similar to that of the Number-Needed problem (Section 6.7. Begin by reviewing it.
Exercise 5
A gambler starts with 10 dollars, and repeatedly plays a game. If she wins the game she gets one dollar and if she loses the game she loses one dollar. Her chance of winning each game is a fixed number that is less than 0.50. Write a function to simulate the gambler repeatedly playing the game until her money runs out. Write the function so that it has
- a
start
parameter for the initial amount of money the gambler has (the function should stop if the user provides a value less than 0); - a
seed
parameter (defaultNULL
); - a
p
parameter for the chance of winning, so that the user can enter any chance less than or equal to 0.5. (The function should stop the user if the user enters more than 0.5.)
The function should return the number of times she plays until she is broke. (Include in this count the play when she goes broke.).
Apply the function with a simple seed that you report, a starting amount of ten dollars, and a chance of winning equal to 0.45.
Exercise 6*
Refer back to the numberNeededSim()
function (see Listing 6.2). Write a program that computes the estimated expected number needed, for the following targets:
<- seq(0.05, 0.95, by = 0.05) targets
Each estimate should be based on ten thousand repetitions. Start with a simple seed that you report. Report the estimated expected values for all of the required targets. Compare these estimates with exp(targets)
(the number \(e\) raised to the power of each of the targets.) Formulate a conjecture about the value of the expected number of numbers needed, when the target is a real number between 0 and 1.
Hint: Recall that numberNeededSim()
function returns its estimate of the expected number needed. Hence you could use it inside a for
-loop that iterates through the elements of the targets
vector above, something like this:
<- seq(0.05, 0.95, by = 0.05)
targets <- length(targets)
n <- numeric(n)
results set.seed(3030) # or some other simple seed that you like
for ( i in 1:n ) {
# use numberNeededSim() with:
# * 10000 reps
# * seed left at the default NULL (you provided one already)
# * target set to targets[i]
# * report left at FALSE (no need to have R talk to you at each step)
# make sure to store the estimate in results[i]
}# after the loop, you can compare results with exp(targets)
Exercise 7*
A pipe-smoker has two boxes of matches in his pocket. Both boxes contain 40 matches. Every time he lights his pipe, the smoker reaches into his pocket and randomly picks one of the boxes, pulling a match from the box. Eventually one of the boxes runs out. What is the expected value of the number of matches remaining in the other box at that time? Write a simulation function to estimate the answer. Your function should have at least a reps
and a seed
parameter. Apply the function with ten thousand repetitions—and with a simple seed that you report—in order to estimate the expected value.
Exercise 8*
An absent-minded professor walks to and from campus every day. Prior to every commute—either from home to office or the reverse—she checks the weather, taking an umbrella with her if it is raining. The problem is that she never takes an umbrella when the weather is dry, so even though she owns quite a few umbrellas the time comes eventually when it is raining but all of her umbrellas are at her destination, so that she will have to put up with a wet commute.
Suppose that there are \(x\) umbrellas at her house and \(y\) umbrellas at her office, and that she begins at home. Suppose further that the chance of rain on any commute is \(p\), a number between 0 and 1.
Write a simulation function to estimate the expected number of dry commutes that she will enjoy before her first wet commute. The function should have five parameters:
x
: the initial number of umbrellas at home (default 1);y
: the initial number of umbrellas at the office (default 1);p
: the probability of rain at any commute (default 0.5);seed
: an seed for the randomizers (defaultNULL
).reps
: the number of times to simulate commuting until wet.
Apply the function with x
and y
set at three umbrellas, for the values 0.1, 0.5 and 0.9 for p
. Use 10,000 repetitions for each simulation, with seeds that you report. Which value of p
results in the smallest expected number of dry commutes?
Of course not all of the possible values for \(X\) appear on the table: the only values appearing are those that happened to occur in our simulation. The values that did not occur probably have a very small (but nonzero) chance of occurring.↩︎
The addition of the invisible return may be thought of as yet another step in the encapsulation-and-generalization process of program development!↩︎
Probability theory tells us that when we use Monte Carlo simulation to estimate a probability or an expected value, then the likely size of the error in the approximation is inversely proportional to the square root of the number of repetitions we use. From this it follows that to cut the likely size of the error in half you should increase the number of repetitions by a factor of 4. If you want the likely size to be one-third as big, increase repetitions by a factor of 9, and so on.↩︎
“Binomial” is borrowed from a Greek words whose literal meaning is “two names”—corresponding to the two possible outcomes of a trial.↩︎