Now that you've got the basic Monte Carlo infrastructure in place (the basic RandomVariable
implementations for random distributions plus arithmetic expressions built out of random variables) you're ready to start building an actual Monte Carlo model.
There are three basic steps to building a Monte Carlo model.
First, we need to clearly state the question we are attempting to answer and define what an answer looks like. For instance a question might be something like, “How long will this project take?” and the answer might be in the form of a 90% prediction interval on the total time. But the question must be one that has an answer. Something like “I’m going to simulate the game of Blackjack” is not a question. Simulating Blackjack may allow us to answer some question, but until we know what our question is we’re not ready to build a model.
Second, we need to identify the sources of uncertainty. If we’re trying to answer the question of how long a project will take, some of the sources of uncertainty might be how long the different parts of the project will each take plus things like how often events will occur that delay completion. (E.g. in a robotic project, if the robot breaks and can’t be tested, that will probably delay the project. The amount of the delay will also be uncertain.) In our model, we represent each source of uncertainty with a random variable with some distribution.
Finally, we need to define how all the uncertain values are combined to produce a final answer. That is, if we had exact values for all the sources of uncertainty, how would we compute the final answer. This is the bulk of the model.
For instance, if we have a project with two parts, planning and execution, the sources of uncertainty are how long each part is going to take individually and a trivial model of how long the whole project is going to take would be to sum the two times. If we put in different values for the planning and execution time we’ll get different totals out. The key idea behind Monte Carlo models is that by generating many random sets of input values we can then get many different possible outputs and say things about how those results are distributed. For instance, what interval captures 90% of the results? Or what are the probabilities of certain outcomes?
So that’s the basic structure. Within the general category of Monte Carlo simulations, there are a few levels of complexity of model we can choose to build.
The simplest model worth using the Monte Carlo technique on is one that simply combines two or more underlying random distributions mathematically. A simple example would be predicting/estimating the number of minutes of class time we will lose to fire alarms modeled as the total number of fire alarms in the year multiplied by the time spent on the football field for each alarm.
If we had 90% prediction intervals for each of those quantities and assumed some distribution (normal or log normal probably) we can build a “model” of which one run consists of generating a random value for each of the two quantities and multiplying them together to get the total number of minutes lost in that run.
Then we can run that simulation a few hundred thousand times and collect the resulting values and answer all kinds of questions about the distribution of the results: what is the average number of minutes lost? What is the 90% prediction interval on the number of minutes lost? What is the probability that we will lose more than 2,000 minutes? (To answer the last question we just count the number of results over 2,000 and divide by the total number of runs.)
Some of these—for this simple model—we could compute analytically, without running a simulation at all. For instance the average of the product of two independent random variables (variables whose values don’t depend on each other) is the product of their averages. And the standard deviation can also be computed directly, assuming they are independent. But as models get more complex those analytic techniques become harder and harder to apply whereas throwing computing power at simulations is easy.
Another way to model the fire alarm evacuation time that might be easier to wrap our heads around is instead of estimating how many fire alarms there will be in a year, to estimate the probability that we’ll have an alarm on any given day.
Then, instead of simply generating two random values as in the previous model, we simulate a year’s worth of school by simulating 180 school days and for every day generating a uniformly distributed random number between 0 and 1. If that number is less than our estimated probability of a fire alarm, that means there was a fire alarm that day and we generate a random duration for the length of the evacuation and add it to our total. After we have simulated all 180 days of the year, the total is the result of one run of the simulation. Then repeat that process a few hundred thousand times and analyze the results as before.
This model is still simple enough that we could solve it analytically: in fact we can approximate the number of fire alarms that would occur given a daily probability with a normal distribution whose mean and standard deviation are a function of the probability and the number of days. Or we can just run a Monte Carlo simulation and figure it out in a more brute force way.
Note, however, that this model, while still quite simple, requires more computation than the first model: in the first model we just need to generate two random numbers and multiply them for each run; in this model we need to generate 180 random numbers to determine whether there’s an alarm, then up to 180 more random numbers for the lengths of each evacuation, and add them all up, just for one run. None of which is going to take very long on a modern computer but it is worth understanding that there’s a trade off where more complex models may be easier to describe and build but also take longer to run.
The other problem with the model described in the last section is that it is only as good as our estimates of the two inputs. If we could accurately estimate the number of fire alarms in a year and could accurately estimate how long each evacuation would take, we could also likely make a pretty good estimate directly of how many total minutes we’d spend evacuated.
More likely, if we are making good estimates (in the sense of truly capturing our uncertainty with a sufficiently wide 90% prediction interval) the resulting prediction interval of the combination would be very wide; maybe so wide as to be useless.
One of the reasons the Monte Carlo technique is useful is that it lets us break down our model into finer-grained estimates which are easier to estimate more narrowly and then we can let the simulation figure out all the ways those different estimates can combine. For instance, we might have reason to believe that the probability of a fire alarm is different in different periods. Or maybe we have reason to believe that the probability of a fire drill on any given day isn’t actually independent of prior days—maybe we want to account for the probability that if fire alarms happen frequently that students will be more likely to pull them, so each alarm increases the chance of future alarms.
To model the different probabilities per period we just have to change our simulation so each run consists of simulating each period of each day, rolling the dice to determine if there’s an alarm and then totaling up the total time spent as before. (This model would also allow the possibility of multiple alarms in one day, which the prior one was too simple account for.)
To model a changing probability of alarms based on the recent frequency of fire alarms, the simulation would need to maintain some state during each run so instead of modeling the chance of an alarm with a fixed probability, the model would keep track of a changing probability which up a tiny bit each day there is an alarm and goes down each day there isn’t one.
Modeling the probability of an alarm as a function of how many previous alarms have occurred is a simple example of dependencies within the model. More elaborate models can have more elaborate dependencies which will then require more elaborate tracking of state within each run of the simulation.
For example, a model used to predict how long a multi-person project is going to take, should probably account for the fact that the individual tasks within the project have dependencies on each other. If we ignore dependencies we could just estimate each of the individual tasks within the project and assume that if we have N people working on the project they can each work on a task until it is done and immediately start work on the next task and therefore the total amount of time spent is basically the sum of the (randomly generated) times for each task divided by the number of people. But of course in reality some tasks can’t start until other tasks are done and so the actual time taken will be longer, potentially much longer than that predicted by a model that ignores dependencies.
To build a model that accounts for dependencies we need to simulate not just how long each task takes but when work starts on it and thus when it is completed. Once we simulate that, we can simulate when each task starts based on when all the tasks it depends on have been complete. We can also model how many tasks can be worked on in parallel based on how many people are available.
A model that takes those factors into account will much more realistically model the effects of time lost due to some people having nothing to do because the tasks they would work on next can’t be started until some other task isn’t done yet.
In simulations like this, the simulation is still generating random values at the lowest levels—how long is this particular task going to take—but a lot of the simulation machinery is devoted to simulating what happens when.
One useful way to structure simulations like this is to model Work
as a thing that starts at a particular time and generates an end time. Simple instances of Work
that represent individual tasks just generate a random duration and add it to their start time to get their end time.
But Work
can also represent collections of Work
items that can occur either in parallel or in sequence. A ParallelWork
object contains sub-Work
items and the end time of the whole collection is the maximum end time of the sub items after starting them all at the start time of the ParallelWork
.
A SequentialWork
item, on the other hand, runs its sub-items in sequence, computing the end time of each of the items in the sequence by feeding in the end time of of the previous item as the start time of the next. The end time of the SequentialWork
is then the end time of the list item in the sequence.
Once we have some simple Work
classes and sequential and parallel Work
classes, we can model many projects as compositions of instances of those classes. For example, we can model several independent sequences of tasks where the sequences can run in parallel and then some tasks that can only be done after all the main sequences are complete.
Ultimately the thing that makes something a Monte Carlo simulation is just that we can run the simulation repeatedly and get different results due to randomness and then extract information from the distribution of results we get. Within the box of the simulation we can simulate the world with as much fidelity as we need. But we should also use the lowest fidelity model as we can get away with because higher fidelity models will require more computation and thus will take longer to produce useful results.