Monte Carlo basics

The basic idea behind Monte Carlo simulations is that if we generate a bunch of randomly simulated outcomes of the thing we are trying to model, we can measure various things about those outcomes. For instance, if each outcome is a single numeric value, we can measure the range in which 90% of those outcomes fall, giving us a 90% prediction interval on the outcomes. Or we can measure the probability of various outcomes, such as the probability of an outcome above a certain value. When the simulated outcomes are more complex, such as in a model of a presidential election, we can measure the probability of more elaborately defined outcomes, such as the probability that the incumbent wins the election while losing their home state.

The fundamental building blocks of Monte Carlo simulations are random numbers, generated according to different distributions. A distribution describes the relative likelihood of different numbers being generated and is perhaps most easily visualized as a histogram of the generated values. Some of the basic distributions we’ll use in our Monte Carlo simulations are the uniform, normal, and log normal distributions.

Uniform distribution

The simplest random distribution is the uniform distribution which generates numbers between some minimum and maximum, all with equal, a.k.a. uniform, probability. Its histogram is a horizontal line.

In Java we can generate uniformly distributed (pseudo)random numbers via the nextDouble method on java.util.Random. It generates double values on the half-open interval [0, 1) with all nine quadrillion possible values equally likely. Or we can use nextInt(n) to generate uniformly distributed ints on the half-open interval [0, n).

The uniform distribution is useful for modeling discrete events such as a coin flip or a die roll where all outcomes are equally likely.
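A minimal sketch of both methods, using a die roll as the discrete example (the class name is our own):

```java
import java.util.Random;

public class UniformDemo {
    public static void main(String[] args) {
        Random r = new Random();

        // A double uniformly distributed on the half-open interval [0, 1).
        double d = r.nextDouble();

        // A die roll: nextInt(6) is uniform on [0, 6), so add 1 to get 1..6.
        int roll = r.nextInt(6) + 1;

        System.out.println("uniform double: " + d + ", die roll: " + roll);
    }
}
```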

Normal distribution

The normal distribution is the classic bell curve and is useful for modeling many natural processes such as the heights of people, scores on exams, etc.

The normal distribution is parameterized by its mean and standard deviation. Even though some bell curves are tall and skinny (i.e. they have a smaller standard deviation) and others are shorter and wider (larger standard deviation), the proportion of values under any interval of a bell curve is always related to the curve’s standard deviation in the same way. Notably, about 68% of the values generated from a normal distribution will fall within one standard deviation of the mean. We can call this interval, from one standard deviation below the mean to one standard deviation above the mean, a “68% prediction interval” in that it predicts where 68% of the generated values will fall.

If we want a prediction interval that captures more of the generated values, say 90% rather than 68%, we need to widen the interval. The width required is measured in standard deviations and called the “critical z value”. Computing the critical z value for a given probability is somewhat involved, but for a 90% interval it is 1.645, meaning 90% of the values generated from a normal distribution will fall within 1.645 standard deviations of the mean.

In other words, if low and high are the bounds of a 90% prediction interval on a normal distribution with mean mean and standard deviation stddev:

  low  = mean - 1.645 * stddev
  high = mean + 1.645 * stddev

Thus if we know low and high we can solve for mean and stddev and if we know mean and stddev we can solve for low and high.
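Both directions of that algebra can be captured in a pair of small helpers (the class name, method names, and hard-coded z constant are our own choices):

```java
public class Intervals {
    static final double Z90 = 1.645; // critical z value for a 90% interval

    // From mean and stddev to the bounds of the 90% prediction interval.
    static double[] bounds(double mean, double stddev) {
        return new double[] { mean - Z90 * stddev, mean + Z90 * stddev };
    }

    // From the bounds of a 90% prediction interval back to mean and stddev.
    static double[] meanAndStddev(double low, double high) {
        return new double[] { (low + high) / 2, (high - low) / (2 * Z90) };
    }

    public static void main(String[] args) {
        double[] b = bounds(10, 2);
        System.out.println("90% interval for mean 10, stddev 2: "
                           + b[0] + " to " + b[1]);
    }
}
```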

The one thing to be aware of is that the tails of a normal distribution go to infinity in both directions, so while a normal distribution with a mean well above zero and a standard deviation much smaller than the mean will mostly generate positive numbers, extreme outliers could be negative.

Log normal distribution

The log normal distribution produces values whose natural logs are normally distributed. Since logarithms aren’t defined for negative numbers (without going into complex numbers), the values of a log normal distribution are all positive. So it’s a bit like a normal distribution but with the bell rising quickly from zero and then a long tail out to the right. This makes it good for modeling things like task durations that can’t take less than zero time but which can, unfortunately, sometimes take quite a bit longer than average.

Generating log normally distributed random numbers is easy if you have a source of normally distributed random numbers: generate normally distributed numbers and raise e to the power of those values. The log normal distribution is typically parameterized by the mean and standard deviation of the underlying normal distribution. However, if we want to start with a 90% prediction interval on the log normally distributed values—for instance if we have estimated, with 90% confidence, that some task will take between 3 and 15 days—then we need to first translate the endpoints of the interval into the corresponding endpoints of the underlying normal distribution by taking their natural logs.

With these values we can then determine the mean and standard deviation of the underlying normal distribution based on those endpoints, generate normally distributed numbers, and raise e to those values to get log normally distributed numbers.
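A sketch of such a generator, parameterized directly by a 90% interval on the generated values (the class name, field names, and hard-coded z value are our own choices):

```java
import java.util.Random;

public class LogNormal {
    static final double Z90 = 1.645; // critical z value for a 90% interval

    final double mu;    // mean of the underlying normal distribution
    final double sigma; // stddev of the underlying normal distribution
    final Random random;

    LogNormal(double low, double high, Random random) {
        // Translate the interval endpoints into the underlying normal
        // distribution by taking their natural logs, then solve for the
        // mean and standard deviation as with any 90% interval.
        double logLow = Math.log(low);
        double logHigh = Math.log(high);
        this.mu = (logLow + logHigh) / 2;
        this.sigma = (logHigh - logLow) / (2 * Z90);
        this.random = random;
    }

    double next() {
        // Generate from the underlying normal and exponentiate.
        return Math.exp(mu + sigma * random.nextGaussian());
    }

    public static void main(String[] args) {
        LogNormal days = new LogNormal(3, 15, new Random());
        System.out.println("simulated task duration: " + days.next());
    }
}
```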

Confidence intervals vs prediction intervals

While the term “confidence interval” is more commonly used (probably because most introductory materials on statistics are focused on analyzing sample data), in Monte Carlo simulations we are technically dealing with “prediction intervals”. A “confidence interval” is an interval, i.e. a range of values, that we derive from a sample of data from a larger population such that we can be, say, 90% confident that the interval captures the true value in the population, such as the population mean. A “prediction interval” on the other hand, is an interval that describes the range we believe, with some level of probability, that a future value will fall in. However, the difference between the two is pretty subtle so don’t worry about it too much.

Measured prediction intervals

The foundations of a given Monte Carlo simulation are prediction intervals that we create by estimating values that are simple enough for us to estimate, for instance how long a single task will take. Our estimates take the form of 90% prediction intervals, i.e. ranges that we think have a 90% chance of capturing the true values we are trying to estimate. The Monte Carlo simulation then runs by generating random values according to those estimates and combining them according to the model to produce composite outcomes.

One way to use a Monte Carlo simulation is to generate outcomes that are simply a single number: the time it will take to complete a project, how much money we will spend on something, etc.

For instance, the time to complete a project is the sum of the durations of all the steps in the project, each of which will take a random amount of time based on our estimated prediction intervals. And the cost of textbooks needed for a class is the product of two random numbers: the number of students who enroll and the best price we can get on the textbook.

Once we have the ability to generate simulated outcomes, we can measure the range in which 90% of those outcomes fall; to the extent we trust our simulation, we can treat this as a 90% prediction interval on those outcomes. Obviously, there are many ways the simulation can be wrong: our estimates of the ranges of possible values of the various subparts may be overconfident; the distributions we choose for the subparts may not match reality well; or we may have left out aspects that will, in reality, affect the real outcome. But those are not software problems.
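The project-duration example above can be sketched end to end: derive a mean and standard deviation from each task’s 90% interval, sum one random duration per task per trial, then read the measured interval off the sorted outcomes. All the estimates, the trial count, and the names here are made-up illustrations:

```java
import java.util.Arrays;
import java.util.Random;

public class ProjectSim {
    static final double Z90 = 1.645;

    // Sum one normally distributed duration per task estimate, many times,
    // and return the bounds capturing the middle 90% of the outcomes.
    static double[] simulate(double[][] estimates, int trials, Random r) {
        double[] outcomes = new double[trials];
        for (int t = 0; t < trials; t++) {
            double total = 0;
            for (double[] e : estimates) {
                double mean = (e[0] + e[1]) / 2;
                double stddev = (e[1] - e[0]) / (2 * Z90);
                total += mean + stddev * r.nextGaussian();
            }
            outcomes[t] = total;
        }
        Arrays.sort(outcomes);
        return new double[] { outcomes[(int) (trials * 0.05)],
                              outcomes[(int) (trials * 0.95)] };
    }

    public static void main(String[] args) {
        // Hypothetical 90% intervals for three task durations, in days.
        double[][] estimates = { {2, 6}, {1, 3}, {4, 10} };
        double[] interval = simulate(estimates, 100_000, new Random());
        System.out.println("90% interval: " + interval[0] + " to " + interval[1]);
    }
}
```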

Testing the basic machinery

To make sure our basic machinery is working, we should start by running a trivial Monte Carlo simulation of a single normally distributed random value with a known mean and standard deviation. That way we can check that the measured interval capturing 90% of the generated values matches the 90% prediction interval we can compute from the mean, the standard deviation, and the critical z value of 1.645. Once that machinery works, we can have some faith that it will also work for more complex simulations where the whole point is that we don’t know in advance what the prediction interval should be.

Step 1: Standard normal distribution

  1. Compute the expected 90% prediction interval for what’s called the “standard normal distribution”—a normal distribution with a mean of 0 and standard deviation of 1—using the equations above. (Hint: it’s -1.645 to 1.645.)

  2. Generate random values with java.util.Random's nextGaussian method, which returns random values from the standard normal distribution. Collect them all in a double[] or ArrayList<Double>.

  3. Sort the collection of generated random numbers and find the values 5% from the start and end of the collection: these are the bounds of the measured 90% prediction interval from the generated values.

  4. Confirm that the prediction interval computed in step 1 matches the measured prediction interval from step 3.
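The four steps above can be sketched as follows; the trial count is our own choice, and a real check would need some tolerance for simulation noise:

```java
import java.util.Arrays;
import java.util.Random;

public class StandardNormalCheck {
    public static void main(String[] args) {
        int n = 1_000_000;
        double[] values = new double[n];
        Random r = new Random();
        for (int i = 0; i < n; i++) {
            values[i] = r.nextGaussian(); // standard normal: mean 0, stddev 1
        }
        Arrays.sort(values);
        // The values 5% from the start and end of the sorted collection
        // bound the measured 90% prediction interval.
        double low = values[(int) (n * 0.05)];
        double high = values[(int) (n * 0.95)];
        System.out.printf("measured: %.3f to %.3f (expected -1.645 to 1.645)%n",
                          low, high);
    }
}
```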

Step 2: Another normal distribution

Do the same thing, but this time pick an arbitrary mean and standard deviation. Scale the randomly generated numbers you get from nextGaussian by multiplying by your new standard deviation, then shift them by adding the new mean, to get normally distributed numbers with that mean and standard deviation.

  1. Compute the expected 90% prediction interval.

  2. Collect randomly generated values.

  3. Sort the generated values and find the 90% prediction interval.

  4. Confirm that the prediction interval computed in step 1 matches the measured prediction interval from step 3.
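The scale-and-shift is a one-liner inside the same harness as before (the mean and standard deviation here are arbitrary picks):

```java
import java.util.Arrays;
import java.util.Random;

public class ScaledNormalCheck {
    public static void main(String[] args) {
        double mean = 100, stddev = 15; // arbitrary parameters
        int n = 1_000_000;
        double[] values = new double[n];
        Random r = new Random();
        for (int i = 0; i < n; i++) {
            // Scale by the new stddev, shift by the new mean.
            values[i] = mean + stddev * r.nextGaussian();
        }
        Arrays.sort(values);
        double low = values[(int) (n * 0.05)];
        double high = values[(int) (n * 0.95)];
        // Expected: 100 plus or minus 1.645 * 15, i.e. about 75.3 to 124.7.
        System.out.printf("measured: %.2f to %.2f%n", low, high);
    }
}
```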

Step 3: Log normal distribution

Once you’ve got the basic machinery working, you can add code to generate log normally distributed random numbers based on a 90% prediction interval and make sure that the measured interval matches what you put in.

  1. Pick a 90% interval.

  2. Collect randomly generated values, using your log normal generator.

  3. Sort the generated values and find the 90% prediction interval.

  4. Confirm that the prediction interval from step 1 matches the measured prediction interval from step 3.

Seeing the benefits

So far we’ve just been reproducing, via a Monte Carlo simulation, prediction intervals that we can derive mathematically from the various random distributions. However, once we start combining values from our estimated prediction intervals, it gets harder to analyze what the expected distribution of outcomes is. As a next simple step toward seeing how this works, we can start generating random outcomes by combining random values.

Step 1: Make normal random generators with different parameters

  1. Make a set of normally distributed random number generators, i.e. some number of normal distributions with different means and standard deviations.

  2. Collect randomly generated values that are the sum of values generated from each of these generators.

  3. Sort the generated values and measure the 90% prediction interval.

Compare the measured prediction interval to the interval you get by adding all the lower bounds and all the upper bounds of the underlying generators. You should find that your measured prediction interval is narrower than that interval. See how that is affected by the number of generators in the set. (You can also compare your measured interval to the interval you get by adding the means and taking the square root of the sum of the squares of the standard deviations; that should actually match.)
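A sketch of steps 1–3 plus both comparisons, with a hypothetical set of three generators (all parameters made up for illustration):

```java
import java.util.Arrays;
import java.util.Random;

public class SumOfNormals {
    static final double Z90 = 1.645;

    public static void main(String[] args) {
        // Hypothetical generators as {mean, stddev} pairs.
        double[][] params = { {10, 2}, {20, 3}, {5, 1} };
        int n = 1_000_000;
        Random r = new Random();
        double[] sums = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (double[] p : params) sum += p[0] + p[1] * r.nextGaussian();
            sums[i] = sum;
        }
        Arrays.sort(sums);
        double measuredLow = sums[(int) (n * 0.05)];
        double measuredHigh = sums[(int) (n * 0.95)];

        // Naive interval: add the individual 90% bounds.
        double naiveLow = 0, naiveHigh = 0;
        // Analytic interval: means add; variances (squared stddevs) add.
        double meanSum = 0, varSum = 0;
        for (double[] p : params) {
            naiveLow += p[0] - Z90 * p[1];
            naiveHigh += p[0] + Z90 * p[1];
            meanSum += p[0];
            varSum += p[1] * p[1];
        }
        double stddevSum = Math.sqrt(varSum);
        System.out.printf("measured: %.2f to %.2f%n", measuredLow, measuredHigh);
        System.out.printf("naive:    %.2f to %.2f%n", naiveLow, naiveHigh);
        System.out.printf("analytic: %.2f to %.2f%n",
                          meanSum - Z90 * stddevSum, meanSum + Z90 * stddevSum);
    }
}
```

The measured interval should be noticeably narrower than the naive one and should line up with the analytic one.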

Step 2: Make log normal random generators with different parameters

  1. Make a set of log normally distributed random number generators, i.e. some number of log normal distributions with different parameters.

  2. Collect randomly generated values that are the sum of values generated from each of these generators.

  3. Sort the generated values and measure the 90% prediction interval.

Same as before, compare the measured prediction interval to the interval you get by adding all the lower bounds and all the upper bounds of the underlying generators. You should find that your measured prediction interval is narrower than that interval. See how that is affected by the number of generators in the set. Can you figure out a way to compute this interval without simulation?

Step 3: Make a more complex set of random generators

  1. Make a larger set of random number generators with different distributions and different parameters.

  2. Collect randomly generated values that are some more complex combination of the values generated from each of these generators.

  3. Sort the generated values and measure the 90% prediction interval.

Now try applying the same mathematical combination to all the bounds of the original prediction intervals and compare your measured prediction interval to what you get.
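As one concrete instance of such a combination, here is the textbook-cost model from earlier: enrollment modeled as a normal distribution, price as a log normal one, combined by multiplication. All the numbers and the class name are made up for illustration:

```java
import java.util.Arrays;
import java.util.Random;

public class TextbookCost {
    static final double Z90 = 1.645;

    public static void main(String[] args) {
        Random r = new Random();
        int n = 1_000_000;
        double[] costs = new double[n];
        // Hypothetical 90% intervals: 20-40 students, $50-$120 per book.
        double meanStudents = 30, stddevStudents = 10 / Z90;
        double muPrice = (Math.log(50) + Math.log(120)) / 2;
        double sigmaPrice = (Math.log(120) - Math.log(50)) / (2 * Z90);
        for (int i = 0; i < n; i++) {
            double students = meanStudents + stddevStudents * r.nextGaussian();
            double price = Math.exp(muPrice + sigmaPrice * r.nextGaussian());
            costs[i] = students * price; // the model multiplies the two
        }
        Arrays.sort(costs);
        System.out.printf("90%% interval on total cost: %.0f to %.0f%n",
                          costs[(int) (n * 0.05)], costs[(int) (n * 0.95)]);
    }
}
```

Multiplying the original bounds directly gives 20 × $50 = $1000 and 40 × $120 = $4800; the measured interval should again come out narrower than that.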