Advancing Analytics
Data Science | AI | DataOps | Engineering

Monte Carlo Models in Python

We were talking to a customer about the types of machine learning and experiments they were running. One of their experiments was a Monte Carlo model, and most of the people in the meeting had never heard of Monte Carlo simulations before. I LOVE Monte Carlo models. They have so many applications. I wanted to jot down a little bit about Monte Carlo models as a reference for you to come back to. I will do some really basic probability solving with a Monte Carlo simulation in Python.

Monte Carlo simulations (MCS) enable the investigation of stochastic, probabilistic problems (Alexa: define stochastic. A process which is random and non-deterministic). MCS were originally postulated by the Polish-American scientist Stanislaw Ulam in 1947. The story goes that Ulam was playing solitaire while recovering from surgery (Canfield solitaire, a particular type of solitaire in which, once the cards are drawn, you either win or lose) and was contemplating what the probability of winning was once a hand was drawn. How many games would he need to play before he won?

In 1947, Ulam was working on the first general-purpose computer (ENIAC) and thought this type of problem was well suited to general-purpose computing. John von Neumann saw the benefit of this process and applied it to the diffusion of neutrons. The name "Monte Carlo" was inspired by Ulam's uncle, who liked to gamble (in Monte Carlo). The first paper on MCS was published in 1949.

MCS is a process which "learns about a system by random sampling". Based on this description, an MCS can be used to simulate many different problems; probability is a simple example. Understanding probability and how to solve probability problems can be hard, and some problems are beyond simple calculation. Writing a model which solves them through repetition and random sampling is easy. As an example, suppose the birth ratio of boys to girls is 51:49. Based on this ratio, what is the probability of having two children who are both girls? In Python, using a Monte Carlo simulation, solving this is trivial.

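import random

# Build a population that reflects the 51:49 boy-to-girl ratio.
babies = ['g'] * 49 + ['b'] * 51

answer = 0
simulations = 1000000

for i in range(simulations):
    # Draw two children with replacement: each birth is independent,
    # so the first child must not change the odds for the second.
    children = random.choices(babies, k=2)
    if children[0] == 'g' and children[1] == 'g':
        answer += 1

# Fraction of simulated families in which both children are girls.
result = answer / simulations
print(result)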

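The answer above is approximately 24%. It is only approximate because MCS is based on random sampling, so every run gives a slightly different result. If the two births are independent, the exact answer is 0.49 × 0.49 = 0.2401. An MCS needs to be run a significant number of times to reduce the sampling error.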
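To put a rough number on that sampling error, here is a minimal sketch. It assumes the two births are independent and uses the standard error of an estimated proportion, sqrt(p(1 - p) / n). The names `exact` and `standard_error` are my own and are not part of the simulation above.

```python
import math

# Assumption: births are independent, so P(two girls) = 0.49 * 0.49.
p_girl = 0.49
exact = p_girl ** 2  # about 0.2401


def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)


# How much a single run is expected to wobble at different simulation counts.
for n in (10, 100, 10_000, 1_000_000):
    print(f"{n:>9,} simulations: standard error ~ {standard_error(exact, n):.5f}")
```

The error shrinks with the square root of the number of simulations, so 100 times more simulations buys roughly 10 times less noise.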

I am creating a list and populating it with 49 girls (g) and 51 boys (b), then using the random package to sample two values from that list. Once the list has been sampled, I compare the two values to see if they are both girls. If they are, I increment the answer by 1. Then I loop and do the same thing again.

If I change the number of simulations above to 10, the probability is 0.6 on my first run and 0.1 on my subsequent run. Increasing the simulations to 100, the result begins to converge on the answer. Increasing the number of simulations increases the accuracy by reducing the sampling error. In the image below you can see that there are significant outliers at 0.4 and 0.1; however, the results appear to follow a normal distribution whose mean tends towards the answer.

[Image: distribution of results across repeated simulation runs]
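If you want to see this convergence for yourself, the sketch below repeats the whole experiment many times at several simulation counts and reports the mean and spread of the results. It is an illustration and not the code that produced the image; the function names, the fixed seed and the choice of 200 repeats are my own assumptions.

```python
import random
import statistics


def estimate_two_girls(simulations, rng):
    """One Monte Carlo run: fraction of simulated families with two girls."""
    hits = 0
    for _ in range(simulations):
        first_is_girl = rng.random() < 0.49
        second_is_girl = rng.random() < 0.49
        if first_is_girl and second_is_girl:
            hits += 1
    return hits / simulations


def spread_of_estimates(simulations, repeats, seed=0):
    """Repeat the run several times; return the mean and standard deviation."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = [estimate_two_girls(simulations, rng) for _ in range(repeats)]
    return statistics.mean(estimates), statistics.stdev(estimates)


for n in (10, 100, 1000, 10000):
    mean, sd = spread_of_estimates(n, repeats=200)
    print(f"{n:>6} simulations: mean={mean:.4f}, stdev={sd:.4f}")
```

With only 10 simulations per run the individual results are scattered widely, while at 10,000 they cluster tightly around the answer.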

Monte Carlo simulations are a fantastic method for understanding how complicated interaction models behave. In Python/R they are simple to create and extend. MCS can be quickly extended with a series of rules. If you know the interaction between events, you can begin to model complex scenarios. Maybe you're looking to understand whether someone will click through your website to the unsubscribe button. If you know all the interactions up to that point, you can work out the probabilities and begin to model that.
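As a rough sketch of that unsubscribe idea, the example below chains a few steps together, each with its own probability. Every step name and probability here is hypothetical and made up purely for illustration; in practice you would replace them with rates measured from your own data.

```python
import random

# Hypothetical steps and probabilities: illustrative values, not real data.
FUNNEL = [
    ("opens the email", 0.30),
    ("clicks through to the website", 0.20),
    ("visits account settings", 0.10),
    ("clicks unsubscribe", 0.50),
]


def reaches_unsubscribe(rng):
    """Simulate one visitor; True only if every step in the funnel happens."""
    # all() stops at the first failed step, like a visitor dropping out.
    return all(rng.random() < probability for _, probability in FUNNEL)


def simulate(simulations, seed=0):
    """Estimate the probability that a visitor ends up unsubscribing."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = sum(reaches_unsubscribe(rng) for _ in range(simulations))
    return hits / simulations


estimate = simulate(200_000)
print(f"Estimated probability of unsubscribing: {estimate:.4f}")
```

Because these steps are independent, you could simply multiply the four probabilities here. The simulation starts to earn its keep once you add rules that are awkward to solve by hand, such as a step whose probability depends on what happened earlier.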

I will post a more complex scenario in a future post.