R – AS COISAS

I do not care for football (soccer). So what do you do to avoid being totally alienated from the surrounding conversations in a World Cup year? Even more so when the Cup is happening in your home country?

For me, it involves running an R script to assess each team's chances and who will be playing where. It all began as an exercise to show the economics intern how to build a Monte Carlo simulation in R, the software environment for statistical computing and graphics.

The script has two main parts: (i) computing the probability of victory for each possible pairing of teams, say Brazil and Germany; (ii) using those probabilities to run simulated World Cups, game by game, randomly drawing winners and moving on to the next game according to the previous random results.
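The second part can be sketched in a few lines of R. This is a minimal illustration, not the script linked at the end of the post: the function name `simulate_bracket`, the four placeholder teams and the matrix `p` are assumptions made for the example, and it only covers a single-elimination bracket.

```r
# Minimal sketch: simulate single-elimination brackets from a matrix of
# head-to-head win probabilities. Names and numbers are illustrative only.

# p[i, j] = assumed probability that team i beats team j
teams <- c("A", "B", "C", "D")
p <- matrix(0.5, nrow = 4, ncol = 4, dimnames = list(teams, teams))

simulate_bracket <- function(bracket, p) {
  # bracket: team names in bracket order; length must be a power of two
  while (length(bracket) > 1) {
    first  <- bracket[seq(1, length(bracket), by = 2)]
    second <- bracket[seq(2, length(bracket), by = 2)]
    # One uniform draw per game decides whether the first-listed team wins
    first_wins <- runif(length(first)) < p[cbind(first, second)]
    bracket <- ifelse(first_wins, first, second)
  }
  bracket  # the champion of this simulated cup
}

set.seed(1)
champions <- replicate(10000, simulate_bracket(teams, p))
round(table(champions) / length(champions), 3)
```

With every entry of `p` set to 0.5, each team should win roughly a quarter of the simulated cups; plugging in real pairing probabilities changes only the matrix, not the loop.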

I ran one million of those simulated World Cups. That is most likely well above what is needed to get stable estimates for most purposes, but the task is not computationally intensive and runs quickly.
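One rough way to see how much precision a given number of runs buys is the binomial standard error of an estimated probability, sqrt(p(1 - p)/n). The helper name `mc_standard_error` and the 10% example value are assumptions for illustration.

```r
# Monte Carlo standard error of a probability estimated from n independent runs
mc_standard_error <- function(p_hat, n) sqrt(p_hat * (1 - p_hat) / n)

n_runs <- c(1e3, 1e4, 1e5, 1e6)
se <- mc_standard_error(0.10, n_runs)  # 0.10 is just an example probability
data.frame(n_runs = n_runs, std_error = signif(se, 2))
```

For a probability near 10%, one million runs give a standard error of about 0.0003, or 0.03 percentage points.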

The simulation of games is quite neat and gives interesting results. For instance, if you run a batch where there is only one favorite team and all the others have equal chances in head-to-head matches, the teams that cross the favorite's path to victory are penalized and end up with the worst chances of winning the cup.
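The one-favorite experiment is easy to reproduce on a toy bracket. The sketch below assumes eight hypothetical teams, a favorite `T1` that beats anyone 80% of the time, and coin flips among the rest; none of these numbers come from the original script.

```r
# Sketch of the one-favorite experiment on an assumed 8-team knockout bracket
teams <- paste0("T", 1:8)   # T1 is the hypothetical favorite
p <- matrix(0.5, 8, 8, dimnames = list(teams, teams))
p["T1", ] <- 0.8            # assumed: the favorite wins 80% of its games
p[, "T1"] <- 0.2            # so every opponent beats it 20% of the time

simulate_bracket <- function(bracket, p) {
  while (length(bracket) > 1) {
    first  <- bracket[seq(1, length(bracket), by = 2)]
    second <- bracket[seq(2, length(bracket), by = 2)]
    first_wins <- runif(length(first)) < p[cbind(first, second)]
    bracket <- ifelse(first_wins, first, second)
  }
  bracket
}

set.seed(2014)
n_runs <- 50000
champions <- replicate(n_runs, simulate_bracket(teams, p))
# Share of simulated cups won by each team
share <- sapply(teams, function(tm) mean(champions == tm))
round(share, 3)
```

Under these assumed numbers the exact values can be worked out by hand: `T2`, which meets the favorite in the first game, wins 0.2 × 0.5 × 0.5 = 5% of the time; `T3` and `T4`, which may meet it in the second round, win 6.5%; and the teams in the other half of the bracket win about 7.7% each.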

The other neat thing about the simulation is that you get game-by-game odds, not only of who wins each match but also of who will probably play in it. Of course, this comes from the fact that the teams in a game are the winners of matches from the previous round. So, if you have a ticket for a quarterfinal game, you can check which match you will probably watch.
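Match-up odds fall out of the same loop if each simulated cup also records who played each game. The following is a sketch under assumed inputs: the four teams, their strengths `s`, and the game labels such as `R2G1` (round 2, game 1) are all made up for the example.

```r
# Sketch: record who plays each game, then tabulate the most likely pairings
s <- c(A = 4, B = 1, C = 2, D = 2)             # hypothetical team strengths
p <- outer(s, s, function(a, b) a / (a + b))   # p[i, j]: i beats j
dimnames(p) <- list(names(s), names(s))

simulate_with_matchups <- function(bracket, p) {
  games <- character(0)
  round_no <- 1
  while (length(bracket) > 1) {
    first  <- bracket[seq(1, length(bracket), by = 2)]
    second <- bracket[seq(2, length(bracket), by = 2)]
    # Label each game by round and slot and store the pairing that occurred
    labels <- paste0("R", round_no, "G", seq_along(first))
    games[labels] <- paste(first, "x", second)
    first_wins <- runif(length(first)) < p[cbind(first, second)]
    bracket <- ifelse(first_wins, first, second)
    round_no <- round_no + 1
  }
  games
}

set.seed(7)
n_runs <- 10000
# One column per simulated cup, one row per game label
sims <- replicate(n_runs, simulate_with_matchups(names(s), p))
# Estimated probability of each possible pairing in the final (R2G1)
final_odds <- sort(table(sims["R2G1", ]) / n_runs, decreasing = TRUE)
round(final_odds, 3)
```

A ticket holder would look up the row for their game and read off the most frequent pairing.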

The other part of the script requires figuring out probabilities for each match. This is where the exercise is least solid and where there is the most room for improvement. Nonetheless, we can still say that the end results are at least plausible, as we will see.

These probabilities are based on a list of points earned by each country in World Cups. For each match, the probability of victory for a team is its share of the sum of points of the two teams in that particular match. So for Brazil vs. France, Brazil has 216 points and France has 86 points, so Brazil's chances are $216 / (216 + 86)$, or about 72%. Of course, there are much more complex and better models. Notice that no attempt is made to model the possibility of ties.
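The rule is a one-liner in R. The function name `win_probability` is an assumption for this sketch; the two point totals are the ones quoted above.

```r
# Share-of-points rule: probability that team A beats team B
win_probability <- function(points_a, points_b) {
  points_a / (points_a + points_b)
}

brazil <- 216
france <- 86
win_probability(brazil, france)   # Brazil's chance against France
win_probability(france, brazil)   # France's chance; the two sum to 1
```

Because the two probabilities always sum to one, the rule leaves no room for a draw, which is why ties are not modeled.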

So what kind of things do we see in the results? For example, we see that Portugal has about a 10% chance of reaching the semifinals. This is consistent with a friend's assessment of Portugal as having "some chance" of doing so.

Now, if this model allows me to mimic the opinions of a football fan, then I would say it has accomplished its goals 😀

But what most people will probably be interested in is who will win the Cup. So, to get this out of the way, this is the table:

We can follow this information step by step, figuring out the distribution of victories among countries. That is what I try to show in the next chart, from the round of 16 (oitavas) until the final:

In this chart we follow the distribution of victories among countries as the final match approaches.

Another interesting thing to do with the model is to re-run or re-query it after each game, fixing whatever has already happened. This works up to the final, where the answer that we will get is the one given by our simple match winner model.
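One way to fix what has already happened is to pass the known winners into the simulation and let them override the random draw. This is only a sketch of the idea: the `known` argument, the game labels and the four equal-strength teams are assumptions, and the function does not check that a known winner actually played in that game.

```r
# Sketch: re-run the simulation with some results already decided
teams <- c("A", "B", "C", "D")
p <- matrix(0.5, 4, 4, dimnames = list(teams, teams))

simulate_given <- function(bracket, p, known = character(0)) {
  # known: named character vector, e.g. c(R1G1 = "B") for "B won game R1G1"
  round_no <- 1
  while (length(bracket) > 1) {
    first  <- bracket[seq(1, length(bracket), by = 2)]
    second <- bracket[seq(2, length(bracket), by = 2)]
    labels <- paste0("R", round_no, "G", seq_along(first))
    winners <- ifelse(runif(length(first)) < p[cbind(first, second)],
                      first, second)
    # Overwrite the random draw wherever the real result is already known
    played <- labels %in% names(known)
    winners[played] <- known[labels[played]]
    bracket <- unname(winners)
    round_no <- round_no + 1
  }
  bracket
}

set.seed(3)
n_runs <- 10000
# Suppose B has already beaten A in the first game
champions <- replicate(n_runs, simulate_given(teams, p, known = c(R1G1 = "B")))
round(table(champions) / n_runs, 3)
```

Once every game but the final is fixed, the only random draw left is the final itself, so the output simply reproduces the head-to-head probability for that pairing.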

Sometime in the future, hopefully before the World Cup, I might post odds for each game.

You can download the file with the winners for all the 1 million simulations here. The R script is here.

I thank Andre Luchine, Beto Boullosa, Charles Queiroz, Fernando Varejão, Marcio Eduardo Bezerra and Neca Boullosa for their consulting on the inner workings of the World Cup.