Wednesday, March 18, 2009

Picking the 2009 March Madness Basketball Brackets with Statistics

I don't know anything about college basketball (unlike President Obama). I don't mean that I don't know the main rules, or how many players there are, or what's allowed and not allowed. I mean that I don't know what teams are good or bad, which players are destined for the NBA, which coaches are the best or who won what game in 1985. I am not even sure of what teams are in what division so I always check. I suspect that this ignorance is exactly the correct approach to take when picking the game winners in the NCAA Division I playoffs, March Madness.

Instead of learning the teams and the players, I have explored the statistics of the past 24 years (data can be found here) (Last year's picks, Round of 64 and 32 upsets, Final Four and Championship probabilities) combined with others' specialized knowledge like the Sagarin ratings. This year I am updating some of my charts to include data from 2006, 2007 and 2008.

The pool I enter favors upsets. The points for each correct pick are the round multiplier times the seed of the winning team that you picked. Thus if a 10 seed wins Round 2 and you picked it, you get 2*10 = 20 points. To win this type of pool it is imperative that you pick upsets. Game results for 24 years of Round 1 are shown graphically below.
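The scoring rule can be written down in a few lines. This is only a sketch: it assumes the round multiplier is simply the round number (which matches the 2*10 example above), and the function names are mine, not part of any pool software.

```python
def pick_points(round_number: int, seed: int) -> int:
    """Points for one correct pick: round multiplier times the winner's seed.

    Assumption: the multiplier equals the round number (1, 2, 3, ...).
    """
    return round_number * seed


def bracket_points(correct_picks):
    """Total points for an iterable of (round_number, seed) correct picks."""
    return sum(pick_points(r, s) for r, s in correct_picks)


# A 10 seed correctly picked in Round 2 is worth 2 * 10 = 20 points,
# while a 1 seed correctly picked in the same round is worth only 2.
print(pick_points(2, 10))                          # 20
print(bracket_points([(1, 12), (2, 10), (2, 1)]))  # 12 + 20 + 2 = 34
```

Under this rule a correctly picked underdog is worth many times what the favorite would have been, which is why the rest of this post concentrates on where upsets have happened.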

Some quick points for Round 1, the round of 64:
  • No 16 seed team has ever won in the first round. Don't be the first to pick one.
  • 2 seeds are also very safe and normally win their games against 15 seeds.
  • History shows that there will be at least one, and in some years two, upsets favoring a 10, 11, or 12 seed.
  • One could make the case for one upset a year favoring a 13 or 14 seed as well.
  • 9 seeds win against 8 seeds more than half of the time. Pick two upsets.
Upset picking is important in Round 2, with 32 teams, as well. 24 years of Round of 32 matchups are shown below with expected and upset outcomes tabulated.

Because this round depends on the outcome of the first round, the number of opportunities is different for each matchup. In the most extreme case, no 16 seed team has ever beaten a 1 seed to advance to this round, so there is no data for that matchup. Only once has a 15 seed beaten a 2 seed and then played a 7 seed, thus there is only one occurrence on the chart.

Some lessons from the Round of 32 chart:
  • 1 seed teams typically win in this round as well, rarely being beaten by 8 or 9 seeds.
  • Matchups with 5 vs 4 seeds, 6 vs 3 seeds and even 10 vs 2 seeds and 12 vs 4 seeds (surprisingly) seem to be toss-ups over the 24 years of data. Almost half the time there is an upset and the lower seed wins. If you have these matchups in your bracket, pick the underdog in about half of them.
  • Matchups with 7 vs 2 seeds do result in upsets about a third of the time. Look for opportunities to pick one.
The results of this chart show what teams advanced to the Sweet Sixteen Round and should help to determine which upsets to pick according to past history.

Below is a matchup outcome chart for the Sweet Sixteen round which is similar to the earlier charts, but much more complicated.
As each round progresses there are more combinations of possible matchups, though most of them have never actually occurred in the history of the tournament in its modern form. No 16 seed has ever advanced, so those matchups are not represented. 15 seeds rarely advance, so many of those matchups also have no data.

Some lessons gleaned from the Round of 16 outcome chart:
  • 1 seeds usually win. They always beat 12 seeds that make it through.
  • The closer the distance between seeds the more the outcome is a tossup. This is true for all of these charts.
  • In the three times that 11 seeds have made it to this round they have beaten the 7 seed they played. Whether that is statistically significant or not is the question.
For the late rounds, rather than add combinatorial complexity, it is easier to compile the results of past years to see how likely it is that certain seeds reach the Sweet Sixteen, Elite Eight, Final Four and Championship Game, and finally win the championship. These frequency charts are easier to read than matchup charts at these rounds because the number of possible matchups grows large as the tournament progresses.

Sweet Sixteen frequency chart below
These frequencies are determined by who succeeds in the Round of 32 and are reflected in the Round of 32 outcomes chart above. Look at the lump for the 10, 11, and 12 seeds. In years where these teams move forward, knowing to pick them results in a large multiplicative effect on your score. Correctly picking #10 Davidson last year won me the pool.

Elite Eight seed frequency chart below.

Final Four seed frequency chart below.
Championship game seed frequency chart below.

Championship winner seed frequency chart below.

Some points for the Final Four, Championship game, and winner:
  • Every other year or so a 5, 6, 8, 10, or 11 seed makes it to the Elite Eight.
  • One 11 seed, three 8 seeds, three 6 seeds and four 5 seeds have appeared in the Final Four in 96 opportunities over 24 years. Choose these upsets sparingly, but if you get them right you might just win the pool.
  • In the Championship game, one 8 seed and two each of the 4, 5, and 6 seeds have made it that far. Use sparingly.
  • No team lower seeded than 8 has won the whole Championship. A 6, 8 and 4 seed have won it once each. The Final winner has been a 1 seed more than half of the time.
I have also taken the point totals for the past 24 years assuming a perfect sheet and plotted them.

I try to make sure that the potential points on my Playoff sheet add up to a reasonable number based on the past history of the tournament. The histogram below is a simple way to compare the past data to a current bracket selection.

It provides a way of ensuring that I haven't picked too many upsets, or worse, been too cautious and picked too few. Last year this method caused me to adjust my sheet to have more upsets and pick #10 Davidson to make it to the Elite Eight. I won the pool so handily that I was already uncatchable at that round.

All of this discussion and examination of the data indicates that upsets happen and are the key to winning the pool, but which upsets, and where? This is where we resort to the expertise of others. I use the Sagarin ratings (click on 2008-09 NCAA men's ratings by team), which are essentially a least squares ranking of all of the teams, based on all of the games that a team has played in the year. Sagarin suggests using the Predictor rating, rather than the overall rating, to predict the outcome of a game. Every year I match the teams to their ratings. The ratings represent the number of points a team is expected to score in a game, so the difference between two teams' ratings is the expected point difference in the game. Since there is some error in the ratings, I choose a value below which I will pick the lower seeded team to win (picking upsets) and generate my bracket.
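As a rough sketch of that rule for a single game: the team names and rating values below are made up for illustration and are not real Sagarin numbers, and `pick_winner` is my own name for the step.

```python
def pick_winner(favorite, underdog, factor):
    """Pick one game from two (name, seed, predictor) tuples.

    `favorite` is the better-seeded team (smaller seed number).
    If its Predictor edge over the underdog is below `factor`,
    the underdog is picked as an upset; otherwise the favorite wins.
    """
    edge = favorite[2] - underdog[2]
    return underdog if edge < factor else favorite


# Hypothetical ratings, for illustration only.
close_game = pick_winner(("Team A", 5, 84.1), ("Team B", 12, 82.5), factor=2.33)
mismatch = pick_winner(("Team C", 2, 90.0), ("Team D", 15, 75.0), factor=2.33)

print(close_game[0])  # Team B: an edge of about 1.6 is below 2.33, so pick the upset
print(mismatch[0])    # Team C: an edge of 15 is well above the factor
```

A larger factor turns more close games into upset picks, so the single number controls how aggressive the whole bracket is.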

This year I automated the process in Excel. If a Predictor difference fell below the chosen factor, I set the lower seeded team as the winner. Only for the Final Four does the model let the best team (higher Predictor score) win regardless of seed. A plot of the resulting expected points versus this factor shows some interesting cutoffs. Note also that the home advantage in the Sagarin ratings this year is 3.79, almost two baskets, so the factors listed below are not out of the question. Always assuming that close games default to the upset is unreasonable, but it is called for to maximize point possibilities for this particular bracket.
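My spreadsheet is not reproduced here, but the idea can be sketched in Python on a toy four-team field. Everything in this block is an assumption made for illustration: the ratings are invented, the round multiplier is taken to be the round number, and `rating_only_from_round` is a hypothetical parameter standing in for the Final Four exception.

```python
def pick(a, b, factor, upset_rule=True):
    """Pick a winner from two (name, seed, predictor) tuples."""
    if not upset_rule:
        # Late-round exception: higher Predictor wins regardless of seed.
        return a if a[2] >= b[2] else b
    fav, dog = (a, b) if a[1] <= b[1] else (b, a)
    return dog if fav[2] - dog[2] < factor else fav


def potential_points(field, factor, rating_only_from_round=None):
    """Play out a bracket (adjacent teams meet) and total round * seed."""
    total, round_number = 0, 1
    while len(field) > 1:
        use_upsets = (rating_only_from_round is None
                      or round_number < rating_only_from_round)
        winners = []
        for i in range(0, len(field), 2):
            w = pick(field[i], field[i + 1], factor, use_upsets)
            total += round_number * w[1]
            winners.append(w)
        field, round_number = winners, round_number + 1
    return total


# Invented ratings for a 1 v 4, 2 v 3 mini bracket.
FIELD = [("A", 1, 92.0), ("D", 4, 85.0), ("B", 2, 88.0), ("C", 3, 86.5)]

for factor in (0, 2, 6, 8):
    print(factor, potential_points(FIELD, factor))
# 0 5
# 2 6
# 6 10
# 8 15
```

Even in this tiny example the potential points move in steps as the factor crosses each rating gap, which is the same behavior the plot shows for the full bracket.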

In a similar manner to the inflection points from my earlier football simulations, certain values for the factor make the potential points jump between values as losing teams win and winning teams lose at certain rounds, only to be swept away at higher rounds. This leads to a high sensitivity of the final potential points to small changes in the game spread factor. An earlier plot shows that the median (50%) value for the potential points was 631 and that 90% of the years had total pool points of less than 796. With this in mind I set the factor to 2.33, just below the first step change from 665 to 845, and then I examined the pool for reasonableness according to the statistics shown above. One caution with this model is that it might allow improbable events, like too low a seed making it through to a high round, so I used it merely to push myself toward the limits on upsets.
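The reasonableness check on the total can be expressed as a small comparison against the two historical figures quoted above. The labels and the function name are my own shorthand, not part of the original spreadsheet.

```python
MEDIAN_POINTS = 631  # median perfect-sheet total over the 24 years (from the text)
P90_POINTS = 796     # 90% of years had totals below this (from the text)


def classify_potential(points):
    """Place a bracket's potential points against the historical totals."""
    if points < MEDIAN_POINTS:
        return "cautious"    # fewer points than a typical year produced
    if points <= P90_POINTS:
        return "plausible"   # between the median and the 90% mark
    return "aggressive"      # more points than 90% of past years


print(classify_potential(665))  # plausible
print(classify_potential(845))  # aggressive
```

This is why the factor was set just below the step from 665 to 845: the lower value sits above the median but inside the range that most past years have reached.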

All that being said, be aware that on any day, any given team can beat any other, thus the format of March Madness is given to upsets and surprises and picking a bracket is still as much luck as skill. These models are an attempt to quantify this uncertainty and use it to drive bracket picks that will take advantage of luck, upsets and surprises when they occur.


Anonymous said...

Can we see your bracket?

Anonymous said...
This comment has been removed by a blog administrator.
Richard said...

Sorry no gratuitous bad language on the blog. I imagine that the second anonymous commenter was expressing disbelief at the amount of work that went into the post and thus I will take the comment as a compliment.

Bex said...

This post is one of the reasons I love my brother so much...this just proves that being smart is COOL! And I am completely serious.

Keep up the stats, they really are fun. Granted, being of completely average intelligence, it's still enjoyable for me...even if I do have to re-read some paragraphs (hee, hee)