Phil Birnbaum, April 6, 2011
firstname.lastname@example.org , www.philbirnbaum.com
A couple of weeks ago, I played in the "Pinburgh" pinball tournament. It had a great format, which I liked a lot, very different from most other events. Instead of having your scores ranked against all other competitors, you play a series of matches, and you wind up with a W-L record. That's a great way of doing it ... you get a much better idea of how you rank against everyone else. (There were only 90 games; as a baseball fan, I kind of wish there had been 162, for a better intuitive idea of how good you were. Every baseball fan knows what 99-63 means, or 61-101.)
Actually, it's not quite that pure, because the rules are stacked so you wind up playing more games against people at your level. The tournament was designed by a man named Bowen Kerins, who created the pinball ranking system (similar to the chess system), and, as far as I can tell, is probably the most expert pinball sabermetrician in existence.
Let me give you a condensed summary of how it works (the full rules are here).
The first day, you play five sessions. Each session, you are matched against three opponents. Each of you plays a game of pinball. At the end of the game, your score is compared to the other three, and you get a W or a L. So if you beat all your opponents, you go 3-0. If you're the second highest, you go 2-1. And so on. It's the same as a "3-2-1-0" scoring system. (In normal match pinball play, and the playoffs, they use 4-2-1-0. But then you can't have a W-L record.)
You repeat this for two other games. At the end of the session, you've faced each opponent three times. So you wind up having gone somewhere between 9-0 and 0-9.
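To make the session scoring concrete, here is a minimal Python sketch. The function name `session_records` and the sample scores are invented for illustration, and the sketch assumes no two players tie within a game.

```python
def session_records(games):
    """Return a (wins, losses) pair per player for one four-player session.

    `games` is a list of games; each game is a list of scores, one per
    player, always in the same player order. Assumes no tied scores.
    """
    n_players = len(games[0])
    wins = [0] * n_players
    for scores in games:
        for i, mine in enumerate(scores):
            # One win for every opponent whose score you beat in this game.
            wins[i] += sum(1 for j, theirs in enumerate(scores)
                           if j != i and mine > theirs)
    total = (n_players - 1) * len(games)  # 9 results in a full session
    return [(w, total - w) for w in wins]


# Hypothetical scores for three games among players 0..3.
example = [
    [5_200_000, 1_100_000, 3_300_000, 900_000],
    [800_000, 2_500_000, 1_900_000, 400_000],
    [7_000_000, 6_000_000, 1_000_000, 2_000_000],
]
print(session_records(example))  # [(7, 2), (6, 3), (4, 5), (1, 8)]
```

The four records always sum to 18 wins and 18 losses, since every game hands out exactly six wins and six losses.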
As I said, there are five sessions. In subsequent sessions, you are matched with people closer and closer to you in the standings. The second session is against players from your half of the standings. The third is from players from your quarter of the standings. By the fifth session, you are playing three people directly adjacent to you in the standings.
After the five sessions the first day, you're somewhere between 0-45 and 45-0. As it happened, the top player was 35-10, and the bottom player was 10-34. (Some players had fewer than 45 games because the number of players wasn't a multiple of four, which meant some sessions had 8 games instead of 9.)
At that point, there was a cut, like in golf. Only the top third of the 173 competitors continued on to be able to play for the top prize. Those were division "A". The middle third was group B, and the bottom third was group C. (Some players were not allowed to be in B or C, because they were historically too good; those stayed in A even if they wouldn't have qualified otherwise.)
The second day proceeded like the first day, but players competed against only those in their division. In addition, players in group C were reset to 0-0 after Day 1.
That means that at the end of Day 2, players in A and B had 90 games counted in the standings, and players in C had 45 games. The top 17 in each group went on to the playoff rounds for their division.
The format, I think, worked very well. It ensured that every one of the lesser B and C players had a legitimate shot at winning their division in Day 2, and kept the scores fairly tight by giving the worse players slightly lesser competition.
Anyway, I decided to try to simulate the tournament. Part of the reason was that I wanted to have a better idea of how I did. I had gone 21-24 the first day, barely making B, and then 27-18 the second day, for a final record of 48-42. That was good enough to get me into the B playoffs, where I flamed out in the first round.
But, really, how good was that? It's hard to say. It's above .500, of course, but I played lesser opponents than the A people. And the guy who finished first in C ... was he actually better than me, or worse? He was probably better; since I barely made B, he was probably only a game or two behind, but then he proved his talent by going 31-14 against C-level opponents.
So, here's what I did. I created 180 random competitors, and gave each one a talent rating from a normal distribution. Then I created a simulation of pinball scoring (which I won't explain here) that takes the talent into account. I created an initial seeding that was imperfectly correlated to talent, and decided that the top 15 seeds were the ones not allowed to drop to B or C.
I tweaked the simulation until certain results -- the distribution of session scores, and the distribution of final W-L records -- were close to the actual tournament outcomes. (Details in the appendix.)
Then I started pulling random numbers and ran the tournament.
My rules differed slightly from the real ones in a few ways. First, I had 180 competitors instead of 173. Second, I included players who were restricted to A division, but I didn't include any players who were restricted to *A or B* division (although there were some of those in real life). Third, my groupings were slightly different from the real ones, because I didn't feel like programming groups of three instead of four. Fourth, I assumed the pool of talent was normally distributed, which is probably not true. Finally, I broke all ties with a single tiebreaker game (the real tournament did that only for important ties, and used seeding for the rest).
Still, I don't think these discrepancies should make a huge difference.
So, here are some of the results, after running a few thousand simulated tournaments.
First, as you might expect, there's a fair amount of luck in only 45 games, especially when those 45 games aren't independent (one bad game leads to three losses, not just one). Suppose you're the 10th best player in the tournament (in terms of actual talent -- in my simulation, I get to be God and know everyone's actual talent, instead of having to try to estimate it from previous results). In that case, you should expect to easily make A division, right? For one thing, you might be one of the 15 seeds who gets to stay in A no matter what. For another thing, there are 60 people who make A, and you're better than at least 49 of them.
So what are your chances of making A? Only about 83 percent. You have a 15 percent chance of winding up in B, and a 2 percent chance of being relegated to C.
That's a lot of choking. On average, one of the top 27 players will wind up in C, just by having bad luck the first day. (It's impossible to tell whether that happened in real life, but there was one player who went 20-24, and would have gone to C had he not been restricted to A.)
What's the opposite of choking ... clutching? There's even more of that. Of the bottom 18 players, you'd expect one of them to wind up in A, just by luck. The C-to-A migration is bigger than the A-to-C migration, because some players aren't allowed to play in C.
Here's a breakdown of what your chances are of making the various divisions, based on your relative talent. Percentages may not add to 100% due to rounding:
Rank 001: 100% A
Rank 002: 99% A, 1% B
Rank 005: 95% A, 5% B, 1% C
Rank 010: 83% A, 15% B, 2% C
Rank 020: 67% A, 27% B, 6% C
Rank 060: 40% A, 41% B, 19% C
Rank 090: 27% A, 43% B, 30% C
Rank 120: 17% A, 40% B, 43% C
Rank 171: 5% A, 20% B, 75% C
Rank 180: 2% A, 7% B, 90% C
Of course, even if a mediocre player gets lucky and makes A, he's (she's) probably not going to be lucky enough to finish near the top of A to make the playoffs. Everyone has at least a 2% chance to make A, even the worst player (although that might not be true in real life, if the players aren't really normally distributed; the worst player might be someone who's never played before but was still willing to pay the $100 entry fee). But not everyone has a decent chance to make the top 17 in A, and thus the playoffs.
To have at least a 1 in 200 chance of making the A playoffs, you have to be no worse than the 124th best player out of 180. I rounded to the nearest percent, so I can't tell you just how slim the 180th player's chances are, but "extremely" is probably a good approximation.
However, the bottom players' chances of making the C playoffs are still pretty decent. Even the worst player has a 2% chance, and that 124th player has a 13% chance. The 124th player also has a 1% chance of making the A playoffs (actually, probably closer to half a percent -- I rounded), and a 6% chance of making the B playoffs. Since everyone who makes the playoffs wins a prize (at least their entry fee back), player 124 has an overall 1 in 5 chance of taking home some money.
Here are some more playoff odds:
Rank 001: 86% overall -- 86% A
Rank 002: 78% overall -- 77% A, 1% B
Rank 005: 64% overall -- 59% A, 4% B, 1% C
Rank 010: 58% overall -- 45% A, 11% B, 2% C
Rank 020: 51% overall -- 28% A, 18% B, 5% C
Rank 060: 33% overall -- 6% A, 16% B, 11% C
Rank 090: 26% overall -- 2% A, 11% B, 13% C
Rank 120: 20% overall -- 1% A, 6% B, 13% C
Rank 171: 7% overall -- 0% A, 1% B, 6% C
Rank 180: 2% overall -- 2% C
The average chance of making the playoffs has to be 51/180 (17 playoff spots in each of the three divisions), or about 28%. The tournament format seems pretty good about giving everyone a decent shot: even if you're dead average, 90th best, you still have a 26 percent shot at winning something.
Now, these findings are interesting in theory, but they don't help individual cases much: because, after all, nobody really knows their actual talent rank. Instead of converting talent to performance, it would be nice to convert performance to talent. That way, we can estimate how good we really are.
For instance: suppose you finished first in C division. How well does that compare to, say, the 10th place finisher in B division? If those two competitors were to play a match against each other, who should be favored to win?
As it turns out, the average first place finisher in C was the 48th most talented player overall (and presumably wound up in C by having awful luck in the first 45 games). The average 10th place finisher in B was the 66th best player overall. So, the C1 guy is probably a little better than the B10 guy.
What about the overall standings leader, the guy who finished number one in A? His average talent, surprisingly: 17. That is: on average, he's only the 17th best player at the tournament. That's not as low as it looks, actually. It's mostly a bunch of guys with single digit rankings, with an occasional larger number who got really lucky.
Here are those two, along with a few other results:
Don't take these too seriously: you improve the estimates if you use actual W-L records, rather than rankings. Here are those numbers.
78+ wins: rank 1.5 out of 180
74 to 75 wins: rank 2
71 to 73 wins: rank 3
40-5 in C division: rank 39
The C rankings are low even for great records; for instance, going 40-5, which is incredibly good, suggests you're still only 39th best out of 180. That's because if you're in C division, you had a poor record on the first day, and the simulation effectively combines that with the 40-5 when estimating how talented you really are.
Okay, now let's look at who gets into the top four -- that is, the final round of the playoffs. The way the playoffs work is this: first place in the standings gets a bye. The other 16 break into four sessions of four players each. They play three games, each game scored 4-2-1-0. The top 7 of 16 point-getters join the bye guy in the semi-finals. Those players break into two semi-final sessions, and the top 4 of 8 make the finals. Those four finalists play one session of three games, and are ranked by points in that session.
Here are the chances of being one of the four players who get to the finals, based on talent rank (which, again, is unknown in real life):
Player ranked 001: A 48%
Player ranked 002: A 35%
Player ranked 003: A 37%, B 1%
Player ranked 004: A 23%, B 1%
Player ranked 005: A 20%, B 2%
Player ranked 010: A 12%, B 5%, C 1%
Player ranked 020: A 6%, B 7%, C 2%
Player ranked 030: A 3%, B 5%, C 3%
Player ranked 040: A 2%, B 5%, C 3%
Player ranked 050: A 1%, B 4%, C 4%
The 50th ranked player has only a 15% chance of winding up in C division at all. But if he does, he has a 25% chance of making the finals (4% of all tournaments is 25% of 15% of the tournaments).
Player ranked 060: A 1%, B 3%, C 4%
Player ranked 070: A 0%, B 3%, C 3%
By rank 066, the probability of making the A division finals rounds to zero, which means it's less than 0.5% (1 in 200). By rank 70, players have a better chance of making it to the C finals than the B finals.
Player ranked 080: A 0%, B 2%, C 4%
Player ranked 090: A 0%, B 2%, C 3%
Player ranked 100: A 0%, B 1%, C 3%
Player ranked 110: A 0%, B 1%, C 3%
Player ranked 120: A 0%, B 1%, C 2%
Player ranked 130: A 0%, B 0%, C 2%
The 130th ranked player is about the limit for having a non-zero chance (after rounding) of making the B division finals.
Player ranked 140: C 2%
Player ranked 150: C 1%
Player ranked 160: C 1%
After the 168th player, the chance of making C finals drops below 1/200.
That's the finals. What about the grand prize, finishing first overall?
As it turns out, and as the chart below will show, you have to be in the top third of competitors to have an appreciable chance to win. 71 percent of the time, the ultimate winner is one of the ten best players. 98 percent of the time, the winner is in the top 60.
However, a long shot does occasionally come through. In 8,062 random tournaments, the lowest ranked winner was number 133 of 180. That only happened once. 129th also happened once, 121 happened once, and 125 happened two times.
Here are the full results, broken down into a denominator of 1,000 tournaments to make the numbers easier to understand.
Player ranked 001 won 199 times out of 1,000
Player ranked 002 won 119 times
Player ranked 003 won 92 times
Player ranked 004 won 70 times
Player ranked 005 won 58 times
Player ranked 006 won 42 times
Player ranked 007 won 38 times
Player ranked 008 won 37 times
Player ranked 009 won 29 times
Player ranked 010 won 25 times
Players ranked 1- 10 won 709 times (combined)
Players ranked 11- 20 won 157 times
Players ranked 21- 30 won 65 times
Players ranked 31- 40 won 32 times
Players ranked 41- 50 won 16 times
Players ranked 51- 60 won 8 times
Players ranked 61- 70 won 5 times
Players ranked 71- 80 won 4 times
Players ranked 81- 90 won 1 time
Players ranked 91-100 won 1 time
Players ranked 101-130 won 1 time
Players ranked 131-180 won 0 times
I hope those add up to about 1,000.
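A quick way to check that sum, using only the figures listed above (a small Python sketch; the variable names are just for illustration):

```python
# Wins per 1,000 simulated tournaments, copied from the lists above.
top_ten = [199, 119, 92, 70, 58, 42, 38, 37, 29, 25]
groups_after_ten = [157, 65, 32, 16, 8, 5, 4, 1, 1, 1, 0]

top_ten_total = sum(top_ten)
grand_total = top_ten_total + sum(groups_after_ten)
print(top_ten_total, grand_total)  # 709 999
```

The individual top-ten lines do add to the combined 709, and the grand total of 999 is within rounding of 1,000.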
And finally: money winnings. As it turns out, if you're in the middle of the pack, you should expect to get $50 of your $100 back in prizes. Here are the prizes won by players of various rankings (out of 180, God's-eye view of talent rank):
Player ranked 001 won $ 940
Player ranked 002 won $ 656
Player ranked 003 won $ 540
Player ranked 004 won $ 455
Player ranked 005 won $ 406
Player ranked 006 won $ 348
Player ranked 007 won $ 324
Player ranked 008 won $ 323
Player ranked 009 won $ 298
Player ranked 010 won $ 287
Player ranked 015 won $ 241
Player ranked 020 won $ 207
Player ranked 030 won $ 151
Player ranked 040 won $ 121
Player ranked 050 won $ 108
Player ranked 060 won $ 90
Player ranked 070 won $ 77
Player ranked 080 won $ 66
Player ranked 090 won $ 61
Player ranked 100 won $ 53
Player ranked 110 won $ 43
Player ranked 120 won $ 39
Player ranked 130 won $ 32
Player ranked 140 won $ 27
Player ranked 150 won $ 21
Player ranked 160 won $ 15
Player ranked 170 won $ 10
Player ranked 180 won $ 2
Roughly speaking, if you're in the top 1/3, expect to win your money back or more. If you're in the middle third, expect a little over half your money back. And if you're in the bottom third, in the long run you'll win back 1/4 of your registration fee.
A lot of these results depend on your ranking within the pool of 180 entrants. As I said, that's pretty much impossible to know for sure. You can get a very rough estimate by combining some of these results, but there'll be a fairly large confidence interval around it.
I'll use me as an example. I finished in B, with a 48-42 record. That was tied for 10th through 15th in the standings. In the playoffs, I was tied for 13/14 out of 17. Let's call it 14.
According to the simulation, 14th in B averaged 70th in overall talent. But that's 14th in B out of 180 competitors. I finished 14th in B out of only 173 competitors. That's easier, so let's reduce my ranking from 70 to 73.
If you get another estimate by looking at W-L record, 48-42 in B was worth 78th in overall talent.
So I'm probably somewhere between 73rd and 78th. Let's call it 75th.
Looking up players who were 75th suggests that:
-- I have a 33% chance of making it to A next year; 45% of B; and 24% of being relegated to C.
-- I have a 4% chance of making the A playoffs; 14% chance of making the B playoffs; and 12% chance of making the C playoffs.
-- So, the consolation is: if I *am* relegated to C, I have a 50-50 shot of getting to the playoffs (12% out of 24%).
-- I should expect to win $78.
-- Finally, my chances of winning the grand prize are only about .45 in 1,000, or 0.045%. That's because the 71-80 group won 4.5 in 1,000, and I'm 1/10 of that group. Effectively, I'm about a 2000:1 long shot.
However: I may not actually be 75th. I could conceivably be much higher, and had bad luck at the tournament, or I might be much lower, and had good luck. This is where you need more information.
But let's suppose I don't have any other information, because this was my first tournament. What might have happened is that I'm better than 70th, but had bad luck.
The standard deviation of wins in this tournament due to luck, in 90 games, is about 6. So, instead of 48-42, there's a 2.5 percent chance I'm actually 2 or more SD above that, which would put me at 60-30. If that were the case, I'd certainly have finished in A instead of B. The better competition would have brought me down somewhat: so let's call it 55-35.
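One rough way to see where a figure of about 6 can come from, under the simplifying assumption that all players are equally matched (the number above may well have been derived differently, for instance from the simulation itself): each game against three opponents produces 0, 1, 2, or 3 wins with equal probability, and the 90 results come from only 30 such games.

```python
import math

# Wins from one game against three equal opponents: 0, 1, 2 or 3, equally likely.
outcomes = [0, 1, 2, 3]
mean = sum(outcomes) / len(outcomes)
var_per_game = sum((x - mean) ** 2 for x in outcomes) / len(outcomes)  # 1.25

games = 90 // 3  # 90 win-loss results come from 30 four-player games
sd_grouped = math.sqrt(games * var_per_game)

# For comparison: 90 fully independent coin-flip results.
sd_independent = math.sqrt(90 * 0.5 * 0.5)

print(round(sd_grouped, 2), round(sd_independent, 2))  # 6.12 4.74
```

The gap between 6.12 and 4.74 is the cost of results arriving in bundles of three: one bad game is three losses, not one.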
If I were 55-35 in A, then, suddenly, I'm 28th overall, instead of 75th. That means I have a 6 in 1000 chance, instead of a 0.5 in 1000 chance.
Therefore, the 95% confidence interval for estimating my skill from this one tournament is very wide; it's centered on 75, but could be as high as 28 (and probably as low as 120ish).
The overall moral is that I'm still going to base my expectation on an estimate that I'm 75th best out of 180. But, that could be way off. For best accuracy, I should play a whole bunch of different tournaments, so that I can estimate my ranking more precisely. However, as it turns out, I've played in two other tournaments, and finished roughly the same in those as I did here. So I'll stick to the estimates above, for now.
Still, I'm hoping that I actually just had bad luck in all three tournaments, and that I'm actually much better than the records show. That's really my only decent hope for winning next year.
This appendix describes how realistic the simulation is, and how it gives roughly the same results as the actual Pinburgh 2011 tournament.
In real life, there were 1,620 player-sessions. There were 10 sessions for 173 players, for a total of 1,730, but 110 of them were 8 games instead of 9, so I ignored those.
If all players were equal, how many 9-0 sessions would you expect? Well, for every four-player session, there’s a 1 in 16 chance that one of the four will go 9-0. That’s because, in the first game, one of the players must go 3-0. The chance that the same player goes 3-0 twice more is 1 in 4 squared, which is 1 in 16.
So, 1 in 16 per four-player session works out to 1 in 64 per player-session. 1,620 divided by 64 equals 25.3. So you’d expect 25.3 cases of a player going 9-0 – and, by the same logic, 25.3 cases of a player going 0-9.
What about 8-1? Well, there are three ways a player can win eight games: 3/3/2, 3/2/3, and 2/3/3. So you’d expect three times as many 8-1 as 9-0. That works out to about 76 out of 1,620.
I could repeat the calculation for all scores, but it was easier just to run a simulation. Out of 1,620 times, you’d expect:
9-0: 25 times
8-1: 76 times
7-2: 153 times
6-3: 252 times
5-4: 305 times
4-5: 305 times
3-6: 252 times
2-7: 153 times
1-8: 76 times
0-9: 25 times
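For comparison, the equal-talent expectation can also be worked out exactly by enumerating all 64 combinations of three games. This is a sketch, not the simulation used for the list above; like the 9-0 calculation, it treats each game as giving 0 to 3 wins with equal probability, independently of the other games.

```python
from collections import Counter
from itertools import product

PLAYER_SESSIONS = 1620

# Each of the three games gives a player 0-3 wins, equally likely when
# everyone is equally skilled; count how many combinations give each total.
ways = Counter(sum(combo) for combo in product(range(4), repeat=3))

expected = {wins: PLAYER_SESSIONS * count / 64 for wins, count in ways.items()}
for wins in range(9, -1, -1):
    print(f"{wins}-{9 - wins}: {ways[wins]:2d}/64 -> {expected[wins]:.1f}")
```

The counts out of 64 are 1, 3, 6, 10, 12 for 9-0 through 5-4 (mirrored for the losing records), giving exact expectations of about 25.3, 75.9, 151.9, 253.1 and 303.75, within a count or two of the simulated figures.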
Now, that’s the theoretical statistical distribution when all the players are equal. In real life, of course, some are much better than others. And so you’d expect more extreme results, like 9s, 8s, and 7s, and fewer 5-4s and 4-5s.
That happened. Here are the real results, then the simulated ones:
9-0: 25 real, 25 simulated
8-1: 87 real, 76 simulated
7-2: 170 real, 153 simulated
6-3: 239 real, 252 simulated
5-4: 277 real, 305 simulated
4-5: 294 real, 305 simulated
3-6: 261 real, 252 simulated
2-7: 152 real, 152 simulated
1-8: 88 real, 75 simulated
0-9: 26 real, 25 simulated
As expected, the real results are more extreme than the simulated results, with a few exceptions that are probably because of luck.
BTW, you can find a full record of the real results here.
Let me quickly figure out the standard deviation of talent, using a method shown by Tom Tango a few years ago.
The standard deviation of the “real” is 2.000. The SD of the simulated was 1.936. By the equation
SD^2(talent) = SD^2(actual) – SD^2(theoretical)
… we get that the SD of talent equals almost exactly 0.5. That means that if the average player would go 4.5-4.5 in a typical session, the extremely talented players, 2 SDs from the mean, would go 5.5-3.5, for a winning percentage of .611, or 55-35 over a 90-game tournament.
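The arithmetic behind that step, as a short Python sketch using the two SDs quoted above (the variable names are illustrative only):

```python
import math

sd_actual = 2.000       # SD of wins per session in the real results
sd_theoretical = 1.936  # SD if every player were equally skilled

# Variance of talent is what is left after removing the luck variance.
var_talent = sd_actual ** 2 - sd_theoretical ** 2
sd_talent = math.sqrt(var_talent)

# A player two talent SDs above the 4.5-win average session.
wins_per_session = 4.5 + 2 * sd_talent
win_pct = wins_per_session / 9
wins_in_90 = win_pct * 90

print(round(sd_talent, 3), round(win_pct, 3), round(wins_in_90, 1))
```

This gives an SD of talent of about 0.50 wins per session, a winning percentage of about .61, and about 55 wins in 90 games.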
OK, so, as expected, the real results are different from the simulation, because the simulation had everyone with the same talent. So next, I made the players differ in talent. I made the distribution normal, trying different standard deviations and rerunning the simulation, until I found one that seemed to fit the real life data the best.
But I’d never be able to get a perfect fit. Why? Because the real results aren’t “right” – they’re not properly extreme everywhere. Specifically, the 9-0 and 0-9 numbers are significantly smaller than they should be. In fact, they’re almost exactly at the ‘every player is equal’ mark, when they should be higher. That’s probably just because of luck, considering the 8-1 and 1-8 numbers look OK.
So we’re not going to get a perfect fit. Here’s the fit I finally settled on:
9-0: 25 real, 32 simulated
8-1: 87 real, 84 simulated
7-2: 170 real, 155 simulated
6-3: 239 real, 245 simulated
5-4: 277 real, 289 simulated
4-5: 294 real, 290 simulated
3-6: 261 real, 249 simulated
2-7: 152 real, 158 simulated
1-8: 88 real, 84 simulated
0-9: 26 real, 29 simulated
It seems like a reasonable fit, especially considering that we have strong reason to suspect the real-life data to be a bit off at the extremes.
Now, let’s check to make sure the overall standings seemed to come out OK. I looked at a bunch of results, to compare real to simulated. (Full results online here.)
Top record in A: Real, 62-28. Simulated, 65-25.
Top record in B: Real, 54-36. Simulated, 57-33.
Top record in C: Real, 31-14. Simulated, 33-16.
Minimum playoff record in A: Real, 53-37. Simulated, 53-37.
Minimum playoff record in B: Real, 47-43. Simulated, 48-42.
Minimum playoff record in C: Real, 25-20. Simulated, 25-20.
Pretty good, except that the top records are always higher in the simulation than in real life. I think that’s because real life didn’t have enough extreme 9-0 and 0-9 records. I think if you added in a few more 9-0s and 0-9s, the real life winners would have had a couple more wins.
Or, it could just be that there are more good players than a normal distribution would predict. I suspect that might be the case, at least partially, considering that the best players in the world are much more likely to come to Pittsburgh for this event.
However, despite the small discrepancies, I think the simulation does a reasonable job of coming close to the real-life numbers, and is close enough that we can trust most of the results.
One last thing to explain: how the games were simulated.
Here’s how it worked. For this simulation, every player was given a talent level, which was the sum of (a) 50, and (b) 10 independent uniform variables between 0 and 4. That means the overall talent level was approximately normally distributed with mean of 70 and variance that … I didn’t actually calculate.
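For what it's worth, the variance follows from a standard result: a uniform variable on [0, 4] has mean 2 and variance 4²/12, and variances of independent variables add. The Python sketch below works that out and checks it against random draws; the helper name `draw_talent`, the seed, and the sample size are arbitrary choices for illustration.

```python
import random
import statistics


def draw_talent(rng):
    # 50 plus ten independent uniform draws on [0, 4], as described above.
    return 50 + sum(rng.uniform(0, 4) for _ in range(10))


# Analytic values: a uniform on [0, 4] has mean 2 and variance 4**2 / 12.
mean_theory = 50 + 10 * 2
var_theory = 10 * (4 ** 2) / 12
sd_theory = var_theory ** 0.5

rng = random.Random(2011)  # arbitrary seed, only for repeatability
sample = [draw_talent(rng) for _ in range(100_000)]

print(mean_theory, round(var_theory, 2), round(sd_theory, 2))  # 70 13.33 3.65
print(round(statistics.mean(sample), 2), round(statistics.pvariance(sample), 2))
```

So the talent level has a variance of about 13.3, or an SD of about 3.65 around the mean of 70.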
Then, to simulate a pinball game I simulated repeated “shots” for each player until he lost three balls. There were three kinds of shots: good shots, OK shots, and lose-the-ball shots. Good shots scored between 9 and 10 points (9 plus two random uniform variables between 0 and 1). OK shots scored between 0 and 1 points (two random uniform variables between 0 and 1). Lose-the-ball shots scored 0, plus loss of ball.
For a player with talent X, the probabilities for each of the three shots were:
Good shot: X%
OK shot: 90% of non-good shots (that is, 0.9 * (100 – X)%)
Lose-the-ball shot: 10% of non-good shots (that is, 0.1 * (100-X)%).
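The shot model above can be sketched in Python roughly as follows. This is an illustrative reconstruction, not the original program: the function names are invented, and since the description doesn't say how the two uniform variables are combined, the sketch assumes they are averaged, which keeps the scores inside the stated 9-to-10 and 0-to-1 ranges.

```python
import random


def simulate_game(talent, rng, balls=3):
    """Score one game for a player whose good-shot chance is `talent` percent."""
    p_good = talent / 100
    p_ok = 0.9 * (1 - p_good)  # the remaining 0.1 * (1 - p_good) loses the ball
    score = 0.0
    while balls > 0:
        roll = rng.random()
        # ASSUMPTION: the "two uniform variables" are averaged, which keeps
        # good shots in 9-10 points and OK shots in 0-1 points.
        bonus = (rng.random() + rng.random()) / 2
        if roll < p_good:
            score += 9 + bonus
        elif roll < p_good + p_ok:
            score += bonus
        else:
            balls -= 1  # lose-the-ball shot: no points
    return score


def win_rate(talent_a, talent_b, trials=2_000, seed=1):
    """Estimate how often player A outscores player B head to head."""
    rng = random.Random(seed)
    wins = sum(simulate_game(talent_a, rng) > simulate_game(talent_b, rng)
               for _ in range(trials))
    return wins / trials


print(win_rate(74, 66))  # a stronger player against a weaker one
```

Calling `win_rate` with two equal talents should come out near 0.5, which is a handy sanity check before tuning the talent spread.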
It’s a pretty crude simulation of a pinball game, but it seemed to work OK. It actually doesn’t matter what the internal details are of simulating a game, so long as players of a certain skill beat opponents of a certain skill with the right probability. And the final results of the simulation suggest they did. If the good players beat the worse players too often, there would have been too many 9-0 and 8-1 sessions. If the good players beat the worse players not often enough, there would have been too many 5-4 and 4-5 sessions.
And, of course, the scores don’t map to real scores. Real-life scores aren’t linear. In real life, the first 20 good shots might net you one million points, while the second 20 good shots might net you ten million points. However, since all that matters is who beats whom, the actual scores don’t matter. In older-style tournaments (for instance), where your standing was based on the sum of your scores, that wouldn’t be the case.