The NFL and Sample Sizes: It's Not Just the Salary Cap That Creates Parity

The following is a short brief from our professional client newsletter.

The Bottom Line: With the CBA renewal in the NFL approaching next year, it’s interesting to see how the NFL game schedule is structured in a way that makes the championship outcome less deterministic. The salary cap contributes to competitive balance, and that is often the most-cited reason for the NFL’s equality across teams… but I venture a hypothesis that it’s just as much, if not more so, the fact that the NFL only plays 16 games per season, so outcomes are more “random”. This is due to the small sample size. Together, the salary cap and the league’s small sample size of games create more parity in the sport, allowing teams to pass around the crown more frequently and creating more financial stability in the league.

When I tell people what I do (i.e., use numbers and information to make assessments and predictions), I often get questions about the NFL and who is going to win the Super Bowl. I’m not sure if this is because I’m from a football-obsessed and baseball-demoralized city (Pittsburgh), but I can’t help but wonder why my associates so often gravitate to football as the classic example of prediction in sport. You’d think baseball would come up more, given the attention paid to it ever since Michael Lewis wrote Moneyball.

Maybe it’s because Super Bowl champions are so hard to predict?

Of course, Vegas does it every year and is generally more right than wrong. But that’s all they have to be (right 51% of the time, assuming bets large and small are spread roughly evenly across the winning and losing sides). They also get to change their predictions as the season goes on. Admittedly, I don’t know what their success rate is (my hunch is that it’s higher than 51%). Since I’m not in the business of gambling, I haven’t taken the time to do a full analysis.

Still, my point about American football being hard to predict comes down to one simple fact about statistics: you want a good sample size to make good predictions. The football season doesn’t provide that. Several sample size limitations make prediction hard in football, but I’ll focus on just one here: a season of only 16 games makes it harder to identify which team is truly the best heading into the playoffs. Sixteen games allow even questionable teams to make the playoffs. Then, the 1-game playoff structure allows these questionable teams to pass through to become champions. Of course, the big caveat is that the likelihood that a questionable team makes it all the way and a great team doesn’t is still low… but it’s not as low as in other sports.

Sample size (which stats geeks shorthand as ‘n’, for the word ‘number’) is the number of observations you have to work with when drawing a conclusion. You may not realize this, but you draw conclusions from samples every day, instinctively and without a calculator, such as when deciding whether or not to see a movie because a certain director directed it. Your head is processing this query: “How good have this director’s past movies been?”

And scouts do this every day in their jobs. NHL scouts, for instance, observe a mid-ranked prospect in person about 10-15 times per year. Those 10-15 viewings are the sample from which they draw a year-end conclusion: “Does the player have what it takes?” A big caveat here is that scouts use more than just their in-person observations to make this assessment. They will, in their heads, combine data from game statistics, physical assessments, psych assessments, and other scouts’ opinions to ultimately derive their final, intuitive judgment. Altogether, they end up assessing a player based on a lot more than seeing the player only 10 times. (We’ve provided the NHL with some information on the optimal number of observations needed to assess a player, assuming you have this other information to combine with it; in a future post I may share this research.) Oh, and we’ve also found that the NHL Central Scouting group is good at making these intuitive judgments without the use of sophisticated statistics. It shows that for all of the brain’s cognitive liabilities, it is also a very powerful tool in the absence of calculators.

Now, back to sample size.

What is a good “N”?

This is a point of debate and frequently misunderstood by recent college graduates. The rule of thumb I hear most is “you want 35 observations and you’re good”. I sometimes hear another rule of thumb, “you want 10% of the total population” (e.g., if there are a million people and you want to know their ice cream flavor preferences, you need 10% of that, or 100,000, in order to make the judgment). Both are good rules of thumb but have their weaknesses. Statistically, you do not need 100,000 people answering a survey about their ice cream preferences in order to make an assessment about which flavors are most preferred (besides, we all know it’s chocolate anyway!). And likewise, if you only have a total population of 20 people, getting 35 observations is not possible in the first place.

Again, having a rule of thumb is important (we use them all of the time at EXACT), but ultimately, how large your sample size needs to be depends entirely on what you’re trying to evaluate and who you are doing the evaluation for. Typically, a larger sample size means a more costly evaluation study (it isn’t realistic, in money or time, for scouts to see a player play every single game!). You have to make a trade-off: how much statistical confidence do you need, and how large a margin of error is acceptable to you? Having 1 observation is better than 0 observations, but that 1 observation may not be representative of reality. Having 2 observations is a lot better than 1, but again, 2 might be too small a sample from which to estimate reality. If you’re making a life-or-death decision, you’re going to want as large a sample size as possible, aren’t you? But if you’re deciding which restaurant to try out, it probably isn’t worth waiting for the number of restaurant reviews on Yelp to go from 10 to 11.
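To put rough numbers on that trade-off, here is a minimal Python sketch of one common textbook approach: the sample size needed to estimate a proportion (say, the share of people who prefer chocolate) within a given margin of error, with a finite population correction. The function name, the 95% confidence level (z = 1.96), the 5-point margin of error, and the worst-case 50/50 split are all assumptions chosen for illustration, not a description of how any particular study should be run.

```python
import math


def required_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin.

    Assumptions (illustrative only): simple random sampling, z = 1.96
    for roughly 95% confidence, and p = 0.5 as the most conservative
    guess for the true proportion.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction shrinks n0 when the population is small.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


for population in (20, 1_000, 1_000_000):
    print(population, required_sample_size(population))
```

Under these assumptions, a population of a million needs only a few hundred responses, nowhere near 100,000, while a population of 20 needs essentially everyone. Neither rule of thumb captures both cases.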

In football, we’re talking about 16 games by the end of the season, with 12 out of 32 teams (37.5%) making the playoffs. It’s not uncommon to see teams with a 9-7 record make the playoffs. One more game and they might not have made it. One fewer game and they might not have, either. This happens occasionally in other sports, but it is very common in the NFL. It’s sort of a ritual towards the end of the regular season to see a bunch of “what-if” scenarios presented by the broadcasting team… e.g., “If the Jets win this game, and the Chargers lose their game, while the Steelers win their game, then the Broncos make the playoffs.” Keeping track of all of those scenarios can give fans more painful headaches than ice cream brain freeze. But it also makes for some excitement.
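A quick way to see why a 9-7 record says so little is to treat each game as an independent coin flip with a fixed win probability, which is a big simplification of any real sport. The Python sketch below computes the exact binomial chance that a below-average team still finishes with a winning record over 16 games versus 162. The helper name is hypothetical and the true win probability of .450 is picked only for illustration.

```python
from math import comb


def prob_winning_record(games, win_prob):
    """Chance of winning more than half of `games`.

    Assumes independent games with a constant win probability, which is
    an illustrative simplification, not a model of real schedules.
    """
    min_wins = games // 2 + 1  # e.g. 9 of 16, or 82 of 162
    return sum(
        comb(games, wins) * win_prob ** wins * (1 - win_prob) ** (games - wins)
        for wins in range(min_wins, games + 1)
    )


true_quality = 0.45  # hypothetical below-average team
short_season = prob_winning_record(16, true_quality)
long_season = prob_winning_record(162, true_quality)
print(f"16 games:  {short_season:.3f}")
print(f"162 games: {long_season:.3f}")
```

Under these assumptions, the mediocre team posts a winning record far more often in the short season than in the long one. That is the sample size effect in miniature.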

Once those teams are all squared away in the playoffs, the rules then change. This happens in all sports. Baseball and hockey mitigate the “risk” that a bad team makes it through to the finish by holding best-of-5 and best-of-7 series. Football holds 1-game playoffs. In a later article I’ll discuss how individual football game outcomes are more open to chance than you might think, too (mostly due to injury risks and the degree of position specialization). Taking all of this together, Super Bowl champions are more random than champions in other popular sports.
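The effect of series length can be sketched with the same kind of arithmetic. Assuming, purely for illustration, that the better team wins any single game with a fixed probability of .600 and that games are independent, this Python snippet computes its chance of winning a 1-game playoff, a best-of-5, and a best-of-7. The function name and the .600 figure are hypothetical.

```python
from math import comb


def series_win_prob(game_win_prob, best_of):
    """Chance of winning a best-of-`best_of` series (odd length).

    Equivalent to winning a majority of the games if all were played.
    Assumes independent games with a constant win probability.
    """
    needed = best_of // 2 + 1
    return sum(
        comb(best_of, wins)
        * game_win_prob ** wins
        * (1 - game_win_prob) ** (best_of - wins)
        for wins in range(needed, best_of + 1)
    )


better_team = 0.60  # hypothetical single-game edge
for length in (1, 5, 7):
    print(f"best-of-{length}: {series_win_prob(better_team, length):.3f}")
```

With these made-up inputs, the better team advances about 60% of the time in a single game, 68% in a best-of-5, and 71% in a best-of-7, so longer series give the weaker team fewer chances to slip through.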

A Fun Example

When hanging out with friends discussing baseball and my hometown team’s woes (the Pirates are on track to finish with a losing record yet again this year), I sometimes like to push some buttons by suggesting baseball should move to a 16-game season. Of course this isn’t practical on any level; it’s more just to get people thinking about this situation. Over the course of a 162-game season, the Pirates’ deficiencies are clearly exposed. There is enough time to really assess the quality of the team. And they play each opponent multiple times (at least 3 times). If they make the playoffs, it’s unlikely to be because of chance.

Looking at the seasons from 1996 through 2009, I evaluated how the first 16 games went for the Pirates. My assessment is not terribly scientific… this is just a quick ‘back of the envelope’ snapshot. The biggest failing of this quick analysis is that within those first 16 games, they play only a sample of about 5-6 teams. Furthermore, unlike in football, the outcome of any single game is a lot less meaningful (e.g., if the Pirates start out with a 2-0 record, the local newspapers are not likely to start saying the team is destined for the World Series, and the team is not likely to start making any major investments to get there).

But here is the analysis. In parentheses I put the team’s final win percentage for the season:

1996: 8-8 (.451)

1997: 8-8 (.488)

1998: 7-9 (.426)

1999: 8-8 (.484)

2000: 6-10 (.426)

2001: 6-10 (.383)

2002: 11-5 (.447)

2003: 8-8 (.463)

2004: 7-9 (.447)

2005: 5-11 (.414)

2006: 5-11 (.414)

2007: 6-10 (.420)

2008: 7-9 (.414)

2009: 9-7 (.385)

You’re probably looking at this and saying to yourself that the Pirates started off poorly in pretty much every season. This is close to true. As with most things, there is some nuance to this analysis. For starters, I’m not claiming that 16 games tell us nothing about team quality. They do tell us something. I just would have less confidence in a 16-game season as a predictor of quality than I would in a 162-game season. So take the Pirates:

In 2 of these years, the Pirates had winning records (9-7 and 11-5). In football, those records are often good enough to make it into the playoffs.
In 6 of these years, the Pirates had at least a .500 record.

I’d venture a guess that the Pirates’ decision making would have changed greatly if we had a 16-game season. For example, if the team had ended 1996 and 1997 with a .500 record, it’s easy to hypothesize that the Pirates would have invested more money into their organization to push them over the edge in the following years. Who’s to say, then, that the Pirates wouldn’t have been a contending organization in 1998?

Before any stats heads out there criticize this analysis, I want to disclaim that again, this was just a back of the envelope calculation. Indeed, a Pearson correlation shows a moderate relationship between the record over the first 16 games and the final season record (r = .38), but with only 14 seasons in the sample, that correlation is not statistically significant (p = .18). Like I said, there is some nuance here: will 16 games be somewhat predictive? Probably. But does a 16-game season give bad teams an opportunity to make it to the playoffs? Yes. The Pirates would have been playoff contenders in 2002, when they in fact ended 17 games under .500 and in 4th place in their division. To resolve some of these issues, EXACT will undertake a more complete analysis in a future study that evaluates more teams and more seasons. This was only for example’s sake.
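For readers who want to check the correlation, the Python sketch below recomputes it from the season list above using only the standard library. It pairs each season’s wins in the first 16 games with the final win percentage. The variable names are mine, and the t statistic is the standard test for a Pearson correlation, which may not be exactly how the original figures were produced.

```python
from math import sqrt

# (wins in first 16 games, final season win percentage), 1996 through 2009
seasons = [
    (8, .451), (8, .488), (7, .426), (8, .484), (6, .426), (6, .383), (11, .447),
    (8, .463), (7, .447), (5, .414), (5, .414), (6, .420), (7, .414), (9, .385),
]

n = len(seasons)
mean_x = sum(x for x, _ in seasons) / n
mean_y = sum(y for _, y in seasons) / n

# Sums of squares and cross-products around the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in seasons)
sxx = sum((x - mean_x) ** 2 for x, _ in seasons)
syy = sum((y - mean_y) ** 2 for _, y in seasons)

r = sxy / sqrt(sxx * syy)
# t statistic for testing r against zero, with n - 2 degrees of freedom.
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)

print(f"n = {n}, r = {r:.2f}, t = {t:.2f}")
# prints: n = 14, r = 0.38, t = 1.44
```

A t of about 1.44 on 12 degrees of freedom falls short of the roughly 2.18 needed for significance at the two-sided 5% level, which is consistent with the p-value reported above.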

Lessons for NFL Fans, Teams and Broadcasters

Don’t over-react. Sometimes we see coaches fired or players dropped after just 1 or 2 games in football. That’s unlikely to happen in baseball. The same goes for broadcasters and gamblers who are making their bold predictions. So what if a team has won 2 games in a row, or lost 2 games in a row? It doesn’t mean that much in assessing team quality.

NFL fans are fortunate in that they have a league with decent parity compared to the other major sports. It isn’t just the CBA—it’s also the league structure. If the NFL and players union come to terms that reduce the competitive balance of teams via a higher salary cap, the NFL will still be in a good position to have parity. It might reduce competitive balance but it will probably not ruin competitive balance. That’s good news for both players and owners.

Future Studies

This was just a simple overview of sample sizes and the NFL. As happens any time we run some simple analyses and form some hypotheses, I find more questions to ask. Here are a couple more questions I have about all of this and how I would go about answering them:

How predictive are the first 16 games of an MLB season in determining the final outcome? For this I would need to increase the number of teams I look at, covering all divisions of baseball and more seasons. I’d also investigate adjusting for the fact that teams play each other in 3-game series: instead of taking simple win-loss records after 16 games, I would consider taking the win-loss records of the first 16 series played (the downside is that this would account for about 48 games of performance).

It would be interesting to assess parity in the NFL by looking at salary expenditure and win-loss record. I’d want to do a multivariate analysis to see how important each variable is in predicting outcomes. I’d also want to evaluate a pre-CBA and a post-CBA league. The goal of this analysis is to really see what impact the salary cap has on parity versus the 16-game season. It isn’t entirely fair for me to say that if the salary cap went away there would still be some parity in the NFL. Without a real hard analysis to support it, that statement is really only a hypothesis. The salary cap (and salary minimum) is probably very important to maintaining parity, and I’d like to measure that.