The Road to WAR (for hockey), Part 1: The Single-Number Dream

Note: This was originally posted at acthomas.ca. Further updates will be posted to this entry.

I’m extremely heartened by a new-found appreciation for statistical methods in hockey, by teams and fans alike, and I’ve somehow been gifted with a small amount of time to collect a number of thoughts on the current state of the field.1 One of the biggest questions along the way is how best to summarize a player’s accomplishments in one number.

The blogger community has focused largely on descriptive counting measures and fractions that have individual predictive power, while being less clear than I’d like about what those actual predictions are.2 The statistical community, myself included, has put out plenty of work on how the game might flow, but we haven’t done nearly as well at explaining our methods to a larger audience, and we have too often assumed that the tools are already there. I’ve tried to address that with nhlscrapr, to make the NHL RTSS data easier to process, and with hextally, to display shot data with respect to league averages, but there’s more I can share that will help to bring the quicker-moving world of bloggers closer to tool-builders like me.

Over this and the next sequence of posts, I’ll lay out my vision for how these pieces can fit together in creating a general formula for Wins Above Replacement in hockey, a single currency to compare all parts of the game, with the best level of data we have available to us at any time. I have a few guidelines I want to follow:

1) This system should be forward-looking; that is, no new information intrinsic to the system should affect our estimates from the past. I want this to be based on a predictive idea so that past performance is indicative of the (immediate) future; my only exception to this would be if we learned of bias in the data which needed to be corrected after the fact.

This limits our use of standard regression tools to disentangle the impact of different factors, if it means that we have to process the entire data set all at once, mixing past and future through matrix inversion. Ain’t gonna happen that way.
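To make the forward-looking rule concrete, here is a minimal sketch in Python. It assumes a hypothetical chronological list of (goals for, goals against) pairs for one player and a made-up exponential decay weight; neither is part of the system described in this series. The only point is that each estimate is built from games already played, so later games never revise it.

```python
def rolling_estimates(games, decay=0.9):
    """Return one goal-differential estimate per game, each using only earlier games.

    games: list of (goals_for, goals_against) tuples in chronological order.
    decay: hypothetical weight applied to older games (an assumption for illustration).
    """
    estimates = []
    weighted_diff = 0.0   # decayed sum of past goal differentials
    weight_total = 0.0    # decayed count of past games
    for goals_for, goals_against in games:
        # Record the estimate BEFORE seeing this game: past data only.
        estimates.append(weighted_diff / weight_total if weight_total else 0.0)
        # Then fold the new game in for use by future estimates.
        weighted_diff = decay * weighted_diff + (goals_for - goals_against)
        weight_total = decay * weight_total + 1.0
    return estimates


# Made-up results, purely for illustration.
season = [(1, 0), (0, 2), (2, 1), (0, 0)]
full = rolling_estimates(season)
partial = rolling_estimates(season[:2])
```

Here the first two entries of `full` match `partial` exactly: adding the third and fourth games leaves the earlier estimates untouched, which is what an all-at-once regression cannot promise.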

2) Every piece should be linearly decomposable into its constituent parts. Part of this is that if one piece can be improved — such as goaltender performance — we can just hot-swap that piece with the improvement.

3) If possible, our methods should depend on generative models so that we can simulate game outcomes as closely as possible. Partly, it validates these statistical methods by letting us see how a game could come about; partly it gives us transparent prediction and forecasting.

4) Relating to the previous point: everything should be validated based on its ability to predict future outcomes on a grander scale. We shall not judge based on eyeball fit3 but by overall measures of predictive scale.

5) No magic numbers: that is, no constants in the equation that appear only because we needed something to fit, such as the idea that a replacement goaltender’s save percentage should be 0.900 without justification. Some can’t be avoided in the short term (like, say, declaring that the 10 worst goaltenders in the league are “replacements”), but the more we can justify these through design choices, or fit them explicitly through data, the better. At best, these numbers are placeholders.
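As a rough illustration of fitting a placeholder from data rather than hard-coding it, the sketch below computes a replacement-level save percentage as the pooled save percentage of the worst goaltenders. The function name, the tuple layout and the sample numbers are all assumptions made up for this example, and the cutoff itself remains a design choice.

```python
def replacement_save_pct(goalies, n_worst=10):
    """Pooled save percentage of the n_worst goaltenders by individual save percentage.

    goalies: list of (name, saves, shots_faced) tuples; shots_faced must be > 0.
    n_worst: placeholder cutoff; this is itself a design choice, not a fitted value.
    """
    # Rank from worst to best individual save percentage.
    ranked = sorted(goalies, key=lambda g: g[1] / g[2])
    pool = ranked[:n_worst]
    total_saves = sum(saves for _, saves, _ in pool)
    total_shots = sum(shots for _, _, shots in pool)
    # Pooling weights each goaltender by shots faced rather than equally.
    return total_saves / total_shots


# Made-up numbers, purely for illustration.
sample = [("A", 920, 1000), ("B", 450, 500), ("C", 1790, 2000), ("D", 910, 1000)]
baseline = replacement_save_pct(sample, n_worst=2)
```

Pooling saves and shots, rather than averaging the individual percentages, keeps a goaltender with very few shots faced from dominating the baseline.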

6) Numbers should be independent of managerial choices. This is tougher than it sounds, since the best players get the best ice time and scrubs never see the power play, but it’s as much a commitment to form as it is a promise. For example, it’s unfair to judge a pitcher’s ability as lesser because they happen to pitch a scoreless 6th and not a scoreless 9th; the manager made that call.

7) If we can’t see the counterfactual case, we can’t include it. For now, that includes hits and turnovers, because I don’t have enough data to see what would have happened if the player didn’t make the hit or didn’t cough up the puck. (See point 3: I can’t simulate it.) Maybe with better data, but not today.

With those guidelines set up (and ready to be broken if necessary), it’s a little more straightforward to describe which of the systems we’ve seen previously follow which rules. First, let’s assume that goaltenders are easy: Goals Against Average is a team metric far more than an individual one, and basic save percentage does a decent job. We can (and will) do more later.

Goal Plus-Minus — Various (1950s-present)

This is worth mentioning purely as a baseline value: a player’s goal differential is predictive of their future performance, because it’s the quality that matters the most to the game. Unfortunately it’s a bit akin to the idea that eating fat will make you fat because they’re made of the same stuff; the lack of sample size and the confounding with your linemates are both problems that make this essentially unusable as a true comparative statistic.

Pros:
Easy to calculate.
Minimal amounts of data are needed compared to the other methods.
Easiest appreciation of defensive ability around.

Cons:
Essentially even strength only.
Way too sensitive to team effectiveness.
Doesn’t blend well with shooter statistics (goals and assists).

Bear both these points in mind as we move forward.
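For reference, the baseline calculation can be sketched in a few lines of Python. The event layout and key names below are hypothetical, not the RTSS format, and this version keeps only even-strength goals, in line with the first con listed above.

```python
def goal_plus_minus(goal_events, player):
    """Even-strength goal differential for one player.

    goal_events: list of dicts with hypothetical keys:
        'scoring_on_ice'   - set of skaters on ice for the scoring team
        'conceding_on_ice' - set of skaters on ice for the team scored on
        'even_strength'    - True if both teams had the same number of skaters
    """
    total = 0
    for event in goal_events:
        if not event["even_strength"]:
            continue  # this sketch ignores special-teams goals entirely
        if player in event["scoring_on_ice"]:
            total += 1
        elif player in event["conceding_on_ice"]:
            total -= 1
    return total


# Made-up goals, purely for illustration.
goals = [
    {"scoring_on_ice": {"P", "X"}, "conceding_on_ice": {"Q"}, "even_strength": True},
    {"scoring_on_ice": {"Q"}, "conceding_on_ice": {"P", "X"}, "even_strength": True},
    {"scoring_on_ice": {"P"}, "conceding_on_ice": {"Q"}, "even_strength": False},
    {"scoring_on_ice": {"P"}, "conceding_on_ice": {"Q"}, "even_strength": True},
]
rating = goal_plus_minus(goals, "P")
```

Note that player X ends up at zero purely by sharing the ice with P for one goal each way, which is the linemate confounding described above in miniature.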

Player Contribution — Alan Ryder, 2003

This is the earliest attempt I’ve found when trying to assess total value; it works on the concept of Win Shares, in that players who are responsible for success above the baseline receive credit in terms of “Marginal Goals” which are then converted to wins as a base currency.4

Pros:
Was first (and is criminally unknown by the community at large).
Uses commonsense methods for calculating goaltending prowess.
Accounts for penalty-taking and penalty-killing in its piecewise assignments.

Cons:
Uses magic numbers, at least in its original incarnation, particularly for threshold values.
Does not adjust for relative skill of players (opponents or teammates).
Uses goals as the currency for players, not expected goals or other goal precursors.

It’s a good starting point, because it breaks everything down into its component pieces. And it certainly has pieces that are worth saving and preserving; once the game is broken down like this, it’s tough to avoid. This leads to…

Goals Versus Threshold (GVT), 2009

A reasonably popular measure used by Hockey Prospectus, this has many of the same elements as PC above, with one exception: regular GVT scores tend to be published. Otherwise, it assigns scoring contributions to each player on the ice for the team when goals are scored, gives extra credit to goals and assists, and does not use expected shots or account for teammates. Both PC and GVT are mission-specific: each author chose this level of detail so that the method could be applied across decades. That is wonderful and admirable for all us geeks who want a fuller picture of the game over time, but less useful for prediction given what’s collected today.

So that leads to our modern-era predictors, or our best one (or two) shot numbers:

Modern Plus-Minus: Fenwick Close, 2011 (ish)

Take all unblocked shot attempts (saved shots, missed shots and goals) for and against a team when a player is on the ice, and only those when the game is “close” (with varying definitions, but essentially within one goal). There are a few more of these around, varieties known under the names Corsi and Fenwick for their coiners, but this would seem to be the most respected version of those modern pure plus-minus statistics. I include this not only for completeness but to point out that many critics of these methods falsely suggest that their proponents treat it as a one-number be-all and end-all.

Pros:
Captures a lot of scoring-type events that were otherwise ignored.
Removes noisier/known-to-be-biased observations.
Popular.

Cons:
Treats shots and goals as equal contributions.
Ignores shot quality (strongly by choice; many might call this a pro).
Still doesn’t directly account for teammate and opponent collinearity.
Removes potentially useful observations.
Popular with people who call this “possession”, even though it’s a necessary combination of possession and offensive-zone location and a direct goal precursor.
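A minimal sketch of the count, assuming the usual Fenwick convention of unblocked attempts (goals, saved shots and missed shots) and a “close” window of one goal either way. The event labels and dictionary keys are hypothetical, chosen for this example rather than taken from any real data feed.

```python
FENWICK_TYPES = {"GOAL", "SHOT", "MISS"}  # unblocked attempts; labels are assumptions


def fenwick_close(events, player, max_margin=1):
    """Return (for, against) unblocked shot attempts with `player` on ice in close games.

    events: list of dicts with hypothetical keys 'type', 'score_diff'
            (score difference before the event, from the shooting team's side),
            'shooting_on_ice' and 'defending_on_ice' (sets of skaters).
    """
    attempts_for = attempts_against = 0
    for e in events:
        if e["type"] not in FENWICK_TYPES:
            continue  # e.g. blocked shots, hits, faceoffs
        if abs(e["score_diff"]) > max_margin:
            continue  # game is not "close"
        if player in e["shooting_on_ice"]:
            attempts_for += 1
        elif player in e["defending_on_ice"]:
            attempts_against += 1
    return attempts_for, attempts_against


# Made-up events, purely for illustration.
events = [
    {"type": "SHOT", "score_diff": 0, "shooting_on_ice": {"P"}, "defending_on_ice": {"Q"}},
    {"type": "BLOCK", "score_diff": 0, "shooting_on_ice": {"P"}, "defending_on_ice": {"Q"}},
    {"type": "GOAL", "score_diff": 3, "shooting_on_ice": {"P"}, "defending_on_ice": {"Q"}},
    {"type": "MISS", "score_diff": -1, "shooting_on_ice": {"Q"}, "defending_on_ice": {"P"}},
]
shots_for, shots_against = fenwick_close(events, "P")
```

The goal scored with a three-goal lead is dropped along with the blocked shot, which shows both filters at work: one removes an event type, the other removes a game state.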

There have definitely been attempts to correct the shortcomings of these numbers, such as the influence of linemates and opponents. The usual approach collects the same statistic for each of those ice-sharers and takes the average weighted by shared ice time, known as Quality of Teammates and Quality of Competition respectively. But these aren’t integrated with the original statistic; they are merely used as an “eye test” to gauge whether they are higher or lower.
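The weighted average in question is simple enough to sketch. The function and argument names below are hypothetical, and the ratings could be any per-player statistic; the same function serves for Quality of Teammates or Quality of Competition depending on which players are passed in.

```python
def quality_of(others, shared_toi, rating):
    """Ice-time-weighted average of a rating over a set of other players.

    others: iterable of player names (teammates for QoT, opponents for QoC).
    shared_toi: dict name -> minutes shared on ice with the player of interest.
    rating: dict name -> that player's own statistic (e.g. a shot differential rate).
    """
    others = list(others)  # allow generators, since we iterate twice
    total_time = sum(shared_toi[p] for p in others)
    if total_time == 0:
        return 0.0  # no shared ice time, so nothing to average
    return sum(shared_toi[p] * rating[p] for p in others) / total_time


# Made-up numbers, purely for illustration.
toi = {"X": 300.0, "Y": 100.0}
ratings = {"X": 2.0, "Y": -2.0}
qot = quality_of(["X", "Y"], toi, ratings)
```

With three times as much shared ice time, X pulls the average to 1.0 rather than the unweighted 0.0. The result sits beside the player's own number instead of adjusting it, which is the limitation noted above.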

Here we’re getting into methods that could actually be considered “advanced” since they go beyond simple counting statistics and into a little more theory. Each of these three methods takes an individual event as the outcome of interest and uses linear or logistic regression to establish how each contributing factor (in this case, the players on the ice, plus other factors like zone position and, particularly, home ice advantage) affects that outcome.

These models produce coefficients for each player or term in the model; Goals above Replacement come from examining the change in probability or expectation across all events and adding up the relative difference.

Pros:
Each rating has teammates and opponents included by default.
Incorporates multiple event types (THoR, at least).

Cons:
Everything has to be done all at once, so old estimates of ability change whenever new data arrive.
Can be computationally expensive; tools are standard but data takes some doing (mainly due to massive sparsity).
Models are not generative.5
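To show what “adding up the relative difference” might look like, here is a heavily simplified sketch assuming a logistic model whose coefficients have already been fitted. The coefficient values, the single coefficient per player and the `replacement_coef` stand-in are all assumptions for illustration; the published methods differ in their details.

```python
import math


def goal_prob(intercept, coefs, on_ice):
    """Logistic probability that an event becomes a goal, given the players on ice."""
    linear = intercept + sum(coefs[p] for p in on_ice)
    return 1.0 / (1.0 + math.exp(-linear))


def goals_above_replacement(player, events, intercept, coefs, replacement_coef):
    """Sum, over the player's events, of P(goal | player) - P(goal | replacement)."""
    swapped = dict(coefs)
    swapped[player] = replacement_coef  # stand-in for a replacement-level skater
    total = 0.0
    for on_ice in events:
        if player in on_ice:
            total += goal_prob(intercept, coefs, on_ice) - goal_prob(intercept, swapped, on_ice)
    return total


# Made-up coefficients and events, purely for illustration.
coefs = {"A": 0.5, "B": -0.2}
events = [{"A", "B"}, {"B"}, {"A"}]
gar_a = goals_above_replacement("A", events, intercept=-2.0, coefs=coefs, replacement_coef=0.0)
```

Because the coefficients come out of one joint fit, refitting with new games changes every term in this sum, including those for events long past.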

G-net through the Mean Even-Strength Hazard model — Thomas et al, 2013

My team and I designed our approach to fix some of our dissatisfaction with existing models — essentially all the models we’ve just covered — though we have our own weaknesses as well. I’ll explain in more detail in the next posts how this works, but the premise is that hockey is a game of events on two goals: a strong offensive player raises the rates of their team scoring goals, and a strong defensive player lowers the rates of their team allowing goals.

Pros:
It can be used to actually simulate games, so the terms are directly interpretable.
Directly incorporates teammate and opponent effects, and other changes in the state of the game.
Uses time on ice directly rather than as regression weights.

Cons:
SLOW. It took running overnight on a multicore processor to calculate everything we needed the first time, even though parts of it were written in C.
Built for an academic audience, rather than general interest (befitting our day jobs).
Needs the full slate of RTSS data to work properly.

Another con: my colleagues and I have done a pretty poor job of sharing our results on a rolling basis. That’s something I intend to change with this series.
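Until those posts arrive, here is a toy sketch of the premise only, not the published model. It assumes a log-linear form in which offensive ratings add to the log scoring rate and opposing defensive ratings subtract from it, then simulates goals with exponential waiting times at a constant rate. All names and numbers are invented for illustration, and line changes are ignored.

```python
import math
import random


def scoring_rate(base_rate, offense, defense, attackers, defenders):
    """Goals per minute for the attacking side under an assumed log-linear form."""
    log_rate = math.log(base_rate)
    log_rate += sum(offense[p] for p in attackers)   # good offense raises the rate
    log_rate -= sum(defense[p] for p in defenders)   # good defense lowers it
    return math.exp(log_rate)


def simulate_goals(rate, minutes=60.0, rng=None):
    """Count goals in `minutes` by drawing exponential waiting times at a constant rate."""
    rng = rng or random.Random()
    clock, goals = rng.expovariate(rate), 0
    while clock < minutes:
        goals += 1
        clock += rng.expovariate(rate)
    return goals


# Made-up ratings, purely for illustration.
offense = {"A": 0.2}
defense = {"D": 0.1}
rate = scoring_rate(0.05, offense, defense, attackers=["A"], defenders=["D"])
goals = simulate_goals(rate, rng=random.Random(1))
```

Running `simulate_goals` once per team gives a simulated final score, which is the sense in which a model like this is generative.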

UPDATE, August 13 2014: Rob Vollman of Hockey Prospectus gives us this list.