Note: This was originally posted at acthomas.ca. Further updates will be posted to this entry.
Summary: In which I introduce the use of multiple factors to produce the rates of events that occur in hockey.
The main purpose of single-number summaries, and the major obstacle to building them, is finding a common currency to relate each event within its proper context. We can only do that with counting measures when we find a way to convert one unit to another in a sensible way, and the easiest way that I know is to express things in terms of the rates at which events happen. There are two processes at work in any hockey game: each team trying to score goals on the other. So if we know the rates at which these events are expected to occur, we have a baseline; better yet, we can see how circumstances like game score, home ice advantage and more tend to affect these rates during the game. That is what I’ll call Point 1:
The relative value of an agent — a team, player, combination of players, or circumstance — is how they change the rate at which events occur, for and against.
This is of course meaningless to our purpose without Point 2:
The only events that matter are predictive or indicative of a goal being scored.
Neither of these should be disputable if what we’re trying to do is predict the outcomes of games, before or during.
Now, any rate needs an elapsed time to calculate, which I’m sure is one reason why this hasn’t seen a lot of use in the community: before the last 10 years we didn’t have a lot of data that came with “time elapsed” attached; it had to be inferred based on other observations, and it certainly wasn’t cross-registered with the player lines on the ice. But with over 10 years of decent-quality data there’s plenty we can do; even the 6 seasons in which missed shots were shared in the Real Time Scoring System are enough to do what we need. First, let’s start with a simple baseline condition:
Baseline: Home and Away Scoring at 5-on-5, With “Score Effects”1
This is fairly straightforward to do: count the number of events for the home and away team in any length of time and divide by the total time to get the rate. Breaking this down by whether a team is leading, tied or trailing, the rates are lower when tied and higher when trailing or leading:
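The counting itself can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figures in this post: the function name, the (score state, seconds, home events, away events) layout and the numbers are all made up for the example.

```python
from collections import defaultdict


def rates_by_state(segments):
    """Return {score_state: (home_rate, away_rate)} in events per minute.

    `segments` is an iterable of (score_state, seconds, home_events,
    away_events) tuples. This layout is a hypothetical one chosen for the
    sketch, not the RTSS format.
    """
    totals = defaultdict(lambda: [0.0, 0, 0])  # seconds, home count, away count
    for state, seconds, home_events, away_events in segments:
        entry = totals[state]
        entry[0] += seconds
        entry[1] += home_events
        entry[2] += away_events
    # Rate = events / exposure time, converted from per-second to per-minute.
    return {
        state: (60.0 * home / seconds, 60.0 * away / seconds)
        for state, (seconds, home, away) in totals.items()
        if seconds > 0
    }


# Made-up stretches of 5-on-5 play, labelled from the home team's perspective.
segments = [
    ("tied", 600, 9, 8),
    ("home_leading", 300, 4, 6),
    ("tied", 600, 11, 12),
]
rates = rates_by_state(segments)
for state, (home_rate, away_rate) in sorted(rates.items()):
    print(f"{state}: home {home_rate:.2f}/min, away {away_rate:.2f}/min")
```

The same function works for any event type, since all it needs is a count and the time over which the count was taken.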
Since we can do this with any event type, let’s go all-in and do it with Corsi events: goals, saves, missed shots and blocked shots. The number of events is big enough that we can break the score into 5 groups:
These are the score effects that others have noted: shot rates rise as a team’s goal differential becomes more negative. That it disagrees with the score effects on goals is interesting in that it suggests we should be tracking rates for events differently based on their scoring potential; the only way this works is if a larger share of the shots taken by teams that are ahead are converted into goals. The effect of this remains to be measured fully with these data, but I repeated this procedure splitting by scoring zone (as in hextally) and the same shot pattern holds each season. This probably just means that odd-man rushes go up. I can’t infer that from the RTSS data, but it is clear that if we calculate shooting percentage by location, it correlates with positive goal differential in each class; a more aggressive style of play by the catch-up team must give the team ahead shots that are tougher to stop when on goal.
None of this is new to us, dude. So What?
This confirms that rate measurement reproduces a number of well-known and understood results. Here’s why we needed the beginning stuff: each of the events we care about changes the rate at which we can expect new events to occur, and these events mark when the changes occur and what we should expect to happen next in the game. A team with possession of the puck in its offensive zone is expected to record a shot on goal far sooner than its opposition, but knowing what probabilities to associate with that position is what gives us the real value of holding it.
The simplest model we have is that the rate depends on the last event we observe. If it was in the home offensive zone, then the home team should be expected to shoot more frequently than the away team in the short term. So we get something like this, with placeholder values for the rates:
We can look at each of these events individually to figure out what their rates should be, but we’ll be far more effective if we can combine these factors directly: if the home team has a two-goal lead and the puck is in their offensive zone, at what rate should their next shot be expected? At what rate should their opponents’ next shot arrive? And what would change the rate next?2
Let’s take the first cut of estimating models for home and away Corsi with two multiplicative factors:
[rate of events] = [baseline rate by score] * [adjustment for known zone]
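A two-factor multiplicative model of this shape can be fitted by maximum likelihood without much machinery. The sketch below is one way to do it, not necessarily the procedure used for the estimates in this post. It assumes the rate is constant within each cell of play, so that the log-likelihood is the sum of events × log(rate) − rate × time over cells, and it alternates closed-form updates for the two sets of factors. The name fit_two_factor_rates, the cell layout and the toy numbers are all hypothetical; the zone factor is pinned to 1 in a reference zone so that the two factors are identifiable.

```python
from collections import defaultdict


def fit_two_factor_rates(cells, ref_zone="neutral", n_iter=200):
    """Fit rate = base[score_state] * adj[zone] by alternating MLE updates.

    `cells` is a list of (score_state, zone, minutes, events) tuples
    (a hypothetical layout). Every state and zone needs some events and
    some exposure time, otherwise an update would divide by zero.
    """
    states = {s for s, _, _, _ in cells}
    zones = {z for _, z, _, _ in cells}
    events_by_state = defaultdict(float)
    events_by_zone = defaultdict(float)
    for s, z, _, events in cells:
        events_by_state[s] += events
        events_by_zone[z] += events

    base = {s: 1.0 for s in states}
    adj = {z: 1.0 for z in zones}
    for _ in range(n_iter):
        # Holding adj fixed, the likelihood is maximised by
        # base[s] = events in s / sum of adj[z] * minutes over cells in s.
        exposure = defaultdict(float)
        for s, z, minutes, _ in cells:
            exposure[s] += adj[z] * minutes
        base = {s: events_by_state[s] / exposure[s] for s in states}
        # The mirror-image update for the zone adjustments.
        exposure = defaultdict(float)
        for s, z, minutes, _ in cells:
            exposure[z] += base[s] * minutes
        adj = {z: events_by_zone[z] / exposure[z] for z in zones}
        # Rescale so the reference zone has adjustment exactly 1;
        # the products base[s] * adj[z] are unchanged by this step.
        scale = adj[ref_zone]
        adj = {z: value / scale for z, value in adj.items()}
        base = {s: value * scale for s, value in base.items()}
    return base, adj


# Toy cells built from known, made-up factors, to check they are recovered.
true_base = {"leading": 0.9, "tied": 1.0, "trailing": 1.1}
true_adj = {"neutral": 1.0, "offensive": 1.5, "defensive": 0.4}
cells = []
for i, s in enumerate(true_base):
    for j, z in enumerate(true_adj):
        minutes = 60.0 + 15.0 * i + 10.0 * j
        cells.append((s, z, minutes, true_base[s] * true_adj[z] * minutes))

base, adj = fit_two_factor_rates(cells)
print(base)
print(adj)
```

With real data each team would get its own fit (or its own set of factors in a joint fit), and a time that ends in the other team’s event or a stoppage simply adds exposure without adding an event.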
At any point in the game, three things can be about to happen: the home team records an event, the away team records an event, or something else happens that censors our observation; since only one of these can happen first, whichever does censors the other two. We solve for the situational rates simultaneously using maximum likelihood estimation and the data from the last 6 NHL seasons.3 The best solution for the baseline rates, taking the puck in the neutral zone as the reference, is around 1 Corsi event per minute for the home team and 0.95 for the away team, with a little more when trailing and a little less when leading. When in the offensive zone, a team’s expected rate rises to about 1.44 times their baseline, and their rate against drops to about 0.36 times baseline, which seems reasonable. It also means that we’ll get these events in regular clumps, which the strict Poisson process model outright denies will happen. This elevated (or depressed) rate only lasts until the next event is recorded (a hit, change, shot, turnover, etc.), at which point we re-evaluate. An event happens roughly every 15 seconds in the RTSS, so this “super mode” doesn’t last long on its own.
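To put numbers on how short-lived this “super mode” is, here is a rough worked example in Python. It assumes, as a simplification, that waiting times are exponential at a constant rate over a fixed 15-second window; it uses the rounded baselines and zone multipliers quoted above and ignores the smaller score adjustment; and it reads “rate against” as the opposing team’s rate relative to its own baseline. The function name is made up for the illustration.

```python
import math


def next_event_probs(rate_home, rate_away, window_min):
    """Chance that each team records the first event within the window.

    Rates are in events per minute. The two teams are treated as competing
    exponential clocks, and the window closing censors both of them.
    """
    total = rate_home + rate_away
    p_any = 1.0 - math.exp(-total * window_min)
    return {
        "home": p_any * rate_home / total,
        "away": p_any * rate_away / total,
        "neither": 1.0 - p_any,
    }


# Rounded estimates quoted in the text (Corsi events per minute, and
# multipliers relative to the neutral zone).
HOME_BASE, AWAY_BASE = 1.0, 0.95
OFFENSIVE_ZONE, DEFENSIVE_ZONE = 1.44, 0.36

# Puck in the home team's offensive zone, 15-second window.
probs = next_event_probs(
    HOME_BASE * OFFENSIVE_ZONE,
    AWAY_BASE * DEFENSIVE_ZONE,
    window_min=15 / 60,
)
for outcome, p in probs.items():
    print(f"{outcome}: {p:.3f}")
```

Under these assumptions the home team records the next Corsi event inside the window roughly 29% of the time and the away team about 7% of the time; most windows end with neither, which is why the rate has to be re-evaluated so often.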
There’s one easy factor we haven’t added yet that we know: which team won a faceoff in a particular zone, to give us something on the value of possession. Of the six possible results, guess how they break down?
The short term value of possession is clear: for the goal scoring process, having possession in the offensive zone is the only combination predictive of scoring in the short term. (Which shouldn’t be too surprising.) On the flip side, we do see that in the short term, location is more valuable than possession; even losing an offensive zone faceoff means that your team will have a higher chance of scoring than your opponents who won it but remain in their own territory.4
Next stop: How we combine these factors with team and player information, auto-correcting for zone starts, known positions and quality of competition and teammates.