Note: This was originally posted at acthomas.ca. Further updates will be posted to this entry.
We’ve got the notion now that the rates at which events occur, and what happens to the game because of these events, drive my understanding of the game. The last post mentioned a little about differing types of events, though we haven’t established anything definitive about their effectiveness or how we should divide them up. The data we have available for the past six seasons has this sort of location information for shot attempts:
- GOALs and SHOTs have (x,y) coordinates and distances from the net.
- MISSed shots have distances from the net. (I impute these locations based off SHOT locations, conditioning on distance from the net; it’s not perfect but it’s a start.)
- BLOCKed shots have neither.
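To make the imputation step for MISSed shots concrete, here is a minimal sketch of one way it could work: borrow the (x,y) of a randomly chosen SHOT recorded at a similar distance. The function name, the tuple layout and the `tolerance` window are all assumptions made for illustration, not the procedure actually used here.

```python
import random

def impute_miss_location(miss_distance, shots, tolerance=2.0, rng=random):
    """Borrow an (x, y) for a MISS from a SHOT at a similar distance.

    `shots` is a list of (x, y, distance) tuples for recorded SHOTs.
    `tolerance` (same units as distance) is a hypothetical window width.
    """
    # Keep only SHOT locations whose distance is close to the miss distance.
    candidates = [(x, y) for (x, y, d) in shots
                  if abs(d - miss_distance) <= tolerance]
    if candidates:
        # Sampling at random preserves the spread of SHOT locations
        # at that distance instead of collapsing them to one point.
        return rng.choice(candidates)
    # No SHOT inside the window: fall back to the closest distance match.
    x, y, _ = min(shots, key=lambda s: abs(s[2] - miss_distance))
    return (x, y)
```

Sampling rather than averaging keeps the imputed misses spread across both sides of the ice, which matters once the locations are binned into zones.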
We know that different arenas have biases when it comes to recording missed and blocked shots, and that there are a number of quality issues when it comes to the (x,y) locations of shots, but we do the best we can with what we have. One of the simplest approaches we can take is to bin the data into smaller blocks that are descriptive of general areas but insensitive to small changes. Balancing these areas for an equal number of shots is difficult without making them less comprehensible (there’s a reason we have shorthand terms for the point, the slot and down low, after all), but if the counts are reasonably close, we can still get a pretty good picture.
Hextally currently breaks the offensive zone down into 16 pieces:
This preserves the “home plate” area that is often used to categorize whether an attempted shot on goal is considered a “scoring chance”. Sixteen zones is great for doing a quick breakdown of shooting preferences, but for smoothing over the whole space to categorize team performance, it’s probably too many, so I’m going to group these events into three groups based on their empirical probabilities of success:
Note that the “home plate” is expanded back to the blue line down the middle, and I’ve added a “super scoring chance” area in the slot;¹ the league-wide goal probabilities from these areas for unblocked shot attempts are 0.02, 0.06 and 0.13 respectively.
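The empirical goal probability for each group is just goals divided by unblocked attempts from that group. A minimal sketch, assuming each event has already been tagged with a zone group; the labels LOW, MEDIUM and SLOT and the (event type, zone) tuple layout are placeholders chosen for illustration:

```python
from collections import Counter

def goal_probability_by_zone(events):
    """Return {zone: goals / unblocked attempts} for each zone group.

    `events` is an iterable of (event_type, zone) pairs, where event_type
    is one of "GOAL", "SHOT", "MISS" or "BLOCK".
    """
    attempts = Counter()
    goals = Counter()
    for event_type, zone in events:
        if event_type == "BLOCK":
            continue  # blocked shots carry no location, so they are excluded
        attempts[zone] += 1
        if event_type == "GOAL":
            goals[zone] += 1
    return {zone: goals[zone] / attempts[zone] for zone in attempts}
```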
How To Split The Load
So now we’ve got 10 different sub-events going for us: BLOCKed shots, and three each for MISS, SHOT and GOAL depending on the zone. Since we’re talking about rates of events happening, there are two directions we could consider taking, which ought to produce similar results:
1) Measure one basic rate for all shooting events (Corsi rate), then look at each type of event as if it’s a subtype of that. The biggest pro is that if we really think that Corsi is a direct proxy for offensive zone possession, then it’s the biggest driving factor for variation, making our other estimates cleaner. The biggest con is that it shunts relevant information — puck location — to a lower level of the model.
2) Measure all four event types with their own rates; that is, BLOCK, LOW, MEDIUM and SLOT, then make adjustments to their goal predictive status later. If we had the locations of blocked shots, we could mix them in and improve our estimation, but we’re stuck without them for now.
I prefer option 2 mainly because the math will be easier, but also because it makes for easier story-telling: we’re explicitly measuring scoring chances and super-scoring chances against lower-probability chances, and cleaning up the rest later. Either way we go, it has to be established whether shot position matters both for immediate goal probability (which it clearly does) and for the prediction of goals in future games (which is less than clear), since it’s been argued that “shot quality” is of minimal predictive value; we’ll get to that in coming posts.²
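As a rough illustration of option 2, the ten sub-events collapse into four rate categories: BLOCK, plus one category per zone group regardless of whether the attempt was a MISS, SHOT or GOAL. The sketch below turns raw counts into events per 60 minutes; the function names and the input layout are hypothetical, and a real rate model would estimate these jointly rather than by simple division.

```python
from collections import Counter

CATEGORIES = ("BLOCK", "LOW", "MEDIUM", "SLOT")

def event_category(event_type, zone):
    """Map a sub-event to one of the four rate categories."""
    # BLOCKs have no location; everything else is classified by zone group.
    return "BLOCK" if event_type == "BLOCK" else zone

def rates_per_60(events, seconds_played):
    """Return events per 60 minutes of play for each category."""
    counts = Counter(event_category(t, z) for t, z in events)
    hours = seconds_played / 3600
    return {cat: counts.get(cat, 0) / hours for cat in CATEGORIES}
```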
Splitting these up by an observable characteristic also tells us something interesting about travel fatigue. On reading a couple of different analyses of back-to-back games, it seemed like a natural move to build this into the rate model: add an effect to each rate to adjust for a team playing for the second night in a row. There are four possibilities: home-after-home, home-after-away, away-after-home and away-after-away. By splitting events into the three goal probability groups above, and solving for the most likely rate adjustments for each on offense and defense over the past six seasons at even strength, we get this:
The yellow squares show rates that are statistically significantly different from 100%, and the big whopper is under “Slot For”: teams on the second day of a back-to-back are on average 5% worse at attempting close shots (including close tip-ins or deflections), and teams on the second day of an away-after-away are also 6% worse at preventing close shots. Any schedule that loads a team with these is absolutely hurting the number of points they can score.³
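For intuition about what a “rate adjustment” and its significance mean here, consider a simplified stand-in for the full model: if the back-to-back indicator were the only factor, the most likely multiplicative adjustment under a Poisson model is the ratio of the two observed rates, and an approximate 95% interval on the log scale shows whether it can be told apart from 100%. The function name and the counts in the usage lines are made up purely for illustration.

```python
import math

def rate_ratio(count_b2b, time_b2b, count_rest, time_rest, z=1.96):
    """Ratio of the back-to-back rate to the rested rate, with a Wald interval.

    Counts are event totals; times are total exposure in any common unit.
    Returns (ratio, lower, upper) for an approximate 95% interval.
    """
    ratio = (count_b2b / time_b2b) / (count_rest / time_rest)
    # Standard error of log(ratio) for two independent Poisson counts.
    se_log = math.sqrt(1 / count_b2b + 1 / count_rest)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return ratio, lower, upper

# Made-up numbers: 950 slot attempts in 1000 hours on back-to-backs,
# against 10000 slot attempts in 10000 hours otherwise.
ratio, lower, upper = rate_ratio(950, 1000.0, 10000, 10000.0)
significant = not (lower <= 1.0 <= upper)  # interval excludes 100%?
```

The actual estimates adjust for everything else in the rate model at the same time, so this two-group comparison only shows the shape of the calculation.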
With each of these background factors in place, we now have what we need to put together ability measures for individual teams and players. Stay tuned for part 4 where we’ll get into this!