An Evaluation of the Total Hockey Rating (THoR), Part II

This is the second part of a series that looks in more depth at the Total Hockey Ratings (THoR) that we developed. In Part I, I looked at the year-to-year correlations for THoR, Gabe Desjardins’ Corsi Rel and (HART). We found that THoR was a more reliable predictor than the other two metrics. As I mentioned at the end of that post, this internal consistency is good, but for a metric to be useful it also has to be valid. Validity is a measure of whether or not a metric is related to the thing it is supposed to capture, which here means winning. We’ll see below that THoR has a good deal of validity.

Before we get to the validity part, I wanted to share a couple of other results that we have now run. I had initially resisted doing an even-odd sort of out-of-sample correlation because our code for calculating THoR is in Python (and it still takes 10+ hours to run a year’s worth of data) and it would have been a nightmare to recode all that for a separate analysis. My co-author Jim Curro wrote the Python code from some original R code. However, recently, I thought of a brilliant shortcut using some of the output we get from the Python code. Only took a couple of months to figure that one out, but so be it. Anyway, I was able to get it running. I took the 2010-11 and 2011-12 seasons together and split them by taking every other even-strength play. Given the collinearity among linemates (linemates that play together often), we prefer to use one year of data as a minimum, which is why the split uses two seasons rather than one. For players who were on the ice for a minimum of 1000 plays, the correlation between the two halves was 0.686. For players who were on the ice for a minimum of 2000 plays, the correlation was 0.724. For 2009-10 and 2010-11, the same alternating-play analysis gave 0.681 and 0.708 for the 1000- and 2000-play minimums, respectively. Combining these statistics with our year-to-year correlations in Part I, it seems that what THoR measures is a repeatable skill.
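For readers who want to try something similar, here is a minimal sketch of the alternating-play check. It is not our THoR code. It assumes you already have a rating for each player from the even-numbered plays and another from the odd-numbered plays, plus a count of plays on the ice; the function names and arguments are made up for illustration.

```python
import numpy as np


def alternate_split(n_events):
    """Return two boolean masks that assign every other play to each half."""
    idx = np.arange(n_events)
    return idx % 2 == 0, idx % 2 == 1


def split_half_correlation(rating_even, rating_odd, n_plays, min_plays):
    """Pearson correlation between the two half-sample ratings.

    Only players with at least `min_plays` plays on the ice are kept.
    All inputs are hypothetical arrays aligned by player.
    """
    rating_even = np.asarray(rating_even, dtype=float)
    rating_odd = np.asarray(rating_odd, dtype=float)
    keep = np.asarray(n_plays) >= min_plays
    if keep.sum() < 2:
        raise ValueError("need at least two qualifying players")
    return float(np.corrcoef(rating_even[keep], rating_odd[keep])[0, 1])
```

Calling `split_half_correlation(..., min_plays=1000)` and then again with `min_plays=2000` would mirror the two thresholds reported above, assuming the ratings for each half were estimated separately.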

Speaking of repeatability, here are the coefficients for the NP20 score effect, for home ice and for a zone start by season using just data from an individual year. The values are tiny because they are the per play values. However, these should give you a sense of the consistency of some of the effects in our model. Over these four seasons the NHL has been pretty consistent.

Table 1: Year to Year estimates of effects

Season	Score Effect	Home Ice	Zone Start
09-10	-0.0008	0.0012	0.0056
10-11	-0.0004	0.0012	0.0057
11-12	-0.0012	0.0012	0.0051
12-13	-0.0006	0.0008	0.0055

Now we still have this question of whether or not THoR is measuring something that is related to winning. Recall that the THoR model is based upon a ridge regression that includes players but also the following factors: home ice, rink, quality of teammates (QoT), quality of opponents (QoC), zone starts (ZS) and score effects. All of those are information we have about each play at even strength. Our outcome (or response) metric is NP20. NP20 is a net probability: the probability that a given event leads to a goal for the home team minus the probability that the event leads to a goal for the away team. We count goals as shots, and for each shot we calculate the probability that it was a goal plus the probability that it resulted in a goal in the subsequent 20 seconds. Hence the 20 in NP20. (We used 10 seconds when we started this process in 2008, but we now use 20 to be conservative about the value of a play.) There are some other nice features to THoR. Have a look at the paper here for more details. We’ve also added some additional aspects to the model since the original paper. The THoR page has the latest on these. Player values for the home and away teams are entered as positive and negative, respectively, to agree with the sign of NP20. Thus the THoR player rating is one that tries to extract the impact of an individual player over the course of the events recorded while they are on the ice.
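To make the sign convention concrete, here is a toy sketch of a ridge regression with home players entered as +1 and away players as -1. It is an illustration only and not the THoR implementation: it leaves out home ice, rink, QoT, QoC, zone starts and score effects, and the function names, penalty value and NP20 numbers are all made up.

```python
import numpy as np


def build_design_row(home_ids, away_ids, n_players):
    """One row per play: +1 for home skaters on the ice, -1 for away skaters."""
    row = np.zeros(n_players)
    row[list(home_ids)] = 1.0
    row[list(away_ids)] = -1.0
    return row


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_cols = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_cols), X.T @ y)


if __name__ == "__main__":
    # Four hypothetical players and three hypothetical plays.
    X = np.vstack([
        build_design_row([0, 1], [2, 3], 4),
        build_design_row([0], [2], 4),
        build_design_row([1], [3], 4),
    ])
    np20 = np.array([0.02, -0.01, 0.03])  # invented per-play NP20 values
    print(ridge_fit(X, np20, lam=1.0))
```

The penalty `lam` is what shrinks the player estimates toward zero, which matters when linemates almost always appear on the ice together.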

We should note that we are using a proxy for validity here. We don’t have a ground truth of player value. No one does. Instead, what we have is whether or not a given measure can predict the winning of a hockey game. That’s good but not perfect. Recall that THoR is based on probabilities of events leading to goals, and we know that goal differential is highly correlated with points at the team level. So there is some inherent validity built in. To look at the validity of THoR through the lens of winning individual games, I summed NP20, which is the response for THoR, over the first three periods of each game from 2009 to 2013. I threw out games that went to OT or a shootout, so what we have here is only games decided in regulation, and I tracked whether or not NP20 was a predictor of the winner. From the NP20 perspective, a team with a positive NP20 should be probabilistically outperforming the other team. I also did this for the same seasons for Corsi, Fenwick and shots. The data for Corsi and Fenwick were pulled using the invaluable nhlscrapr package in R from Andrew Thomas and Sam Ventura. The results are below in Table 2. The values in the table are the proportion of games in which winning the XX battle led to the correct prediction of who won the game, where XX is the given metric. That is, teams that out-Corsi’d their opponents at 5v5 within 1 goal won 59.1% of games.

Table 2: Proportion of Games Won by the Team Leading in NP20, Corsi, Fenwick and Shots

Conditions	NP20(THoR)	Corsi	Fenwick	Shots
5v5	0.519	0.530	0.461	0.406
5v5 within 2	0.537	0.568	0.520	0.452
5v5 within 1	0.573	0.591	0.580	0.493
5v5 tied	0.589	0.607	0.620	0.538
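A rough sketch of how a proportion like those in Table 2 could be computed is below. It assumes each regulation game has been reduced to two numbers from the home team’s point of view: the metric differential (NP20, Corsi, Fenwick or shots) under the chosen condition and the final goal differential. Skipping games where the metric is exactly level is an assumption of this sketch and may not match how the table was built.

```python
def leader_win_rate(games):
    """Share of games in which the team ahead on the metric also won.

    `games` is an iterable of (metric_diff, goal_diff) pairs, both taken
    from the home team's perspective. Games with a level metric or a
    level score are skipped (an assumption of this sketch).
    """
    correct = 0
    total = 0
    for metric_diff, goal_diff in games:
        if metric_diff == 0 or goal_diff == 0:
            continue
        total += 1
        # The prediction is right when both differentials share a sign.
        if (metric_diff > 0) == (goal_diff > 0):
            correct += 1
    if total == 0:
        raise ValueError("no usable games")
    return correct / total
```

Running this once per metric and per condition (5v5, within 2, within 1, tied) would fill in a grid with the same shape as Table 2.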

So each of these methods improves its predictive accuracy as we narrow the conditions. At 5v5 tied all do their best. Score effects are clear from this. Fenwick at 5v5 tied is the best at 62%. Interestingly enough, this tweet came along recently from Josh Weissbock.

Holy crap… I’ve managed to increase my single game accuracy predictions using #machinelearning up to 61.5%

— GritHeartTruculence (@joshweissbock) November 21, 2013

So none of these methods is too far from that value, though shots, even at 5v5 tied, is well below the others. THoR’s NP20 is below Corsi and Fenwick but not by much, and it is probably only statistically significantly below Fenwick. We, of course, have to recognize that there is some noise among these estimates that is not sampling variability. But at 5v5 tied Corsi and Fenwick both beat NP20, if not significantly in the case of the former. The point of this exercise was to look at the validity of THoR. Pretty clearly THoR is a valid metric, that is, one that is related to winning. Additionally, the things in THoR that are not in Corsi (besides the ones in Table 1) include things that are for the most part repeatable, such as faceoff wins and penalties. So there are good reasons to have them there for assessing player value. THoR is not the be-all and end-all, and we’ll keep tweaking it. But in light of these analyses it is a metric that is useful for assessing NHL player value. Ultimately THoR gives about 20% more reliability while being a little lower on validity.

One logical next step is to consider a THoR-like model with a Fenwick or Corsi response, or with some other similar response metric. We have begun looking into this and our preliminary findings are underwhelming. In order to get the year-to-year correlations to around 0.4, we have to shrink the estimates by a very large amount. None of the range of models we have looked at so far yields year-to-year correlations close to the approximately 0.65 values for THoR. This highlights another advantage of THoR. Using every play gives large sample sizes after fewer games than Corsi or Fenwick. That is only an advantage if those data are useful, and clearly, in terms of evaluating player reliability using THoR, they are.

In the next part of this series, we’ll look at some specific players through the lens of THoR.
