In my training as a statistician, we talked a great deal about sufficient statistics. While the mathematics is complex, the concept is quite simple: a statistic is sufficient if it carries just as much information about the quantity being estimated as the sample itself (i.e. there is no loss of information). For example, if one can tolerate the assumption of a normal (bell-shaped) distribution, we can reduce a hundred observations to three numbers: the sample size, the mean, and the standard deviation. Instead of considering 100 numbers, I can consider three numbers without losing anything. Such dimension reduction is the bedrock of the statistical method, as we strive to reduce a large sample to a few summary stats.
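To make the "no loss of information" claim concrete, here is a minimal sketch in Python. It assumes a normal model and a simulated sample of 100 values (the data, seed, and function names are my own illustration, not anything from a real dataset). It computes the normal log-likelihood twice, once from all 100 observations and once from only the sample size, mean, and standard deviation, and checks that the two agree for any candidate parameters.

```python
import math
import random
import statistics


def loglik_from_sample(xs, mu, sigma):
    """Normal log-likelihood computed from every individual observation."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in xs
    )


def loglik_from_summary(n, mean, sd, mu, sigma):
    """Same log-likelihood using only n, the sample mean, and the sample SD.

    Uses the identity sum((x - mu)^2) = (n - 1) * sd^2 + n * (mean - mu)^2,
    where sd is the sample standard deviation with the n - 1 divisor.
    """
    sum_sq = (n - 1) * sd**2 + n * (mean - mu) ** 2
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - sum_sq / (2 * sigma**2)


# Simulated data (an assumption for illustration only).
random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(100)]
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # n - 1 divisor

# Compare the two computations at a couple of candidate parameter values.
for mu, sigma in [(10.0, 2.0), (9.0, 3.0)]:
    full = loglik_from_sample(sample, mu, sigma)
    reduced = loglik_from_summary(n, mean, sd, mu, sigma)
    print(f"mu={mu}, sigma={sigma}: full={full:.6f}, summary={reduced:.6f}")
```

Because the likelihood depends on the data only through those three numbers, anything we could infer about the mean and spread from the full sample can be inferred from the summary.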
Returning to the subject at hand, let's say we're interested in how often a batter is walked (not beaned) in baseball. If we assume this distribution is binomial in nature (i.e. there's some constant probability of walking for this particular batter), we can use the number of plate appearances (PA) and the bases on balls (BB) to estimate the probability of a four-ball walk (BB/PA). So BB and PA together contain all the information about that probability that we could glean from the play-by-play and are therefore sufficient statistics in this scenario.
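Here is a small Python sketch of that idea, under the constant-probability assumption. The two plate-appearance sequences are made up for illustration: they differ in order but share the same BB and PA, so they give the same estimate and the same likelihood at any candidate walk probability.

```python
import math


def walk_rate(bb, pa):
    """Estimate the walk probability as BB / PA."""
    if pa <= 0:
        raise ValueError("PA must be positive")
    return bb / pa


def sequence_likelihood(outcomes, p):
    """Likelihood of a play-by-play sequence (1 = walk, 0 = anything else)
    when every plate appearance has the same walk probability p."""
    likelihood = 1.0
    for outcome in outcomes:
        likelihood *= p if outcome else (1.0 - p)
    return likelihood


# Two hypothetical sequences of 10 plate appearances with 2 walks each.
seq_a = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
seq_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

for seq in (seq_a, seq_b):
    bb, pa = sum(seq), len(seq)
    print(f"BB={bb}, PA={pa}, estimated walk rate={walk_rate(bb, pa):.3f}")

# The order of events does not matter: only BB and PA enter the likelihood.
same = math.isclose(sequence_likelihood(seq_a, 0.15), sequence_likelihood(seq_b, 0.15))
print("Same likelihood at p = 0.15:", same)
```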
What about the more complicated scenario of batter hitting tendencies? We might assume that there is a multinomial distribution governing a player's chance of reaching each base on a hit. Again, we can estimate the respective probabilities governing this distribution by looking at all at-bats (AB) of this batter and comparing them to the number of singles, doubles (2B), triples (3B), and home runs (HR). Though singles are not recorded individually, they are included in total hits (H), so the number of singles can be computed by subtraction (S = H - 2B - 3B - HR). So we see these statistics are sufficient for the theoretical base distribution of any particular hitter.
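As a rough sketch of that calculation in Python, the function below recovers singles from H, 2B, 3B, and HR and then estimates the multinomial probabilities as counts divided by AB. The stat line is invented for illustration, and the "no hit" category (AB - H) is my own label for every at-bat that did not end in a hit.

```python
def hit_type_probabilities(ab, h, doubles, triples, hr):
    """Estimate multinomial probabilities for each at-bat outcome.

    Singles are not reported directly, so they are recovered as
    S = H - 2B - 3B - HR.
    """
    singles = h - doubles - triples - hr
    if ab <= 0 or h > ab or singles < 0:
        raise ValueError("inconsistent stat line")
    counts = {
        "no hit": ab - h,
        "1B": singles,
        "2B": doubles,
        "3B": triples,
        "HR": hr,
    }
    return {outcome: count / ab for outcome, count in counts.items()}


# Hypothetical season line: 500 AB, 150 H, 30 2B, 5 3B, 20 HR.
probs = hit_type_probabilities(ab=500, h=150, doubles=30, triples=5, hr=20)
for outcome, p in probs.items():
    print(f"{outcome}: {p:.3f}")
```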
Let's now shift our focus to an insufficient statistic in baseball: the RBI, or run batted in. The RBI attempts to capture the contribution of the hitter to driving base-runners home, but it doesn't tell us whether those runners were batted in from 1st, 2nd, or 3rd. So there is more information in the play-by-play (the sample) than is contained in the statistic (RBI), ergo this is an insufficient statistic. To improve the RBI into a sufficient statistic, one might split it into three sub-categories (respectively RB1, RB2, RB3).
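A quick Python sketch shows what the split buys us. It assumes a hypothetical play-by-play log in which each run batted in is recorded as the base the runner started from (1, 2, or 3); that format and both hitters are invented for illustration. The batter's own run on a home run falls outside the three proposed categories, so it is left out here.

```python
from collections import Counter


def split_rbi(origin_bases):
    """Tally runs batted in by the base the runner started from.

    origin_bases: iterable of 1, 2, or 3, one entry per runner driven home
    (a hypothetical log format, not an official scoring record).
    """
    counts = Counter(origin_bases)
    unknown = set(counts) - {1, 2, 3}
    if unknown:
        raise ValueError(f"unexpected origin bases: {sorted(unknown)}")
    return {"RB1": counts[1], "RB2": counts[2], "RB3": counts[3]}


# Two hypothetical hitters with the same number of runners driven home.
hitter_a = [3, 3, 3, 3, 2, 3]  # mostly cashing in runners already on 3rd
hitter_b = [1, 1, 2, 1, 2, 3]  # often scoring runners all the way from 1st

for name, log in (("A", hitter_a), ("B", hitter_b)):
    split = split_rbi(log)
    print(f"Hitter {name}: runners driven home={sum(split.values())}, split={split}")
```

Both hitters show the same total, yet the split reveals very different contributions, which is exactly the information the single RBI number throws away.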
Now if tracking runs batted in is important, then surely we would care if a player advances a base-runner into scoring position. However, with the exception of stolen bases, the movements of base-runners are completely ignored by traditional baseball stats (surely some MLB analytics sites track this; comment below if you know of one). So we might add some statistics like first-to-second (1T2) and first-to-third (1T3) and, for the sake of completeness, second-to-third (2T3).
Logging base-runner movements with the addition of these six statistics (RB1, RB2, RB3, 1T2, 1T3, and 2T3) would go a long way in capturing the value a hitter brings to the team in terms of advancing bases. Similar logic might be applied to stolen bases, as it would be nice to know if runners were successful or caught stealing from 1st to 2nd, 2nd to 3rd, or even 3rd to home. I'm hoping to finish a demo of this data (restricted to American League games) to illustrate the potential value of these statistics by opening day, so follow or bookmark this blog and check it out early next week.