baseball probability question

10-29-2012 , 12:42 PM
Quote:
Originally Posted by BruceZ
I didn't assume that. I said that it is not reasonable to think that batter will do better against that pitcher than he does on average against the league, and to say that is true most of the time across all such pitchers and batters.
My problem is with your use of the word "better". That implies a transitive situation. At the other extreme, in a cyclic situation, it is quite reasonable to assume that since Rock beats Scissors, Paper will do better against Rock than against Scissors. If strikeouts were a matter of pure ability, then it's reasonable to assume transitivity. If strikeouts are the result of a game theory equilibrium, it's more reasonable to guess cyclicality. Actual baseball strikeouts have elements of both, which is why I was agnostic until I looked at the data.

My other problem is that the discussion of league averages introduces information not included in the original problem. Nothing in it suggests that the historical data we have is drawn from a closed group, or that we have data from that closed group.

Quote:
Originally Posted by BruceZ
When I made my original post ITT, I was not aware that you had done any empirical study to justify your assumption.
That narrows our disagreement considerably, although I'm not sure how you thought I came up with 0.7 and 0.3. It also seemed as if in prior posts you were saying your position had to be correct and that you didn't need to look at data.

Quote:
Originally Posted by BruceZ
I thought that you were saying it is reasonable for theoretical reasons, and I see no reason to think that is reasonable. If you did some empirical study before you assumed that, or if you have done some since then, that's different, and you should be able to tell us just how often that number is between x and y across all matchups. I would be surprised if it were not often outside this range.
Now I'm not sure again. Does "no reason to think that it is reasonable" mean you are agnostic, or that you would be very surprised to see data that supported it? My guess would have been in favor of the result I found, but I had no strong expectation either way.

Of course you will often get ex post results outside the range; that's a straw man. In any single at bat, the result will be 0% or 100%, which are outside any reasonable range for an ex ante probability. The way I tested things was with averages over large numbers of matchups, as a proxy for ex ante probabilities. If the top 10% of batters in terms of strikeout frequency struck out more often when facing pitchers in the top 10% of strikeout frequency than when facing pitchers in the bottom 10% of strikeout frequency, that would be evidence against my claim.

Quote:
Originally Posted by BruceZ
We also need to know the conditions under which we should not make that assumption because you have already identified a matchup where it would fail very badly when the best pitcher faces the worst batter.
I identified a single matchup in which it failed, which covered only 20 at bats. And while Verlander would be a candidate for best pitcher, Adam Dunn is closer to the best batter than the worst.

The conditions under which we should not make the assumption are when we have more data, as in major league baseball; or when the situation is competitive and transitive, as if we were talking about RBIs per plate appearance rather than strikeouts.

My original statistics were based on some data I had handy: postseason 2011 stats for batter/pitcher matchups, along with regular season frequencies. That was a small and nonrandom sample, and my analysis was broad brush. I regressed actual strikeouts on two predictors: the number of opportunities times the pitcher's regular season strikeout frequency, and the number of opportunities times the batter's regular season strikeout frequency. I got a positive intercept (but with a 95% confidence interval that included zero) and coefficients that were within two standard errors of adding up to one.

Here is a better study that I just did. It doesn't answer the original question, because I don't have enough batter/pitcher matchup data. But I think it's the same theoretical situation.

Suppose a batter has a career strikeout fraction of X, excluding the current year. Suppose the league average strikeout fraction this year, excluding this batter, is Y. What do we expect the batter's strikeout fraction to be this year? I think you would argue that we should at least have a strong bias toward assuming that if X and Y are above the average for all batters over all years this batter played (excluding the current year), then we expect a strikeout fraction higher than both X and Y.

The results show otherwise. Using all data since 1946, the regression gives 0.874*X + 0.091*Y + 0.767, so the predicted number of strikeouts this year for this batter is (0.874 times his lifetime fraction excluding this year, plus 0.091 times this year's average excluding this batter) times his number of plate appearances, plus 0.767. (Regressing on the fractions directly, instead of on predicted total strikeouts, puts far too much weight on the batters with few plate appearances.) The standard errors are small, 0.003, 0.002 and 0.044 respectively, and the R^2 is 94%. So we can assert with some confidence that, at least over the entire population, the probability of a strikeout is between X and Y, because the constant term is significantly positive and the coefficients on X and Y add up to significantly less than one.
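For concreteness, here is a minimal sketch of how that regression is set up. The column names and the five sample rows below are invented purely for illustration; only the quoted coefficients come from the actual fit.

import numpy as np
import statsmodels.api as sm

# One row per batter/year (illustrative numbers only):
pa         = np.array([600, 450, 520, 300, 680])        # plate appearances this year
x_lifetime = np.array([0.12, 0.25, 0.18, 0.30, 0.10])   # career K fraction excluding this year (X)
y_league   = np.array([0.15, 0.16, 0.15, 0.17, 0.14])   # league K fraction this year excluding this batter (Y)
actual_k   = np.array([80, 115, 95, 92, 72])             # actual strikeouts this year

# Regress actual strikeouts on the two predicted totals, with an intercept.
predictors = sm.add_constant(np.column_stack([x_lifetime * pa, y_league * pa]))
fit = sm.OLS(actual_k, predictors).fit()
print(fit.params)  # intercept, coefficient on X*PA, coefficient on Y*PA

# Applying the coefficients quoted above to one batter:
print(0.874 * 0.18 * 520 + 0.091 * 0.15 * 520 + 0.767)  # predicted strikeouts, about 89.7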

The table below shows data for six subpopulations. The bottom 1% of batter/year combinations in terms of strikeout frequency (that is, the 1% of batters with the lowest lifetime strikeout frequency in the years with the lowest strikeout frequency) were predicted to strike out 4.1% of the time based on the batter's lifetime frequency and 11.6% of the time based on the year, and in fact struck out 4.7% of the time, in between the two predictions. For each subpopulation, the actual frequency is in between the two predictions. I looked at lots of other subgroups and cannot find any results outside the two predictions. I'm sure you could, either by data mining or by picking very small subgroups, but overall I think the data are pretty clear.

Group        X (batter lifetime)   Y (league year)   Z (actual)
Bottom 1%    4.1%                  11.6%             4.7%
Bottom 10%   7.7%                  13.3%             7.9%
Bottom 50%   13.1%                 14.6%             13.2%
Top 50%      27.6%                 15.7%             26.6%
Top 10%      53.7%                 15.3%             41.5%
Top 1%       68.1%                 16.1%             46.8%
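A trivial check of the claim that the actual rate falls between the two predictions in every row (numbers copied straight from the table above):

rows = {
    "Bottom 1%":  (0.041, 0.116, 0.047),
    "Bottom 10%": (0.077, 0.133, 0.079),
    "Bottom 50%": (0.131, 0.146, 0.132),
    "Top 50%":    (0.276, 0.157, 0.266),
    "Top 10%":    (0.537, 0.153, 0.415),
    "Top 1%":     (0.681, 0.161, 0.468),
}
for name, (x, y, z) in rows.items():
    # x = prediction from batter's lifetime rate, y = prediction from the year, z = actual
    print(name, min(x, y) <= z <= max(x, y))   # prints True for every subgroup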

Now I have a question for you. I could do the same study using pitchers' lifetime statistics instead of batters'. What do you guess I would find? The same results, or a regression intercept less than zero with coefficients on X and Y that added up to more than one?
11-04-2012 , 02:03 AM
What you found there is real, but it only addresses a small part of the point. To start with a poker analogy, if you assume a single-peaked talent distribution (which is fine for MLB, maybe not for poker, but run with it) and take everybody who played 1000+ hands at some level on some network during October and look at their first 1000 hands, the worst results are most likely to be bad players who got unlucky. It's possible, but much less likely, that you have on-average good players who got super unlucky. So if you ran those bottom feeders against a marginally tougher lineup for a second 1000 hands, you'd expect them to do better because their "regression" to normal expected luck would more than offset the marginal increase in difficulty.
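A quick Monte Carlo sketch of that selection effect (every parameter here is made up purely for illustration): draw true win rates from a single-peaked distribution, add luck over 1000 hands, take the worst observed decile, and look at their next 1000 hands.

import numpy as np

rng = np.random.default_rng(0)
n_players = 10_000
true_rate = rng.normal(0.0, 3.0, n_players)   # assumed true win rates, bb/100
luck_sd = 80.0 / np.sqrt(10)                  # assumed 80 bb/100 per-100-hand noise, averaged over 1000 hands

first_1000  = true_rate + rng.normal(0.0, luck_sd, n_players)
second_1000 = true_rate + rng.normal(0.0, luck_sd, n_players)

worst = first_1000 <= np.quantile(first_1000, 0.10)   # bottom decile by observed results
print(first_1000[worst].mean())    # far below zero: mostly bad luck plus somewhat bad players
print(true_rate[worst].mean())     # their true talent is only mildly below average
print(second_1000[worst].mean())   # next block regresses most of the way back toward true talent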

And you found the same thing in baseball: the variation in player K% skill is much, much larger than the year-to-year variation in league average K%, and the high K% players are likely to be ones whose true talent is to strike out a lot and who also ran bad. So when you marginally adjust the league average (which isn't perfect either, baseball is annoying), the biggest effect is going to be regression to normal luck (or more average pitchers... or more average umpires... or more average true talent... baseball is annoying). So it's not enough to say a high-K batter against an above-average K pitcher will lead to a higher rate: the pitcher needs to be enough above average at striking people out to overcome the regression inherent in any measured batter and pitcher K%s, and you can't pick that up without counting specific batter-pitcher matchup results.
11-04-2012 , 01:51 PM
I agree, regression toward the mean is a part of this.

To make it more general, suppose I have data on some score for pairs of people. If I tell you person A had an average score of 1 in his previous pairs, and person B had an average score of 2, your first guess is likely to be that A and B together will produce a score between 1 and 2. With no information to weight A or B more heavily, you might start at 1.5.

Now suppose I add the information that the average for all pairs is zero. This could affect your guess in two ways. First is regression toward the mean: both A and B are more likely to have scored above their long-term means in the past than below, which would tend to move your guess lower.

On the other hand, if we know both A and B are above-mean scorers, we might expect their combination to add their influences rather than average them, leading us to guess something like 3, or in any event to move our guess higher.

To come up with a reasonable guess, it helps to use knowledge about the specific application.
11-04-2012 , 02:03 PM
The sabermetric formula posted earlier in this thread produces results that are either above or below both x and y when the pitcher and batter are on the same side of the league average.

Quote:
Originally Posted by whosnext
Bill James introduced his "Log 5" formula in one of his early Baseball Abstracts to estimate one team's expected winning pct against another team. The formula has been expanded to apply to situations which are not anchored at .500 (like team winning pcts). You can search for Tom Tango or TangoTiger or the Odds Ratio Method.

Basically the formula is:

R = (B*P/L) / ( (B*P/L) + (1-B)*(1-P)/(1-L) )

R Result
B Batter Stat
P Pitcher Stat
L League Avg
With P=0.1, B=0.15, L=0.2, the result is 0.073. Same if you reverse P and B.

With P=0.25, B=0.3, L=0.2, the result is 0.36. Same if you reverse P and B.
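A direct check of those two numbers against the formula as quoted (the function name here is mine):

def log5(b, p, lg):
    # R = (B*P/L) / ((B*P/L) + (1-B)*(1-P)/(1-L))
    num = b * p / lg
    return num / (num + (1 - b) * (1 - p) / (1 - lg))

print(round(log5(0.15, 0.10, 0.2), 3))  # 0.073
print(round(log5(0.10, 0.15, 0.2), 3))  # 0.073 again with P and B reversed
print(round(log5(0.30, 0.25, 0.2), 2))  # 0.36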


Quote:
Originally Posted by AaronBrown
That narrows our disagreement considerably, although I'm not sure how you thought I came up with 0.7 and 0.3. It also seemed as if in prior posts you were saying your position had to be correct and that you didn't need to look at data.
The way I read it, you started by assuming that the result was between x and y for logical reasons, and then looked at the data to determine the variation due to the pitcher and batter in order to come up with 2 numbers that fit that assumption.


Quote:
Now I'm not sure again. Does "no reason to think that it is reasonable" mean you are agnostic, or that you would be very surprised to see data that supported it? My guess would have been in favor of the result I found, but I had no strong expectation either way.
It means that it is reasonable to think just the opposite is true, based on effects that you describe as transitivity, and now the sabermetric formula also seems to support this. So if it is reasonable to think that it could be outside x and y much or even most of the time, then it can't be reasonable to assume that it is always between x and y for logical reasons, and yes I would be surprised if that turned out to be a very good assumption.


Quote:
Suppose a batter has a career strikeout fraction of X, excluding the current year. Suppose the league average strikeout fraction this year, excluding this batter, is Y. What do we expect the batter's strikeout fraction to be this year? I think you would argue that we should at least have a strong bias toward assuming that if X and Y are above the average for all batters over all years this batter played (excluding the current year), then we expect a strikeout fraction higher than both X and Y.
I don't think it makes sense to compute X over an entire career and Y over a single season. I think we should compute them both over the same season.


Quote:
If the top 10% of batters in terms of strikeout frequency struck out more often when facing pitchers in the top 10% of strikeout frequency than when facing pitchers in the bottom 10% of strikeout frequency, that would be evidence against my claim.
Did they? For a single season?
11-04-2012 , 09:42 PM
Now you're hurting my feelings. When I posted essentially the same formula for the case L = 0.5 in another thread it was nonsense. But when it supports your view it's a sabermetric formula with the authority of Bill James.

When L does not equal 0.5, the formula is silly. If L is very small, it gives an answer near 1, regardless of P and B. If L is near 1, it gives an answer near zero, again regardless of P and B. Or set L = 0.05, P = 0.4 and B = 0.4. The largest reasonable guess in the absence of a strong model seems to me to be 0.75 (adding the two deviations to the league average). The formula gives about 0.89.

The effect of the formula is clearer if you write P = L + dP and B = L + dB. Then you get a prediction of:

(L + dP + dB + dP*dB/L) / [1 + dP*dB*(1/L + 1/(1-L))]

If dP and dB are small you can ignore the dP*dB terms, so it's just an additive model predicting L + dP + dB. As dP and dB get larger, the second-order terms kick in to keep the answer between 0 and 1.
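A numeric check that the rewritten form agrees with the original formula, including the L = 0.05, P = B = 0.4 case above (a sketch only; the function names are mine):

def original(b, p, lg):
    num = b * p / lg
    return num / (num + (1 - b) * (1 - p) / (1 - lg))

def rewritten(lg, dp, db):
    # (L + dP + dB + dP*dB/L) / [1 + dP*dB*(1/L + 1/(1-L))]
    return (lg + dp + db + dp * db / lg) / (1 + dp * db * (1 / lg + 1 / (1 - lg)))

for lg, p, b in [(0.2, 0.10, 0.15), (0.2, 0.25, 0.30), (0.05, 0.40, 0.40)]:
    print(round(original(b, p, lg), 4), round(rewritten(lg, p - lg, b - lg), 4))
# The last case prints about 0.894 for both, versus the additive guess L + dP + dB = 0.75.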

So let's go back to the reasonable case when L = 0.5 (not that the league average is 0.5; just set L = 0.5 in the formula so you can ignore it). Then the formula gives the exact answer to the following question: suppose the pitcher flips a biased coin with probability of strikeout P and the batter flips an independent biased coin with probability of strikeout B, and they keep flipping until their coins agree; what is the probability they agree on strikeout?
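A quick simulation of that coin-flipping story (with arbitrary example probabilities) confirms it matches the closed form B*P / (B*P + (1-B)*(1-P)), which is the formula with L = 0.5:

import random

def agree_on_strikeout(b, p, trials=200_000, seed=1):
    random.seed(seed)
    strikeouts = 0
    for _ in range(trials):
        while True:
            batter_says_k = random.random() < b
            pitcher_says_k = random.random() < p
            if batter_says_k == pitcher_says_k:   # keep flipping until the coins agree
                strikeouts += batter_says_k        # count the trials that agree on strikeout
                break
    return strikeouts / trials

b, p = 0.30, 0.25
print(agree_on_strikeout(b, p))               # simulation, about 0.125
print(b * p / (b * p + (1 - b) * (1 - p)))    # closed form: 0.125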

Is that a reasonable model for strikeouts? It's not obviously bad; there are things that work like that, which is why I suggested it in another context. However, my intuition and data suggested to me that strikeouts don't work like that. They are not the product of a pitcher's ability to strike batters out and a batter's skill at avoiding strikeouts, but rather a game theory optimum in which a pitcher can use his skill either to get strikeouts or to avoid extra base hits, and the batter can use his skill either to get on base or to try for extra base hits. In this kind of situation, you don't expect additive effects.

The reason I gave to guess that the strikeout rate would be between P and B is that no information was given about a baseline. Adding a league average, plus the stipulations that P and B were measured in league play and that we're trying to predict a league at-bat, obviously changes the appropriate estimate. However, empirically, the answer still seems to be between P and B, at least for broad groups of players.

Why doesn't it make sense to compare strikeout predictions based on a batter's career excluding a season, and one based on the league fraction excluding the batter? We can't use a batter's single year for both, because we're using the batter's single year to test the prediction. I don't say this gives the same answer as predictions based on pitcher's rate and batter's rate, but I don't have a good data set for that. I only claim that it's the same theoretical situation. You have two estimates of a strikeout percentage and the question is how to combine them. The data suggest you get an answer between the two estimates, even if both estimates are above or both estimates are below the overall average. Why would you use a different theory for this case than for the pitcher/batter case?

I only have a small and non-random sample of data for pitcher/batter matchups, but that data is consistent with strikeout frequencies intermediate between the pitchers' frequencies and the batters' frequencies. Still, I have no strong opinion that better data would support that point. It was my first guess only because no baseline information was available.