Quote:
Originally Posted by BruceZ
I didn't assume that. I said that it is not reasonable to think that the batter will do better against that pitcher than he does on average against the league, and to say that is true most of the time across all such pitchers and batters.
My problem is with your use of the word "better". That implies a transitive situation. At the other extreme, in a cyclic situation, it is quite reasonable to assume that, since Rock beats Scissors, Paper will do better against Rock than against Scissors. If strikeouts were a matter of pure ability, it would be reasonable to assume transitivity. If strikeouts are the result of a game-theory equilibrium, it's more reasonable to guess cyclicality. Actual baseball strikeouts have elements of both, which is why I was agnostic until I looked at data.
My other problem is that discussing league averages introduces information not included in the original problem. Nothing in the problem suggests that the historical data we have is drawn from a closed group, or that we have data from that closed group.
Quote:
Originally Posted by BruceZ
When I made my original post ITT, I was not aware that you had done any empirical study to justify your assumption.
That narrows our disagreement considerably, although I'm not sure how you thought I came up with 0.7 and 0.3. It also seemed as if, in prior posts, you were saying your position had to be correct and that you didn't need to look at data.
Quote:
Originally Posted by BruceZ
I thought that you were saying it is reasonable for theoretical reasons, and I see no reason to think that is reasonable. If you did some empirical study before you assumed that, or if you have done some since then, that's different, and you should be able to tell us just how often that number is between x and y across all matchups. I would be surprised if it were not often outside this range.
Now I'm not sure again. Does "no reason to think that is reasonable" mean you are agnostic, or that you would be very surprised to see data that supported it? My guess would have been in favor of the result I found, but I had no strong expectation either way.
Of course you will often get ex post results outside the range; that's a straw man. In any single at bat, the result will be 0% or 100%, which is outside any reasonable range for an ex ante probability. The way I tested things was with averages over large numbers of matchups, as a proxy for ex ante probabilities. If the top 10% of batters in terms of strikeout frequency struck out more often when facing pitchers in the top 10% of strikeout frequency than when facing pitchers in the bottom 10%, that would be evidence against my claim.
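For concreteness, here is a minimal sketch of that subgroup test, assuming a hypothetical pandas DataFrame pa of plate appearances with columns batter, pitcher, and struck_out (0/1); the column names and data layout are my assumptions, not the actual dataset.
Code:
import pandas as pd

def decile_test(pa: pd.DataFrame) -> tuple:
    # Strikeout frequency per batter and per pitcher over all plate appearances.
    bat_rate = pa.groupby("batter")["struck_out"].mean()
    pit_rate = pa.groupby("pitcher")["struck_out"].mean()

    # Top 10% of batters, and top/bottom 10% of pitchers, by strikeout frequency.
    top_bat = bat_rate[bat_rate >= bat_rate.quantile(0.9)].index
    top_pit = pit_rate[pit_rate >= pit_rate.quantile(0.9)].index
    bot_pit = pit_rate[pit_rate <= pit_rate.quantile(0.1)].index

    # Strikeout frequency of high-strikeout batters against each pitcher group.
    sub = pa[pa["batter"].isin(top_bat)]
    vs_top = sub[sub["pitcher"].isin(top_pit)]["struck_out"].mean()
    vs_bot = sub[sub["pitcher"].isin(bot_pit)]["struck_out"].mean()
    return vs_top, vs_bot  # transitivity predicts vs_top > vs_bot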
Quote:
Originally Posted by BruceZ
We also need to know the conditions under which we should not make that assumption because you have already identified a matchup where it would fail very badly when the best pitcher faces the worst batter.
I identified a single matchup in which it failed, which covered only 20 at bats. And while Verlander would be a candidate for best pitcher, Adam Dunn is closer to the best batter than the worst.
The conditions under which we should not make the assumption are when we have more data, as in major league baseball; or when the situation is competitive and transitive, as if we were talking about RBIs per plate appearance rather than strikeouts.
My original statistics were based on some data I had handy: postseason 2011 stats for batter/pitcher matchups, along with regular-season frequencies. The sample was small and nonrandom, and my analysis was broad-brush. I regressed actual strikeouts on two predictors: the number of opportunities times the pitcher's regular-season strikeout frequency, and the number of opportunities times the batter's regular-season strikeout frequency. I got a positive intercept (though with a 95% confidence interval that included zero) and coefficients that were within two standard errors of adding up to one.
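A sketch of that regression, assuming a hypothetical DataFrame m of matchups with columns ab (opportunities), k (actual strikeouts), and pit_freq and bat_freq (regular-season strikeout frequencies); the names are mine, not the original dataset's.
Code:
import pandas as pd
import statsmodels.api as sm

def matchup_regression(m: pd.DataFrame):
    # Two predictors: strikeouts predicted from the pitcher's rate and
    # from the batter's rate, each scaled by the number of opportunities.
    X = sm.add_constant(pd.DataFrame({
        "from_pitcher": m["ab"] * m["pit_freq"],
        "from_batter": m["ab"] * m["bat_freq"],
    }))
    fit = sm.OLS(m["k"], X).fit()
    # Reported result: positive intercept (CI including zero), slope
    # coefficients within two standard errors of summing to one.
    return fit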
Here is a better study that I just did. It doesn't answer the original question, because I don't have enough batter/pitcher matchup data. But I think it's the same theoretical situation.
Suppose a batter has a career strikeout fraction of X, excluding the current year. Suppose the league-average strikeout fraction this year, excluding this batter, is Y. What do we expect the batter's strikeout fraction to be this year? I think you would argue that we should at least have a strong bias toward assuming that if X and Y are both above the average for all batters over all years this batter played (excluding the current year), then we expect a strikeout fraction higher than both X and Y.
The results show otherwise. Using all data since 1946, the fitted prediction for a batter's strikeout total this year is 0.874*(PA*X) + 0.091*(PA*Y) + 0.767, where PA is his number of plate appearances: 0.874 times the strikeouts implied by his lifetime fraction excluding this year, plus 0.091 times the strikeouts implied by this year's league average excluding this batter, plus a constant 0.767. (I regressed on totals rather than on the fractions directly, because regressing on fractions puts far too much weight on batters with few appearances.) The standard errors are small, 0.003, 0.002 and 0.044 respectively, and the R^2 is 94%. So we can assert with some confidence that, at least over the entire population, the probability of a strikeout is between X and Y: the constant term is significantly positive, and the coefficients on X and Y add up to significantly less than one, so the prediction is shrunk from X toward Y rather than pushed beyond both.
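The same setup in code, assuming a hypothetical DataFrame d with one row per batter/year and columns pa (plate appearances), k (strikeouts that year), x (career fraction excluding the year) and y (league fraction excluding the batter); the reported coefficients came out to 0.874, 0.091 and 0.767.
Code:
import pandas as pd
import statsmodels.api as sm

def career_vs_league(d: pd.DataFrame):
    # Regress the strikeout total, not the fraction, so that batters with
    # few plate appearances don't dominate the fit.
    X = sm.add_constant(pd.DataFrame({
        "from_career": d["pa"] * d["x"],  # strikeouts implied by career rate
        "from_league": d["pa"] * d["y"],  # strikeouts implied by league rate
    }))
    return sm.OLS(d["k"], X).fit()  # reported fit: 0.874, 0.091, const 0.767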
The table below shows data for six subpopulations (a sketch of how each row can be computed follows the table). The bottom 1% of the batter/year combinations in terms of strikeout frequency, that is, the 1% of the batters with the lowest lifetime strikeout frequency in the years with the lowest strikeout frequency, were predicted to strike out 4.1% of the time based on the batter's lifetime frequency, 11.6% of the time based on the year, and in fact struck out 4.7% of the time, between the two predictions. For each subpopulation, the actual frequency is between the two predictions. I looked at lots of other subgroups and cannot find any results outside the two predictions. I'm sure you could, either by data mining or by picking very small subgroups, but overall I think the data are pretty clear.
Group        X (batter prediction)  Y (year prediction)  Z (actual)
Bottom 1%    4.1%                   11.6%                4.7%
Bottom 10%   7.7%                   13.3%                7.9%
Bottom 50%   13.1%                  14.6%                13.2%
Top 50%      27.6%                  15.7%                26.6%
Top 10%      53.7%                  15.3%                41.5%
Top 1%       68.1%                  16.1%                46.8%
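And a sketch of how a row of the table can be computed from the same hypothetical DataFrame d; the exact ranking rule for the subgroups is the one described above, and here I rank batter/years by the career fraction x as an approximation.
Code:
import pandas as pd

def subgroup_row(d: pd.DataFrame, lo: float, hi: float) -> dict:
    # Slice the batter/years whose career fraction x falls between the lo
    # and hi quantiles, then compare PA-weighted predictions with actuals.
    q_lo, q_hi = d["x"].quantile([lo, hi])
    g = d[(d["x"] >= q_lo) & (d["x"] <= q_hi)]
    pa = g["pa"].sum()
    return {
        "X": (g["pa"] * g["x"]).sum() / pa,  # prediction from batter history
        "Y": (g["pa"] * g["y"]).sum() / pa,  # prediction from the year
        "Z": g["k"].sum() / pa,              # actual strikeout fraction
    }

# e.g. subgroup_row(d, 0.00, 0.01) for the bottom 1%,
#      subgroup_row(d, 0.90, 1.00) for the top 10%.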
Now I have a question for you. I could do the same study using pitchers' lifetime statistics instead of batters'. What do you guess I would find: the same results, or a regression intercept less than zero with coefficients on X and Y that add up to more than one?