I have some RSIF analysis here for potentially improving DNRS's approach. Since pitchers bat in the NL, NL full-game totals run about half a run lower on average than AL totals.
However, I parsed all of MLB's play-by-play data, and pitchers almost never bat in the first inning.
So NL full-game totals are deflated for a reason that doesn't really affect the 1st inning. How do we adjust the table (tm) for this? I tried the following approaches (rough sketch after the list):
1) For NL games, add 0.5 runs to the total and build a new league-adjusted table (tm).
2) Use two separate tables, one for AL and one for NL.
3) Use two tables, but if a particular total has too few samples, revert to the original combined table (tm).
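In case it helps to see it concretely, here is a rough sketch of how I think about the three options. This is not the actual DNRS code: the record fields (total, league, yrfi), the idea that the table (tm) maps a total to P(run in the 1st inning), and the MIN_SAMPLES cutoff are all placeholders I made up for illustration.

[code]
from collections import defaultdict

MIN_SAMPLES = 100  # hypothetical cutoff for approach 3's fallback

def build_table(games):
    """Map each total to (empirical P(run in 1st inning), sample size)."""
    counts = defaultdict(lambda: [0, 0])  # total -> [n games, n with a 1st-inning run]
    for g in games:
        counts[g["total"]][0] += 1
        counts[g["total"]][1] += g["yrfi"]
    return {t: (runs / n, n) for t, (n, runs) in counts.items()}

def prob_approach_1(games, total, league):
    """1) Bump NL totals by 0.5 and build one league-adjusted table."""
    adjusted = [dict(g, total=g["total"] + (0.5 if g["league"] == "NL" else 0.0))
                for g in games]
    key = total + (0.5 if league == "NL" else 0.0)
    return build_table(adjusted).get(key, (None, 0))[0]

def prob_approach_2(games, total, league):
    """2) Separate AL and NL tables."""
    table = build_table([g for g in games if g["league"] == league])
    return table.get(total, (None, 0))[0]

def prob_approach_3(games, total, league):
    """3) League table, but fall back to the combined table on thin samples."""
    p, n = build_table([g for g in games if g["league"] == league]).get(total, (None, 0))
    if n >= MIN_SAMPLES:
        return p
    return build_table(games).get(total, (None, 0))[0]
[/code]

The only real design decision in approach 3 is where to put the fallback cutoff; the 100 above is just a round number.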
I ran all three approaches and compared them by mean log loss.
This is on data since 2011, only ever using data up to the particular game date. I start the log-loss computation from 2014 onward so the table has at least 3 seasons of data behind it.
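For reference, this is roughly the walk-forward loop I mean (again a sketch with made-up names; prob_fn is one of the approaches above, and a real run would update the table incrementally instead of rebuilding it for every game):

[code]
import math

def mean_logloss(games, prob_fn, score_from_year=2014):
    """games: sorted by date; prob_fn: one of the table approaches above."""
    total_ll, n_scored = 0.0, 0
    for i, g in enumerate(games):
        if g["date"].year < score_from_year:
            continue  # 2011-2013 only seed the table, they aren't scored
        history = games[:i]  # only games strictly earlier in the sorted list
        p = prob_fn(history, g["total"], g["league"])
        if p is None:
            continue  # this total never appeared in the history; skip it
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep log() finite
        total_ll -= g["yrfi"] * math.log(p) + (1 - g["yrfi"]) * math.log(1.0 - p)
        n_scored += 1
    return total_ll / n_scored
[/code]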
It looks like the third approach is best. Interestingly, adding 0.5 runs slightly outperforms the two-table approach. The differences in log loss are fairly small, though, so maybe I'm just being fooled by randomness. Any thoughts, critiques, suggestions?
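On the randomness worry: one cheap sanity check is to keep the per-game log losses for two approaches and run a paired bootstrap on the differences. Hypothetical helper, nothing from DNRS:

[code]
import random

def paired_bootstrap(ll_a, ll_b, n_boot=10_000, seed=0):
    """ll_a, ll_b: per-game log losses for two approaches on the same games.
    Returns the fraction of resamples where A's mean log loss beats B's."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ll_a, ll_b)]
    wins = 0
    for _ in range(n_boot):
        resample = [diffs[rng.randrange(len(diffs))] for _ in range(len(diffs))]
        if sum(resample) / len(resample) < 0:  # A lower = better
            wins += 1
    return wins / n_boot
[/code]

If the apparently better approach wins in the large majority of resamples (say 95%+), the gap is probably real; anything near 50/50 is noise.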