First off, I'm impressed with how thorough and direct this thread has been, despite the contentious and confusing nature of the subject. Really, the sniping is no worse than in academia.
Secondly, I would wager that the only thing more impressive than what variance can explain is what it absolutely cannot. A deviation from EV like this, in a sample this size, is getting pretty close to that line by my back-of-the-envelope estimate.
Third and most importantly, this powerful assertion should be addressed critically if we want to explain it away. I would propose that several factors besides a bastard website might be possible explanations. Moreover, I propose a few experiments for the obvious ones.
a) Gross error. The calculations ignore something that is generally trivial but will add up over a huge sample size, for example the effect of the rake. I hope the calculations include it, and I doubt this is the cause, but I want to make sure.
b) Failure to identify complete information. The calculation deals with samples where there is more complete information than the software is utilizing. For example, players enter many pots and leave before showdown. Their hands are within a certain range predicted by their VPIP and PFR. Those cards are no longer coming. This COULD explain the effect and there is an easy way to test it.
First, split your sample hands into three groups.
1) Multiple people enter the pot and all go to showdown.
2) Multiple people enter the pot and only two go to showdown.
3) Two people contest the pot the whole way. There must be no limpers or raisers who fold. These hands must be heads up from the first raise onward.
If incomplete information is causing the effect, it will be observed only in group two. If it is observed in all three, that is more worrisome.
Whether or not we observe it in the other groups, try to explain it in group two, the sample with the largest expected effect.
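The three-way split above can be sketched in a few lines of Python. This is only an illustration: the `Hand` record and its two fields are hypothetical, not the format of any tracking software, and I am reading "multiple" as three or more players so that the groups do not overlap.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hand:
    # Hypothetical fields; map them onto whatever your hand histories provide.
    entered: int      # players who limped, called, or raised at any point
    at_showdown: int  # players who showed down

def classify(hand: Hand) -> Optional[int]:
    """Return group 1, 2, or 3 as described above, or None if the hand fits none."""
    if hand.at_showdown < 2:
        return None   # no showdown, so nothing to compare against EV
    if hand.entered == 2 and hand.at_showdown == 2:
        return 3      # heads up the whole way, nobody entered and folded
    if hand.entered > 2 and hand.at_showdown == hand.entered:
        return 1      # multiway, everyone who entered showed down
    if hand.entered > 2 and hand.at_showdown == 2:
        return 2      # multiway entry, two at showdown, the rest folded
    return None       # e.g. four entered, three showed down
```

Hands that return `None` are left out of all three groups rather than forced into one, which keeps the comparison between groups clean.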
Apply prior probabilities to these hands based on the estimated range of the non-showdown players. What fraction of his two cards do you expect to be aces, kings and so on, given his PFR and VPIP from that position? Based on the data you have for the folded opponents' play ask the following:
Are hands that contain cards that improve your hands and not your opponents' (and vice versa) depleted/abundant in the deck that deals the board? How does it change expectation?
If the probability of card depletion is included, does your sample regress towards expectation? I suspect that 50 buy ins difference over this many hands is a trivial amount that card removal could account for.
Furthermore, the fact that opponents DO NOT play hands also alters the likelihood of valuable cards remaining in the deck or collecting in the muck. I expect the result there to be more subtle, but it could still be there. So try that too, and see if it regresses further.
This is a likely source of error that would be grouped more broadly under section c.
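To make the card-removal idea under b) concrete, here is a minimal Python sketch. It makes strong simplifying assumptions that are mine and not part of the original calculation: cards are tracked by rank only, there is a single folded player, and his folding tendency is a hypothetical `fold_weight` function standing in for a range estimated from VPIP and PFR. It computes the chance that the next card off the deck is an ace, given that this player folded.

```python
from itertools import combinations

RANKS = "23456789TJQKA"

def unseen_ranks(known):
    """Ranks left in a 52-card deck after removing the known cards (rank characters)."""
    deck = [r for r in RANKS for _ in range(4)]
    for r in known:
        deck.remove(r)
    return deck

def ace_density(unseen, fold_weight):
    """P(next dealt card is an ace | one player folded two of the unseen cards).

    fold_weight(a, b) is the relative likelihood (an assumed input) that a
    player holding ranks a and b folds.
    """
    n_aces = unseen.count("A")
    total = 0.0
    aces = 0.0
    for a, b in combinations(unseen, 2):
        w = fold_weight(a, b)
        aces_left = n_aces - (a == "A") - (b == "A")
        total += w
        aces += w * aces_left / (len(unseen) - 2)
    return aces / total

# Example: KK vs QQ all in, one other player folded.
unseen = unseen_ranks("KKQQ")
uniform = lambda a, b: 1.0                               # folds tell us nothing
never_folds_ace = lambda a, b: 0.0 if "A" in (a, b) else 1.0
print(ace_density(unseen, uniform), ace_density(unseen, never_folds_ace))
```

With uninformative folds the ace density is 4/48, about 8.3%. In the extreme case where the folder never folds an ace it rises to 4/46, about 8.7%. Real ranges sit somewhere in between, and the question is whether shifts of that size, summed over the whole sample, move the result back towards expectation.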
c) False discovery. Statistics work best when they're applied in isolation from confounders. Unfortunately myriad factors, not the least of which is multiple testing (an easy mistake to make even for software designers and professional statisticians), can create a test result which spuriously claims significance. In false discovery, these results can be explained away quite rapidly if we identify where the tests were misapplied. Frighteningly, even high-profile papers in my field, biomedical research, commit this type of fallacy all the time.
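For the multiple-testing part of c), one standard correction is the Benjamini-Hochberg procedure, which controls the false discovery rate across a batch of tests. I am naming it as one reasonable option, not as what any tracking software does. A minimal sketch, where the p-values in the usage line are made up purely for illustration:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the test survives FDR control at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Made-up p-values: three are below 0.05 on their own.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.2, 0.5]))
```

In that made-up batch only the 0.001 result survives, even though three of the five would look significant if each were tested in isolation.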
d) There's also the chance that none of the errors above explains it, in whole or in part, and you are just far out in the margins. Some poor SOB on 2p2 is bound to be. Life sucks that way.
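The "somebody is bound to be" point in d) can be put in rough numbers. The sketch below assumes results are approximately normal and that players are independent of one another. Both are simplifications, and the player count in the usage line is a placeholder, not a real figure.

```python
import math

def two_sided_tail(z):
    """P(|Z| >= z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))

def prob_someone_that_unlucky(z, n_players):
    """P(at least one of n independent players lands z or more sigma from expectation)."""
    p = two_sided_tail(z)
    return 1.0 - (1.0 - p) ** n_players

# Placeholder inputs: a 3-sigma result among 10,000 players checking their stats.
print(two_sided_tail(3.0), prob_someone_that_unlucky(3.0, 10_000))
```

A result that is rare for any one player becomes close to certain for somebody once enough people look, which is the same multiple-testing trap as in c) seen from the other side.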
Good luck. I hope you get to the bottom of it.