I guess I'll hop in here and answer a few questions and clarify a few points too.
First, the 95% threshold was not chosen arbitrarily. That's the standard generally used in science, and it's the standard used in the ACPC (Annual Computer Poker Competition). I personally would have avoided the term "statistical tie", which does NOT mean the same thing as "tie", and would instead have opted for the equivalent term "not statistically significant", but I didn't write the press release.
Earlier I mentioned that the pros would have needed to win by 10.35 BB/100 for statistical significance. That figure does correctly account for the hands being mirrored. We knew going in that reaching statistical significance would be tough, but it was definitely possible for one side or the other to hit that threshold. Based on experiments run before the competition, we estimated the needed win rate would be ~8.5 BB/100. However, it's impossible to say what the threshold actually is until the competition ends, because it depends on how the pros and the bot play. The bot and the humans ended up playing very aggressively, and very differently from each other, which likely pushed up the variance.
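To make that dependence concrete, here's a minimal sketch (not our actual analysis code) of how a threshold like 10.35 BB/100 falls out of the observed per-hand standard deviation and the number of hands played. The ~15 BB per-hand standard deviation below is an assumption I've picked to roughly reproduce that figure given the ~80,000 hands played; the function name is just for illustration.

```python
import math

def significance_threshold_bb100(per_hand_std_bb, num_hands, z=1.96):
    """Win rate (BB/100) needed for 95% significance (two-sided z-test).

    The standard error of the mean per-hand result shrinks with
    sqrt(num_hands), so the threshold is z * std / sqrt(n),
    rescaled to BB per 100 hands.
    """
    return z * per_hand_std_bb / math.sqrt(num_hands) * 100

# Illustrative only: a per-hand standard deviation of ~15 BB (assumed)
# over ~80,000 hands gives a threshold close to the 10.35 BB/100 quoted.
print(significance_threshold_bb100(per_hand_std_bb=15.0, num_hands=80_000))
# -> ~10.4
```

This is why the threshold can only be computed after the fact: the per-hand standard deviation is a property of how both sides actually played, and aggressive play inflates it.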
Equity chopping did reduce the variance, though not by much (there was some debate about whether it would help at all, since mirrored hands can play out very differently when there's an equity chop). Without the equity chop, the needed threshold would have been 10.72 BB/100 (and, interestingly, the pros' overall win rate would have dropped to 7.0 BB/100).
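For anyone unfamiliar with equity chopping: when both players are all in before the board is fully dealt, each is credited their equity share of the pot rather than the all-or-nothing result of the actual runout. A toy sketch of the scoring rule (illustrative, not our implementation):

```python
def equity_chop_score(pot_bb, hero_equity):
    """Credit the hero their equity share of the pot at the all-in point.

    hero_equity: hero's probability of winning the pot when the money
    goes in (e.g. from enumerating the remaining board cards).
    Replacing the all-or-nothing runout with its expectation removes
    the runout variance from the hand without changing the mean.
    """
    return pot_bb * hero_equity

# Example: all in on the flop for a 200 BB pot with 70% equity.
# The actual runout pays 200 or 0; the equity chop always credits 140.
print(equity_chop_score(200, 0.70))  # -> 140.0
```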
I think anyone who looks at this objectively will recognize that we did everything we could to achieve statistical significance: four humans, mirrored hands, equity chopping, and as many hands as the humans could possibly play. That said, we recognize there are things we could improve if there's a next time (more humans, multi-tabling, and maybe additional variance-reduction techniques).
200BB stacks were chosen for this competition, and as the ACPC format, because deeper stacks are more challenging for bots: the game tree is much larger. We trained new strategies for this competition, and we could easily have switched them to a 100BB format. We didn't pick 200BB to make things easy on ourselves, just as we didn't pick big-name poker players who aren't so great at heads-up no-limit to make things easy on ourselves.
I'll also say that I think the humans had a stronger edge than the statistics reveal. This comes from personally watching the humans and the bot play over the past two weeks, and it can't be captured in statistics that just look at the score of each hand. There are definitely weaknesses in the bot, and they're weaknesses that can't be fixed with just more memory or more cores. As a researcher, that's very exciting, because it means we need new approaches to address the unique challenges of no-limit games. The next couple of years will be very busy for me!
For those of you who are interested, here's the breakdown of the scores per session (usually 750-800 hands):