WCGRider, Dong Kim, Jason Les and Bjorn Li to play against a new HU bot

05-08-2015 , 04:55 PM
Quote:
Originally Posted by feedmykids2
They really only played 40,000 hands.
well yeah

but still.

there were like <10 all-ins preflop in 40k hands.

And both the bot and the players 4-bet/5-bet bluffed a decent amount.

Last edited by Kirbynator; 05-08-2015 at 05:19 PM.
05-08-2015 , 05:05 PM
Quote:
Originally Posted by Frankie Fuzz
I asked earlier in the thread how statistical significance would be calculated for this and nobody responded. Shouldn't this have been stated somewhere prior to the match?

Perhaps they should also have realized statistical significance >95% would never be met in 20/40/80k hands against tough competition... Seems like a freeroll for news headlines though. That aspect would get ignored if Claudico won, yet would be brought up if it lost. (Although I am not suggesting there was any scheming ahead of time... just a shift in criteria post loss)

They knew from the bot competitions that it took nearly 1 million hands against "Prelude" to get to >95%.
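A rough sense of why it can take on the order of a million hands: here is a minimal sketch of the usual sample-size arithmetic. The 2 bb/100 edge and the two standard deviations below are illustrative assumptions on my part, not numbers from the match or the bot competition.

Code:
from math import ceil

def hands_needed(edge_bb100, sd_bb100, z=1.96):
    # Hands needed so a true edge of edge_bb100 clears a 95% two-sided test,
    # assuming i.i.d. 100-hand blocks with standard deviation sd_bb100 (in bb).
    blocks = (z * sd_bb100 / edge_bb100) ** 2
    return ceil(blocks) * 100

print(hands_needed(2.0, 150.0))  # ~2.2M hands at a typical raw heads-up variance
print(hands_needed(2.0, 60.0))   # ~350k hands if duplicate play cuts the variance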
05-08-2015 , 05:14 PM
Quote:
Originally Posted by Wasp
It is a great measurement. So if we remove the weakest opponents (the fish), Tartanian beat the field at 2 bb/100. We do the same in the Claudico vs. Brains competition, because the Brains vs. fish results don't matter either.

And can we say the Brains beat Claudico in at least roughly the same way as Tartanian beat the field of bots it played against?

If we can, can we say the Brains were like a "nuclear weapon" against CMU's poker artificial intelligence?
Ignoring everything but the math, they are different situations. With only 40k pairs of duplicate hands, all you can say is that, with 95% confidence, the expected value of humans vs. Claudico is between -1.19 bb/100 hands and +19.51 bb/100 hands (I think there's an assumption of stationarity here too, but... what else are you going to do?) The computer poker results have a smaller observed edge but much less variance, so that, for example, you can say with 95% confidence that the expected value of Tartanian7 vs. Prelude is between +0.398 bb/100 hands and +3.554 bb/100 hands: all positive.

TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...

edit: (just to note the "we" here is an academic writing tic -- it's we as in all you readers, plus me. I'm not involved with the match in any way...)
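To make those numbers concrete, here is a minimal sketch that rebuilds both intervals from just the 9.16 and 10.35 figures quoted above; the z-values are the standard normal ones, and the small gap vs. the 8.77 in the thread presumably comes from rounding in the quoted half-width.

Code:
from statistics import NormalDist

mean = 9.16          # observed human winrate, bb/100 hands (from the post above)
half_width = 10.35   # reported 95% two-sided half-width, bb/100 hands

se = half_width / 1.96                 # implied standard error of the mean
z_two = NormalDist().inv_cdf(0.975)    # ~1.96
z_one = NormalDist().inv_cdf(0.95)     # ~1.645

two_sided = (mean - z_two * se, mean + z_two * se)  # (-1.19, 19.51)
one_sided_lower = mean - z_one * se                 # ~0.47: all positive

print(two_sided, one_sided_lower)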

Last edited by nburch; 05-08-2015 at 05:29 PM.
05-08-2015 , 05:33 PM
Quote:
Originally Posted by Kirbynator
well yeah

but still.

there were like <10 all-ins preflop in 40k hands.

And both the bot and the players 4-bet/5-bet bluffed a decent amount.
Ya, I def agree there were surprisingly few preflop all-ins.
05-08-2015 , 05:34 PM
Quote:
Originally Posted by nburch
TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...
Exactly. The Humans decisively defeated the AI based on proper study design, as opposed to a dubious tie claimed by after-the-fact fudging of the results.

Hypotheses must be established prior to a study. This is so standard in academic circles that it is not debatable, as it prevents the actual results from influencing the conclusions (in this case, rejecting the proposition that the Humans are better than the Bot).

The appropriate a priori hypothesis was that the Human team is better than the Bot. Making this presumption in the study design permits 95% confidence at a lower winrate threshold of 8.77 bb/100. The alternative a priori hypothesis, that the Bot and the Humans are equal, would be bad design: we had every experiential reason to believe the Humans were better, and the one-sided framing sets a tighter statistical bar for the Bot to clear, i.e. it gives the study greater statistical power to discriminate between the null hypothesis and the alternative that the Humans are better than the AI.

Next time the Humans should get a statistician independent of the Uni to frame the a priori hypothesis and the exact statistical tests to be used up front.

This should not be being debated after the fact. It is a clear win for the Humans at the 95% confidence level.

I await the Prof correcting the misleading press release.
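For what it's worth, here is a minimal sketch of how the one-sided and two-sided views of the same result compare. It uses only the 9.16 ± 10.35 interval discussed above, and the framing of "humans better" as the alternative is the poster's, not the university's.

Code:
from statistics import NormalDist

mean, half_width = 9.16, 10.35   # bb/100 hands, from the reported 95% interval
se = half_width / 1.96           # implied standard error
z = mean / se                    # ~1.73

p_two_sided = 2 * (1 - NormalDist().cdf(z))  # ~0.083 -> not significant at 5%
p_one_sided = 1 - NormalDist().cdf(z)        # ~0.041 -> significant at 5%

print(round(z, 2), round(p_two_sided, 3), round(p_one_sided, 3))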
05-08-2015 , 05:47 PM
Quote:
Originally Posted by nburch
Ignoring everything but the math, they are different situations. With only 40k pairs of duplicate hands, all you can say is that, with 95% confidence, the expected value of humans vs. Claudico is between -1.19 bb/100 hands and +19.51 bb/100 hands (I think there's an assumption of stationarity here too, but... what else are you going to do?) The computer poker results have a smaller observed edge but much less variance, so that, for example, you can say with 95% confidence that the expected value of Tartanian7 vs. Prelude is between +0.398 bb/100 hands and +3.554 bb/100 hands: all positive.

TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...

edit: (just to note the "we" here is an academic writing tic -- it's we as in all you readers, plus me. I'm not involved with the match in any way...)
Well, it seems logical. But if it is a competition, then there has to be a winner sometimes, and if you use too high a confidence criterion there won't be one.

(If it is not a competition, there is only one conclusion we can make: in a realistic scenario we can say, at 95% confidence, that the Claudico we saw will never be able to beat the rake against top regulars.)

But I am optimistic and I think the event was what it was billed as: a competition between near-top human players and the best poker AI today. So we need to lower our requirements about confidence, and we can do that easily: let's calculate a 90% or 80% confidence criterion when we play fewer hands against tougher competition. On a smaller sample this is a fair, scientifically correct and acceptable way to decide who the winner is, if I understand it right.

For example we can say: despite the small sample size, the humans were better than Claudico, but with "only" 90% confidence.

(I am not even mentioning that if you integrate the winrate/probability curve you get a much higher value for the human team than Tartanian achieved against the bots, not to mention the human-unfriendly environment, etc.)

Lowering the confidence criterion for a smaller sample size and tougher competition is logical and the only way that can be correct, if we wanted to see a fair competition, right?
05-08-2015 , 05:47 PM
Given the absurdity of the AI folding A4 in the hand where Doug had 99 and the Bot had a forced pot-odds call, the Bot's playing standard is uneven. In some spots, like this A4 hand, it makes classic beginner mistakes. But it was alarming to see the Prof fail to understand why this was a beginner error despite Doug trying multiple times to explain it before giving up.

There is something fundamentally wrong in the Bot's algorithm for it to fold when pot-committed against even the tightest range possible. The Bot has either not taken pot odds into account or has ranged Doug on solely AA. But a range of solely AA cannot be GTO, as it must be balanced with bluffs. And if the Bot ranged Doug on {AA + balanced bluffs}, A4 becomes a forced call.

It would be interesting to hear from the AI team what is fundamentally wrong with the bot's approach in these pot committed spots.

I think the Humans regularly took advantage of this flaw by shoving over a big pot much more frequently than they would have against a human opponent.
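A minimal pot-odds sketch of why a spot like that reads as a forced call; the pot and bet sizes below are made up for illustration, since the actual hand's numbers aren't given in this thread.

Code:
# Hypothetical sizes -- not the actual hand.
pot_before_shove = 3000   # chips in the middle before the river shove
shove = 1000              # amount the Bot must call

required_equity = shove / (pot_before_shove + shove + shove)  # ~0.20

# Even against a value-heavy range like {AA + enough bluffs to balance},
# a hand with ~25% equity is a forced call when it only needs ~20%.
assumed_equity_vs_range = 0.25
print(required_equity, assumed_equity_vs_range > required_equity)  # 0.2 True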
05-08-2015 , 05:51 PM
Quote:
Originally Posted by nburch
Ignoring everything but the math, they are different situations. With only 40k pairs of duplicate hands, all you can say is that, with 95% confidence, the expected value of humans vs. Claudico is between -1.19 bb/100 hands and +19.51 bb/100 hands (I think there's an assumption of stationarity here too, but... what else are you going to do?) The computer poker results have a smaller observed edge but much less variance, so that, for example, you can say with 95% confidence that the expected value of Tartanian7 vs. Prelude is between +0.398 bb/100 hands and +3.554 bb/100 hands: all positive.

TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...

edit: (just to note the "we" here is an academic writing tic -- it's we as in all you readers, plus me. I'm not involved with the match in any way...)
So, from this, can we assume then that std.dev for 40k hands was ~ 106 ?
05-08-2015 , 05:57 PM
I would be interested to see how they calculated the confidence interval.

Let's call a pair of mirrored hands a "game," so there are 40k games and the humans won at 18.28 bb/100 games (double the bb/100 hands winrate).

Let X and Y be two random variables describing two hands that are mirrored. For simplicity we can assume V(X) = V(Y) = v.

Note V(X+Y) = V(X) + V(Y) + 2cov(X,Y) = 2v*(1+rho), where rho is the correlation between the outcome of two mirrored hands. So the variance of a game is 2v*(1+rho)

Let n be the number of games and w the winrate. Then the standard deviation of the average is: sqrt(2v*(1+rho)/n). As long as w > 1.96*sqrt(2v*(1+rho)/n), the humans are winning at the 95% confidence level (using two tails).

Suppose we assume that v = 256, corresponding to a stdev of 16 bb/hand, or 160 bb/100 hands.
Then as long as rho <= -0.33 (which I think is reasonable), the humans are winning at 95%.

This obviously isn't what I would do if I had the data, but as an estimate it just seems really close, too close to call it a "tie".
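A minimal check of that estimate, plugging in the same assumed v = 256 and solving for the correlation at which the two-sided 95% threshold is exactly met; every input here is the poster's assumption, not match data.

Code:
from math import sqrt

n = 40_000    # duplicate "games" (pairs of mirrored hands)
w = 0.1828    # observed winrate per game, in bb (18.28 bb/100 games)
v = 256.0     # assumed per-hand variance, i.e. a stdev of 16 bb/hand

# Var(game) = Var(X + Y) = 2*v*(1 + rho). Significance at 95% (two-sided)
# needs w > 1.96 * sqrt(2*v*(1 + rho)/n); solve for the break-even rho:
rho_break_even = (w * sqrt(n) / 1.96) ** 2 / (2 * v) - 1
print(round(rho_break_even, 2))  # ~ -0.32: any more negative correlation between
                                 # mirrored hands and the humans clear the bar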
05-08-2015 , 05:58 PM
Quote:
Originally Posted by Wasp
Well, it seems logical. But if it is a competition, then there has to be a winner sometimes, and if you use too high a confidence criterion there won't be one.

So we need to lower our requirements about confidence, and we can do that easily: let's calculate a 90% or 80% confidence criterion when we play fewer hands against tougher competition. On a smaller sample this is a fair, scientifically correct and acceptable way to decide who the winner is, if I understand it right.
No, it is bad study design to set the confidence level arbitrarily. The usual 95% confidence level is a balance between having sufficient statistical power to discriminate between the alternative hypothesis and the null hypothesis, and avoiding falsely rejecting the null. If you set the confidence level at, say, 80%, you increase the risk of falsely rejecting the null hypothesis to 20%. If you set it too high, say 99%, you increase the risk of failing to reject the null hypothesis when it is in actuality false, if that makes sense.

That is why it is crucial to set the hypotheses correctly up front (a priori): it permits a higher confidence level at a lower required winrate.

In this case the Humans defeated the AI with 95% confidence, given the a priori hypothesis that the Humans are better than the AI.
05-08-2015 , 06:02 PM
Quote:
Originally Posted by Frankie Fuzz
I asked earlier in the thread how statistical significance would be calculated for this and nobody responded. Shouldn't this have been stated somewhere prior to the match?
No. If the bot won -> statistically significant. If not -> not significant.
05-08-2015 , 06:02 PM
Quote:
Originally Posted by Wasp
Well, it seems logical. But if it is a competition, then there has to be a winner sometimes, and if you use too high a confidence criterion there won't be one.

(If it is not a competition, there is only one conclusion we can make: in a realistic scenario we can say, at 95% confidence, that the Claudico we saw will never be able to beat the rake against top regulars.)

But I am optimistic and I think the event was what it was billed as: a competition between near-top human players and the best poker AI today. So we need to lower our requirements about confidence, and we can do that easily: let's calculate a 90% or 80% confidence criterion when we play fewer hands against tougher competition. On a smaller sample this is a fair, scientifically correct and acceptable way to decide who the winner is, if I understand it right.

For example we can say: despite the small sample size, the humans were better than Claudico with 90% confidence. And if we measure the human team, we can say they had a higher winrate at the same confidence than Tartanian had against weaker competition.

(I am not even mentioning that if you integrate the winrate/probability curve you get a much higher value for the human team than Tartanian achieved against the bots.)

Lowering the confidence criterion for a smaller sample size and tougher competition is logical and the only way that is correct, right?
I'm pretty sure you are never, ever supposed to do that. You decide ahead of time what the right test is and then gather data to do the test. You don't gather data, and then decide afterwards what the test should be :/ You have to be willing to accept "I don't know" as an answer.

If you don't like "I don't know" as an answer, use the data you do have to make a guess at what you would need to do to get a different answer.
05-08-2015 , 06:02 PM
Quote:
Originally Posted by +VLFBERH+T
So, from this, can we assume then that std.dev for 40k hands was ~ 106 ?
That is a two-sided confidence interval, and it is wrong to apply here. We believe we know the Humans have a winrate better than zero in a rake-free environment! We further believe the Humans are better than the AI.

So in your spreadsheet, put in a confidence level of 90% and that will give you the one-sided 95% CI appropriate to the Human vs. AI challenge.
05-08-2015 , 06:08 PM
http://en.wikipedia.org/wiki/Type_I_and_type_II_errors

Quote:
In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a "false positive"), while a type II error is the failure to reject a false null hypothesis (a "false negative"). More simply stated, a type I error is detecting an effect that is not present, while a type II error is failing to detect an effect that is present. The terms "type I error" and "type II error" are often used interchangeably with the general notion of false positives and false negatives in binary classification, such as medical testing, but narrowly speaking refer specifically to statistical hypothesis testing in the Neyman–Pearson framework
The confidence level is typically set at 95% to strike a balance between Type I and Type II errors.

In the current case the Uni is making a Type II error by failing to use the one-sided test and wrongly concluding from the two-sided test that its null hypothesis, Bot = Human, is still valid when it is likely false (at 95% confidence).
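A minimal simulation of that Type I / Type II trade-off. The per-100-hands standard deviation and the "true" human edge below are illustrative assumptions (chosen to be roughly consistent with the interval quoted earlier in the thread), not match data, and only the "humans declared winners" tail is counted.

Code:
import random
from statistics import NormalDist

random.seed(0)
n_blocks = 400                 # 40,000 duplicate pairs as 400 blocks of 100
sd = 105.0                     # assumed stdev per block, in bb (illustrative)
se = sd / n_blocks ** 0.5      # standard error of the observed mean

def human_win_rate(true_edge, z_crit, trials=20_000):
    # Fraction of simulated matches where the observed mean clears z_crit * se.
    hits = 0
    for _ in range(trials):
        hits += random.gauss(true_edge, se) > z_crit * se
    return hits / trials

z_one, z_two = NormalDist().inv_cdf(0.95), NormalDist().inv_cdf(0.975)
print("false positives, no true edge:", human_win_rate(0.0, z_one), human_win_rate(0.0, z_two))
print("power, true edge of 9 bb/100: ", human_win_rate(9.0, z_one), human_win_rate(9.0, z_two))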
05-08-2015 , 06:13 PM
What is most absurd is that the article looks at the ratio of 732k won to 170 million bet and points out what a small number that is. Change the 732k to 849k (still less than half of 1%) and suddenly the humans are (even using the 2-sided confidence interval) "statistically winning".
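A quick conversion sketch behind that claim, assuming $50/$100 blinds (so 1 bb = $100) and 80,000 hands; the stakes are not stated in this thread, so treat them as an assumption.

Code:
BB_DOLLARS = 100      # assumed big blind size ($50/$100 stakes)
HANDS = 80_000
THRESHOLD = 10.35     # bb/100 needed to clear the two-sided 95% interval

def winrate_bb100(dollars_won):
    return dollars_won / BB_DOLLARS / (HANDS / 100)

print(winrate_bb100(732_000), winrate_bb100(849_000))  # ~9.15 vs ~10.61
print(winrate_bb100(849_000) > THRESHOLD)              # True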
05-08-2015 , 06:14 PM
Quote:
Originally Posted by nburch
I'm pretty sure you are never, ever supposed to do that. You decide ahead of time what the right test is and then gather data to do the test. You don't gather data, and then decide afterwards what the test should be :/ You have to be willing to accept "I don't know" as an answer.

If you don't like "I don't know" as an answer, use the data you do have to make a guess at what you would need to do to get a different answer.
We asked before the experiment what winrate would be "statistically significant." There was no answer.

After the competition I see smart men playing with numbers while they cannot simply state the probabilities; "we don't know who won, ergo it is a tie" is a lie.

We can say: there is (for example) a 90% chance that the humans are way better than Claudico, and 10% for the opposite.

If you had told people BEFORE the competition that 10 bb/100 wouldn't be enough for the human team to prove they are better than Claudico, every poker player would have told you that is ridiculous.

AFTER the competition it is easy to play with the numbers to lessen the damage of the loss. It is okay, I get it. It is just unfair.
05-08-2015 , 06:15 PM
Quote:
Originally Posted by TimTamBiscuit
That is a two-sided confidence interval, and it is wrong to apply here. We believe we know the Humans have a winrate better than zero in a rake-free environment! We further believe the Humans are better than the AI.

So in your spreadsheet, put in a confidence level of 90% and that will give you the one-sided 95% CI appropriate to the Human vs. AI challenge.
Honestly, I didn't know what to expect ex ante. In 1997's Kasparov vs. Deep Blue, most people would probably have put the most likely outcome as "brain wins." The 1997 "brain" got extremely tilted and lost, and I may be almost the only one ITT who says this, but before the match I wouldn't have been too sure which side to bet on; I simply didn't have information on Claudico. A lot of people also just want the humans to win, for obvious reasons, but for a statistical significance test, I don't think it would have been that obvious to take "the humans win" as the hypothesis.
05-08-2015 , 06:18 PM
Quote:
Originally Posted by TimTamBiscuit
http://en.wikipedia.org/wiki/Type_I_and_type_II_errors



The confidence level is typically set at 95% to strike a balance between Type I and Type II errors.

In the current case the Uni is making a Type II error by failing to use the one-sided test and wrongly concluding from the two-sided test that its null hypothesis, Bot = Human, is still valid when it is likely false (at 95% confidence).
Eh. I disagree with that necessarily being the hypothesis. The question is "is Claudico better, or worse, or we can't tell?" There are three answers, and the 2-tailed test is appropriate.

So... Actually I take back what I said earlier. The two-tailed test is right, and the answer is -- there weren't enough hands to say.

edit: even if there was no clear question stated ahead of time, the two-tailed test seems like the obvious question to me: I would at least be hoping to be able to say I won, not just that I didn't lose... It's just kind of awkward when some humans _do_ have that assumption, making it a one-tailed test for them, which happens to pass :/
edit 2: I guess it also makes a fantastic example of why you're always supposed to set up statistical tests ahead of time.

Last edited by nburch; 05-08-2015 at 06:26 PM.
05-08-2015 , 06:23 PM
Looking at hard cutoffs to determine "win" or "tie" is extremely arbitrary and the cause of a lot of bad science (see p-value hacking).

In reality, evidence should be assessed in a continuous manner. If their winrate had been 10.3498 bb/100, would it make sense to also call it a "tie"? Is 10.3498 bb/100 much different from 10.3501 bb/100?

It would be much more honest to say that the humans crushed the bot with pretty high statistical significance.
05-08-2015 , 06:37 PM
Quote:
Originally Posted by nburch
Eh. I disagree with that necessarily being the hypothesis. The question is "is Claudico better, or worse, or we can't tell?" There are three answers, and the 2-tailed test is appropriate.

So... Actually I take back what I said earlier. The two-tailed test is right, and the answer is -- there weren't enough hands to say.

edit: even if there was no clear question stated ahead of time, the two-tailed test seems like the obvious question to me: I would at least be hoping to be able to say I won, not just that I didn't lose... It's just kind of awkward when some humans _do_ have that assumption, making it a one-tailed test for them, which happens to pass :/
edit 2: I guess it also makes a fantastic example of why you're always supposed to set up statistical tests ahead of time.
In any case: is 106 the correct standard deviation, then?
05-08-2015 , 06:37 PM
Quote:
Originally Posted by polarizeddeck
Looking at hard cutoffs to determine "win" or "tie" is extremely arbitrary and the cause of a lot of bad science (see p-value hacking).

In reality, evidence should be assessed in a continuous manner. If their winrate had been 10.3498 bb/100, would it make sense to also call it a "tie"? Is 10.3498 bb/100 much different from 10.3501 bb/100?

It would be much more honest to say that the humans crushed the bot with pretty high statistical significance.
No! It's so very tempting, and I just made the same mistake.

Come up with your question ahead of time, decide what answers you want to be able to distinguish, and how much error you're willing to accept. Run the test and accept the answer your data tells you, which might be "I don't know."

If the magic cutoff is 10.35, 10.3498 tells you... "you don't know." There's a wide range of values which mean you don't know, and you shouldn't treat any of them specially. It can suggest a new question, and that should have new data.
05-08-2015 , 06:41 PM
Quote:
Originally Posted by nburch
No! It's so very tempting, and I just made the same mistake.

Come up with your question ahead of time, decide what answers you want to be able to distinguish, and how much error you're willing to accept. Run the test and accept the answer your data tells you, which might be "I don't know."

If the magic cutoff is 10.35, 10.3498 tells you... "you don't know." There's a wide range of values which mean you don't know, and you shouldn't treat any of them specially. It can suggest a new question, and that should have new data.
I disagree. You're describing a decision theory framework in which you decide a test ahead of time and adjust it to get bounds on type I and type II errors.

This doesn't mean that evidence should not be assessed in a continuous manner. If you think about it from a Bayesian perspective it becomes clearer.

Coming up with a question ahead of time is indeed good, because it helps one avoid the problem of testing too many hypotheses (which is a major problem in the social sciences, where one comes up with the hypothesis after the data, testing so many that some of them end up "significant").

For example, suppose I flip a coin 10 times and 8 out of 10 times it comes out heads. This may not be statistically significant evidence at the 95% level that the coin is biased towards heads, but any rational person would bet money (if they were forced to bet) on heads for the next outcome.
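A minimal sketch of that coin example (my own numbers, just to make the point concrete): the one-sided binomial p-value for 8+ heads in 10 fair flips, and the Bayesian posterior probability of heads under a uniform prior.

Code:
from math import comb

n, k = 10, 8
# One-sided p-value: P(X >= 8) for a fair coin
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # 56/1024 ~= 0.055

# Bayesian view: uniform Beta(1,1) prior -> Beta(9,3) posterior, so the
# predictive probability that the next flip is heads is 9/12 = 0.75.
posterior_heads = (k + 1) / (n + 2)

print(round(p_value, 4), posterior_heads)  # 0.0547 0.75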
05-08-2015 , 06:47 PM
Quote:
Originally Posted by +VLFBERH+T
In any case: is 106 the correct standard deviation, then?
It seems like there's a factor of 10 missing somehow. I assume Noam used s*1.96/sqrt(40000) to come up with the confidence interval, where s is the standard deviation of a single duplicate-hand average value*. So 10.35 bb/100 hands = s*1.96/sqrt(40000) => s = 1056 bb/100 hands. Or you could do s = 0.1035 bb/hand * sqrt(40000)/1.96 = 10.56 bb/hand.
Funny units? The digits seem right.

* Someone was doing something with the duplicate hands and wondering why it's even close. The duplicate values should be the average, not just the sum. You would like the expected value over the 80k hands to be the same as over the 40k duplicate pairs, and that doesn't happen unless you take the average value for each pair.
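A quick check of that back-of-the-envelope calculation, using only the 10.35 half-width quoted in the thread; the unit conversion in the last line is mine.

Code:
from math import sqrt

half_width = 10.35   # 95% half-width, bb/100 hands
n_pairs = 40_000     # duplicate-hand pairs

s_bb100 = half_width * sqrt(n_pairs) / 1.96  # ~1056 bb/100 hands per pair-average
s_bb_hand = s_bb100 / 100                    # ~10.56 bb/hand: same digits, shifted units

print(round(s_bb100, 1), round(s_bb_hand, 2))  # 1056.1 10.56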
05-08-2015 , 07:00 PM
Quote:
Originally Posted by polarizeddeck
I disagree. You're describing a decision theory framework in which you decide a test ahead of time and adjust it to get bounds on type I and type II errors.

This doesn't mean that evidence should not be assessed in a continuous manner. If you think about it from a Bayesian perspective it becomes clearer.

Coming up with a question ahead of time is indeed good, because it helps one avoid the problem of testing too many hypotheses (which is a major problem in the social sciences, where one comes up with the hypothesis after the data, testing so many that some of them end up "significant").

For example, suppose I flip a coin 10 times and 8 out of 10 times it comes out heads. This may not be statistically significant evidence at the 95% level that the coin is biased towards heads, but any rational person would bet money (if they were forced to bet) on heads for the next outcome.
What is a continuous manner of assessment telling you?

Clearly, the just-finished run was so close to being a statistically significant win for humans that it would suggest that any subsequent test of the same players should probably use a single-tailed test assuming humans are better. That's all Bayes would be saying, no? My posterior beliefs are now strongly biased towards assuming these humans are ahead of this bot (and maybe good humans and bots in general, for now.)
You could look at the data and come up with good betting odds.
And with nothing else stated ahead of time, maybe anyone who wrote, ahead of time, that they assume humans are better, should possibly be justified in assuming they're correct...
05-08-2015 , 07:02 PM
edit: forget about my question, I should read before asking stupid things...

      