WCGRider, Dong Kim, Jason Les and Bjorn Li to play against a new HU bot

05-08-2015 , 04:55 PM
Quote:
Originally Posted by feedmykids2
They really only played 40,000 hands.
well yeah

but still.

there were like <10 all-ins preflop in 40k hands.

And both the bot and the players 4-bet/5-bet bluffed a decent amount.

Last edited by Kirbynator; 05-08-2015 at 05:19 PM.
05-08-2015 , 05:05 PM
Quote:
Originally Posted by Frankie Fuzz
I asked earlier in the thread how statistical significance would be calculated for this and nobody responded. Shouldn't this have been stated somewhere prior to the match?

Perhaps they should also have realized statistical significance >95% would never be met in 20/40/80k hands against tough competition... Seems like a freeroll for news headlines though. That aspect would get ignored if Claudico won, yet would be brought up if it lost. (Although I am not suggesting there was any scheming ahead of time... just a shift in criteria post loss)

They knew from the bot competitions that it took nearly 1 million hands against "Prelude" to get to >95%.
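A rough sense of why it can take on the order of a million hands: here is a minimal sketch of the usual sample-size arithmetic. The 2 bb/100 edge and the two standard deviations below are illustrative assumptions on my part, not numbers from the match or the bot competition.

Code:
from math import ceil

def hands_needed(edge_bb100, sd_bb100, z=1.96):
    # Hands needed so a true edge of edge_bb100 clears a 95% two-sided test,
    # assuming i.i.d. 100-hand blocks with standard deviation sd_bb100 (in bb).
    blocks = (z * sd_bb100 / edge_bb100) ** 2
    return ceil(blocks) * 100

print(hands_needed(2.0, 150.0))  # ~2.2M hands at a typical raw heads-up variance
print(hands_needed(2.0, 60.0))   # ~350k hands if duplicate play cuts the variance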
05-08-2015 , 05:14 PM
Quote:
Originally Posted by Wasp
It is a great measurement. So if we remove the weakest opponents (the fish), Tartanian beat the field at 2 bb/100. We do the same in the Claudico vs. Brains competition, because the Brains vs. fish results don't matter either.

And can we say the Brains beat Claudico in at least roughly the same way as Tartanian beat the field of bots it played against?

If we can, can we say the Brains were like a "nuclear weapon" against CMU's poker artificial intelligence?
Ignoring everything but the math, they are different situations. With only 40k pairs of duplicate hands, all you can say is that, with 95% confidence, the expected value of humans vs. Claudico is between -1.19 bb/100 hands and +19.51 bb/100 hands (I think there's an assumption of stationarity here too, but... what else are you going to do?) The computer poker results have a smaller observed edge but much less variance, so that, for example, you can say with 95% confidence that the expected value of Tartanian7 vs. Prelude is between +0.398 bb/100 hands and +3.554 bb/100 hands: all positive.

TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...

edit: (just to note the "we" here is an academic writing tic -- it's we as in all you readers, plus me. I'm not involved with the match in any way...)
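To make those numbers concrete, here is a minimal sketch that rebuilds both intervals from just the 9.16 and 10.35 figures quoted above; the z-values are the standard normal ones, and the small gap vs. the 8.77 in the thread presumably comes from rounding in the quoted half-width.

Code:
from statistics import NormalDist

mean = 9.16          # observed human winrate, bb/100 hands (from the post above)
half_width = 10.35   # reported 95% two-sided half-width, bb/100 hands

se = half_width / 1.96                 # implied standard error of the mean
z_two = NormalDist().inv_cdf(0.975)    # ~1.96
z_one = NormalDist().inv_cdf(0.95)     # ~1.645

two_sided = (mean - z_two * se, mean + z_two * se)  # (-1.19, 19.51)
one_sided_lower = mean - z_one * se                 # ~0.47: all positive

print(two_sided, one_sided_lower)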

Last edited by nburch; 05-08-2015 at 05:29 PM.
05-08-2015 , 05:33 PM
Quote:
Originally Posted by Kirbynator
well yeah

but still.

there were like <10 all-ins preflop in 40k hands.

And both the bot and the players 4-bet/5-bet bluffed a decent amount.
Ya, I def agree there were surprisingly few preflop all-ins.
05-08-2015 , 05:34 PM
Quote:
Originally Posted by nburch
TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...
Exactly. The Humans decisively defeated the AI based on proper study design, as opposed to a dubious tie claimed by after-the-fact fudging of the results.

Hypotheses must be established prior to a study. This is so standard in academic circles that it is not debatable, as it prevents the actual results from influencing the conclusions (in this case, rejecting the proposition that the Humans are better than the Bot).

The appropriate a priori hypothesis was that the Human team is better than the Bot. Making this presumption in the study design permits 95% confidence at a lower winrate threshold of 8.77 bb/100. The alternative a priori hypothesis, that the Bot and the Humans are equal, would be bad design: we had every experiential reason to believe the Humans were better, and the one-sided framing sets a tighter statistical bar for the Bot to clear, i.e. it gives the study greater statistical power to discriminate between the null hypothesis and the alternative that the Humans are better than the AI.

Next time the Humans should get a statistician independent of the Uni to frame the a priori hypothesis and the exact statistical tests to be used up front.

This should not be being debated after the fact. It is a clear win for the Humans at the 95% confidence level.

I await the Prof correcting the misleading press release.
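For what it's worth, here is a minimal sketch of how the one-sided and two-sided views of the same result compare. It uses only the 9.16 ± 10.35 interval discussed above, and the framing of "humans better" as the alternative is the poster's, not the university's.

Code:
from statistics import NormalDist

mean, half_width = 9.16, 10.35   # bb/100 hands, from the reported 95% interval
se = half_width / 1.96           # implied standard error
z = mean / se                    # ~1.73

p_two_sided = 2 * (1 - NormalDist().cdf(z))  # ~0.083 -> not significant at 5%
p_one_sided = 1 - NormalDist().cdf(z)        # ~0.041 -> significant at 5%

print(round(z, 2), round(p_two_sided, 3), round(p_one_sided, 3))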
05-08-2015 , 05:47 PM
Quote:
Originally Posted by nburch
Ignoring everything but the math, they are different situations. With only 40k pairs of duplicate hands, all you can say is that, with 95% confidence, the expected value of humans vs. Claudico is between -1.19 bb/100 hands and +19.51 bb/100 hands (I think there's an assumption of stationarity here too, but... what else are you going to do?) The computer poker results have a smaller observed edge but much less variance, so that, for example, you can say with 95% confidence that the expected value of Tartanian7 vs. Prelude is between +0.398 bb/100 hands and +3.554 bb/100 hands: all positive.

TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...

edit: (just to note the "we" here is an academic writing tic -- it's we as in all you readers, plus me. I'm not involved with the match in any way...)
Well, it seems logical. But if it is a competition, then there has to be a winner sometimes, and if you use too high a confidence criterion there won't be one.

(If it is not a competition, there is only one conclusion we can make: in a realistic scenario we can say, at 95% confidence, that the Claudico we saw will never be able to beat the rake against top regulars.)

But I am optimistic and I think the event was what it was billed as: a competition between near-top human players and the best poker AI today. So we need to lower our requirements about confidence, and we can do that easily: let's calculate a 90% or 80% confidence criterion when we play fewer hands against tougher competition. On a smaller sample this is a fair, scientifically correct and acceptable way to decide who the winner is, if I understand it right.

For example we can say: despite the small sample size, the humans were better than Claudico, but with "only" 90% confidence.

(I am not even mentioning that if you integrate the winrate/probability curve you get a much higher value for the human team than Tartanian achieved against the bots, not to mention the human-unfriendly environment, etc.)

Lowering the confidence criterion for a smaller sample size and tougher competition is logical and the only way that can be correct, if we wanted to see a fair competition, right?
05-08-2015 , 05:47 PM
Given the absurdity of the AI folding A4 in the hand where Doug had 99 and the Bot had a forced pot-odds call, the Bot's playing standard is uneven. In some spots, like this A4 hand, it makes classic beginner mistakes. But it was alarming to see the Prof fail to understand why this was a beginner error despite Doug trying multiple times to explain it before giving up.

There is something fundamentally wrong in the Bot's algorithm for it to fold when pot-committed against even the tightest range possible. The Bot has either not taken pot odds into account or has ranged Doug on solely AA. But a range of solely AA cannot be GTO, as it must be balanced with bluffs. And if the Bot ranged Doug on {AA + balanced bluffs}, A4 becomes a forced call.

It would be interesting to hear from the AI team what is fundamentally wrong with the bot's approach in these pot committed spots.

I think the Humans regularly took advantage of this flaw by shoving over a big pot much more frequently than they would have against a human opponent.
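A minimal pot-odds sketch of why a spot like that reads as a forced call; the pot and bet sizes below are made up for illustration, since the actual hand's numbers aren't given in this thread.

Code:
# Hypothetical sizes -- not the actual hand.
pot_before_shove = 3000   # chips in the middle before the river shove
shove = 1000              # amount the Bot must call

required_equity = shove / (pot_before_shove + shove + shove)  # ~0.20

# Even against a value-heavy range like {AA + enough bluffs to balance},
# a hand with ~25% equity is a forced call when it only needs ~20%.
assumed_equity_vs_range = 0.25
print(required_equity, assumed_equity_vs_range > required_equity)  # 0.2 True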
05-08-2015 , 05:51 PM
Quote:
Originally Posted by nburch
Ignoring everything but the math, they are different situations. With only 40k pairs of duplicate hands, all you can say is that, with 95% confidence, the expected value of humans vs. Claudico is between -1.19 bb/100 hands and +19.51 bb/100 hands (I think there's an assumption of stationarity here too, but... what else are you going to do?) The computer poker results have a smaller observed edge but much less variance, so that, for example, you can say with 95% confidence that the expected value of Tartanian7 vs. Prelude is between +0.398 bb/100 hands and +3.554 bb/100 hands: all positive.

TimTamBiscuit's point about 1-sided vs 2-sided is something else again. If we've assumed things are normally distributed and 9.16 +/- 10.35 is one 95% confidence interval, then >= 9.16-8.77 is another 95% confidence interval, and that one is all positive...

edit: (just to note the "we" here is an academic writing tic -- it's we as in all you readers, plus me. I'm not involved with the match in any way...)
So, from this, can we assume then that std.dev for 40k hands was ~ 106 ?
05-08-2015 , 05:57 PM
I would be interested to see how they calculated the confidence interval.

Let's call a pair of mirrored hands a "game," so there are 40k games and the humans won at 18.28 bb/100 games (double the bb/100 hands winrate).

Let X and Y be two random variables describing two hands that are mirrored. For simplicity we can assume V(X) = V(Y) = v.

Note V(X+Y) = V(X) + V(Y) + 2cov(X,Y) = 2v*(1+rho), where rho is the correlation between the outcome of two mirrored hands. So the variance of a game is 2v*(1+rho)

Let n be the number of games and w the winrate. Then the standard deviation of the average is: sqrt(2v*(1+rho)/n). As long as w > 1.96*sqrt(2v*(1+rho)/n), the humans are winning at the 95% confidence level (using two tails).

Suppose we assume that v = 256, corresponding to a stdev of 16 bb/hand, or 160 bb/100 hands.
Then as long as rho <= -0.33 (which I think is reasonable), the humans are winning at 95%.

This obviously isn't what I would do if I had the data, but as an estimate it just seems really close, too close to call it a "tie".
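A minimal check of that estimate, plugging in the same assumed v = 256 and solving for the correlation at which the two-sided 95% threshold is exactly met; every input here is the poster's assumption, not match data.

Code:
from math import sqrt

n = 40_000    # duplicate "games" (pairs of mirrored hands)
w = 0.1828    # observed winrate per game, in bb (18.28 bb/100 games)
v = 256.0     # assumed per-hand variance, i.e. a stdev of 16 bb/hand

# Var(game) = Var(X + Y) = 2*v*(1 + rho). Significance at 95% (two-sided)
# needs w > 1.96 * sqrt(2*v*(1 + rho)/n); solve for the break-even rho:
rho_break_even = (w * sqrt(n) / 1.96) ** 2 / (2 * v) - 1
print(round(rho_break_even, 2))  # ~ -0.32: any more negative correlation between
                                 # mirrored hands and the humans clear the bar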
05-08-2015 , 05:58 PM
Quote:
Originally Posted by Wasp
Well, it seems logical. But if it is a competition, then there has to be a winner sometimes, and if you use too high a confidence criterion there won't be one.

So we need to lower our requirements about confidence, and we can do that easily: let's calculate a 90% or 80% confidence criterion when we play fewer hands against tougher competition. On a smaller sample this is a fair, scientifically correct and acceptable way to decide who the winner is, if I understand it right.
No, it is bad study design to set the confidence level arbitrarily. The usual 95% confidence level is a balance between having sufficient statistical power to discriminate between the alternative hypothesis and the null hypothesis, and avoiding falsely rejecting the null. If you set the confidence level at, say, 80%, you increase the risk of falsely rejecting the null hypothesis to 20%. If you set it too high, say 99%, you increase the risk of failing to reject the null hypothesis when it is in actuality false, if that makes sense.

That is why it is crucial to set the hypotheses correctly up front (a priori): it permits a higher confidence level at a lower required winrate.

In this case the Humans defeated the AI with 95% confidence, given the a priori hypothesis that the Humans are better than the AI.
05-08-2015 , 06:02 PM
Quote:
Originally Posted by Frankie Fuzz
I asked earlier in the thread how statistical significance would be calculated for this and nobody responded. Shouldn't this have been stated somewhere prior to the match?
No. If the bot won -> statistically significant. If not -> not significant.
05-08-2015 , 06:02 PM
Quote:
Originally Posted by Wasp
Well, it seems logical. But if it is a competition, then there has to be a winner sometimes, and if you use too high a confidence criterion there won't be one.

(If it is not a competition, there is only one conclusion we can make: in a realistic scenario we can say, at 95% confidence, that the Claudico we saw will never be able to beat the rake against top regulars.)

But I am optimistic and I think the event was what it was billed as: a competition between near-top human players and the best poker AI today. So we need to lower our requirements about confidence, and we can do that easily: let's calculate a 90% or 80% confidence criterion when we play fewer hands against tougher competition. On a smaller sample this is a fair, scientifically correct and acceptable way to decide who the winner is, if I understand it right.

For example we can say: despite the small sample size, the humans were better than Claudico with 90% confidence. And if we measure the human team, we can say they had a higher winrate at the same confidence than Tartanian had against weaker competition.

(I am not even mentioning that if you integrate the winrate/probability curve you get a much higher value for the human team than Tartanian achieved against the bots.)

Lowering the confidence criterion for a smaller sample size and tougher competition is logical and the only way that is correct, right?
I'm pretty sure you are never, ever supposed to do that. You decide ahead of time what the right test is and then gather data to do the test. You don't gather data, and then decide afterwards what the test should be :/ You have to be willing to accept "I don't know" as an answer.

If you don't like "I don't know" as an answer, use the data you do have to make a guess at what you would need to do to get a different answer.
05-08-2015 , 06:02 PM
Quote:
Originally Posted by +VLFBERH+T
So, from this, can we assume then that std.dev for 40k hands was ~ 106 ?
That is a two-sided confidence interval, and it is wrong to apply here. We believe we know the Humans have a winrate better than zero in a rake-free environment! We further believe the Humans are better than the AI.

So in your spreadsheet, put in a confidence level of 90% and that will give you the one-sided 95% CI appropriate to the Human vs. AI challenge.
05-08-2015 , 06:08 PM
http://en.wikipedia.org/wiki/Type_I_and_type_II_errors

Quote:
In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a "false positive"), while a type II error is the failure to reject a false null hypothesis (a "false negative"). More simply stated, a type I error is detecting an effect that is not present, while a type II error is failing to detect an effect that is present. The terms "type I error" and "type II error" are often used interchangeably with the general notion of false positives and false negatives in binary classification, such as medical testing, but narrowly speaking refer specifically to statistical hypothesis testing in the Neyman–Pearson framework
The confidence level is typically set at 95% to strike a balance between Type I and Type II errors.

In the current case the Uni is making a Type II error by failing to use the one-sided test and wrongly concluding from the two-sided test that its null hypothesis, Bot = Human, is still valid when it is likely false (at 95% confidence).
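A minimal simulation of that Type I / Type II trade-off. The per-100-hands standard deviation and the "true" human edge below are illustrative assumptions (chosen to be roughly consistent with the interval quoted earlier in the thread), not match data, and only the "humans declared winners" tail is counted.

Code:
import random
from statistics import NormalDist

random.seed(0)
n_blocks = 400                 # 40,000 duplicate pairs as 400 blocks of 100
sd = 105.0                     # assumed stdev per block, in bb (illustrative)
se = sd / n_blocks ** 0.5      # standard error of the observed mean

def human_win_rate(true_edge, z_crit, trials=20_000):
    # Fraction of simulated matches where the observed mean clears z_crit * se.
    hits = 0
    for _ in range(trials):
        hits += random.gauss(true_edge, se) > z_crit * se
    return hits / trials

z_one, z_two = NormalDist().inv_cdf(0.95), NormalDist().inv_cdf(0.975)
print("false positives, no true edge:", human_win_rate(0.0, z_one), human_win_rate(0.0, z_two))
print("power, true edge of 9 bb/100: ", human_win_rate(9.0, z_one), human_win_rate(9.0, z_two))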
05-08-2015 , 06:13 PM
What is most absurd is that the article looks at the ratio of 732k won to 170 million bet and points out what a small number that is. Change the 732k to 849k (still less than half of 1%) and suddenly the humans are (even using the 2-sided confidence interval) "statistically winning".
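A quick conversion sketch behind that claim, assuming $50/$100 blinds (so 1 bb = $100) and 80,000 hands; the stakes are not stated in this thread, so treat them as an assumption.

Code:
BB_DOLLARS = 100      # assumed big blind size ($50/$100 stakes)
HANDS = 80_000
THRESHOLD = 10.35     # bb/100 needed to clear the two-sided 95% interval

def winrate_bb100(dollars_won):
    return dollars_won / BB_DOLLARS / (HANDS / 100)

print(winrate_bb100(732_000), winrate_bb100(849_000))  # ~9.15 vs ~10.61
print(winrate_bb100(849_000) > THRESHOLD)              # True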
05-08-2015 , 06:14 PM
Quote:
Originally Posted by nburch
I'm pretty sure you are never, ever supposed to do that. You decide ahead of time what the right test is and then gather data to do the test. You don't gather data, and then decide afterwards what the test should be :/ You have to be willing to accept "I don't know" as an answer.

If you don't like "I don't know" as an answer, use the data you do have to make a guess at what you would need to do to get a different answer.
We asked before the experiment what winrate would be "statistically significant." There was no answer.

After the competition I see smart men playing with numbers while they cannot simply state the probabilities; "we don't know who won, ergo it is a tie" is a lie.

We can say: there is (for example) a 90% chance that the humans are way better than Claudico, and 10% for the opposite.

If you had told people BEFORE the competition that 10 bb/100 wouldn't be enough for the human team to prove they are better than Claudico, every poker player would have told you that is ridiculous.

AFTER the competition it is easy to play with the numbers to lessen the damage of the loss. It is okay, I get it. It is just unfair.
05-08-2015 , 06:15 PM
Quote:
Originally Posted by TimTamBiscuit
That is a two-sided confidence interval, and it is wrong to apply here. We believe we know the Humans have a winrate better than zero in a rake-free environment! We further believe the Humans are better than the AI.

So in your spreadsheet, put in a confidence level of 90% and that will give you the one-sided 95% CI appropriate to the Human vs. AI challenge.
Honestly, I didn't know what to expect ex ante. In 1997's Kasparov vs. Deep Blue, most people would probably have put the most likely outcome as "brain wins." The 1997 "brain" got extremely tilted and lost, and I may be almost the only one ITT who says this, but before the match I wouldn't have been too sure which side to bet on; I simply didn't have information on Claudico. A lot of people also just want the humans to win, for obvious reasons, but for a statistical significance test, I don't think it would have been that obvious to take "the humans win" as the hypothesis.
05-08-2015 , 06:18 PM
Quote:
Originally Posted by TimTamBiscuit
http://en.wikipedia.org/wiki/Type_I_and_type_II_errors



The confidence level is typically set at 95% to strike a balance between Type I and Type II errors.

In the current case the Uni is making a Type II error by failing to use the one-sided test and wrongly concluding from the two-sided test that its null hypothesis, Bot = Human, is still valid when it is likely false (at 95% confidence).
Eh. I disagree with that necessarily being the hypothesis. The question is "is Claudico better, or worse, or we can't tell?" There are three answers, and the 2-tailed test is appropriate.

So... Actually I take back what I said earlier. The two-tailed test is right, and the answer is -- there weren't enough hands to say.

edit: even if there was no clear question stated ahead of time, the two-tailed test seems like the obvious question to me: I would at least be hoping to be able to say I won, not just that I didn't lose... It's just kind of awkward when some humans _do_ have that assumption, making it a one-tailed test for them, which happens to pass :/
edit 2: I guess it also makes a fantastic example of why you're always supposed to set up statistical tests ahead of time.

Last edited by nburch; 05-08-2015 at 06:26 PM.
05-08-2015 , 06:23 PM
Looking at hard cutoffs to determine "win" or "tie" is extremely arbitrary and the cause of a lot of bad science (see p-value hacking).

In reality, evidence should be assessed in a continuous manner. If their winrate had been 10.3498 bb/100, would it make sense to also call it a "tie"? Is 10.3498 bb/100 much different from 10.3501 bb/100?

It would be much more honest to say that the humans crushed the bot with pretty high statistical significance.
05-08-2015 , 06:37 PM
Quote:
Originally Posted by nburch
Eh. I disagree with that necessarily being the hypothesis. The question is "is Claudico better, or worse, or we can't tell?" There are three answers, and the 2-tailed test is appropriate.

So... Actually I take back what I said earlier. The two-tailed test is right, and the answer is -- there weren't enough hands to say.

edit: even if there was no clear question stated ahead of time, the two-tailed test seems like the obvious question to me: I would at least be hoping to be able to say I won, not just that I didn't lose... It's just kind of awkward when some humans _do_ have that assumption, making it a one-tailed test for them, which happens to pass :/
edit 2: I guess it also makes a fantastic example of why you're always supposed to set up statistical tests ahead of time.
In any case: is 106 the correct standard deviation, then?
05-08-2015 , 06:37 PM
Quote:
Originally Posted by polarizeddeck
Looking at hard cutoffs to determine "win" or "tie" is extremely arbitrary and the cause of a lot of bad science (see p-value hacking).

In reality, evidence should be assessed in a continuous manner. If their winrate had been 10.3498 bb/100, would it make sense to also call it a "tie"? Is 10.3498 bb/100 much different from 10.3501 bb/100?

It would be much more honest to say that the humans crushed the bot with pretty high statistical significance.
No! It's so very tempting, and I just made the same mistake.

Come up with your question ahead of time, decide what answers you want to be able to distinguish, and how much error you're willing to accept. Run the test and accept the answer your data tells you, which might be "I don't know."

If the magic cutoff is 10.35, 10.3498 tells you... "you don't know." There's a wide range of values which mean you don't know, and you shouldn't treat any of them specially. It can suggest a new question, and that should have new data.
05-08-2015 , 06:41 PM
Quote:
Originally Posted by nburch
No! It's so very tempting, and I just made the same mistake.

Come up with your question ahead of time, decide what answers you want to be able to distinguish, and how much error you're willing to accept. Run the test and accept the answer your data tells you, which might be "I don't know."

If the magic cutoff is 10.35, 10.3498 tells you... "you don't know." There's a wide range of values which mean you don't know, and you shouldn't treat any of them specially. It can suggest a new question, and that should have new data.
I disagree. You're describing a decision theory framework in which you decide a test ahead of time and adjust it to get bounds on type I and type II errors.

This doesn't mean that evidence should not be assessed in a continuous manner. If you think about it from a Bayesian perspective it becomes clearer.

Coming up with a question ahead of time is indeed good, because it helps one avoid the problem of testing too many hypotheses (which is a major problem in the social sciences, where one comes up with the hypothesis after the data, testing so many that some of them end up "significant").

For example, suppose I flip a coin 10 times and 8 out of 10 times it comes out heads. This may not be statistically significant evidence at the 95% level that the coin is biased towards heads, but any rational person would bet money (if they were forced to bet) on heads for the next outcome.
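A minimal sketch of that coin example (my own numbers, just to make the point concrete): the one-sided binomial p-value for 8+ heads in 10 fair flips, and the Bayesian posterior probability of heads under a uniform prior.

Code:
from math import comb

n, k = 10, 8
# One-sided p-value: P(X >= 8) for a fair coin
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # 56/1024 ~= 0.055

# Bayesian view: uniform Beta(1,1) prior -> Beta(9,3) posterior, so the
# predictive probability that the next flip is heads is 9/12 = 0.75.
posterior_heads = (k + 1) / (n + 2)

print(round(p_value, 4), posterior_heads)  # 0.0547 0.75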
05-08-2015 , 06:47 PM
Quote:
Originally Posted by +VLFBERH+T
In any case: is 106 the correct standard deviation, then?
It seems like there's a factor of 10 missing somehow. I assume Noam used s*1.96/sqrt(40000) to come up with the confidence interval, where s is the standard deviation of a single duplicate-hand average value*. So 10.35 bb/100 hands = s*1.96/sqrt(40000) => s = 1056 bb/100 hands. Or you could do s = 0.1035 bb/hand * sqrt(40000)/1.96 = 10.56 bb/hand.
Funny units? The digits seem right.

* Someone was doing something with the duplicate hands and wondering why it's even close. The duplicate values should be the average, not just the sum. You would like the expected value over the 80k hands to be the same as over the 40k duplicate pairs, and that doesn't happen unless you take the average value for each pair.
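A quick check of that back-of-the-envelope calculation, using only the 10.35 half-width quoted in the thread; the unit conversion in the last line is mine.

Code:
from math import sqrt

half_width = 10.35   # 95% half-width, bb/100 hands
n_pairs = 40_000     # duplicate-hand pairs

s_bb100 = half_width * sqrt(n_pairs) / 1.96  # ~1056 bb/100 hands per pair-average
s_bb_hand = s_bb100 / 100                    # ~10.56 bb/hand: same digits, shifted units

print(round(s_bb100, 1), round(s_bb_hand, 2))  # 1056.1 10.56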
05-08-2015 , 07:00 PM
Quote:
Originally Posted by polarizeddeck
I disagree. You're describing a decision theory framework in which you decide a test ahead of time and adjust it to get bounds on type I and type II errors.

This doesn't mean that evidence should not be assessed in a continuous manner. If you think about it from a Bayesian perspective it becomes clearer.

Coming up with a question ahead of time is indeed good, because it helps one avoid the problem of testing too many hypotheses (which is a major problem in the social sciences, where one comes up with the hypothesis after the data, testing so many that some of them end up "significant").

For example, suppose I flip a coin 10 times and 8 out of 10 times it comes out heads. This may not be statistically significant evidence at the 95% level that the coin is biased towards heads, but any rational person would bet money (if they were forced to bet) on heads for the next outcome.
What is a continuous manner of assessment telling you?

Clearly, the just-finished run was so close to being a statistically significant win for humans that it would suggest that any subsequent test of the same players should probably use a single-tailed test assuming humans are better. That's all Bayes would be saying, no? My posterior beliefs are now strongly biased towards assuming these humans are ahead of this bot (and maybe good humans and bots in general, for now.)
You could look at the data and come up with good betting odds.
And with nothing else stated ahead of time, maybe anyone who wrote, ahead of time, that they assume humans are better, should possibly be justified in assuming they're correct...
05-08-2015 , 07:02 PM
edit: forget about my question, I should read before asking stupid things...

      