WCGRider, Dong Kim, Jason Les and Bjorn Li to play against a new HU bot

05-08-2015 , 07:24 PM
They had a scoreboard that showed the humans won. The scoreboard was all about the human total vs Claudico; that was the contest. If the contest were about statistical significance, why did they not have a bell curve up showing where the match stood within the 95% confidence interval?

There will always be an argument about who was actually better, but the winner is the one who won the most chips. The Deep Blue vs Kasparov match was decided by who won the most points, not by a 95% confidence interval. 95% is arbitrary anyway; it is not 100% proof either.

I also noticed the professor said many times that Claudico was beating a certain player or was winning during a certain session. No mention of 95% there.

In academia the standard convention of PROOF may be a 95% interval, but in human contests of skill the winner is the team that won the most points or chips. Claudico is a sore loser.
05-08-2015 , 07:29 PM
Quote:
Originally Posted by nburch
What is a continuous manner of assessment telling you?

Clearly, the just-finished run was so close to being a statistically significant win for humans that it would suggest that any subsequent test of the same players should probably use a one-tailed test assuming humans are better. That's all Bayes would be saying, no? My posterior beliefs are now strongly biased towards assuming these humans are ahead of this bot (and maybe good humans and bots in general, for now).
You could look at the data and come up with good betting odds.
And with nothing else stated ahead of time, maybe anyone who wrote beforehand that they assumed humans were better would be justified in assuming they're correct...

Would you bet a large amount of money on Claudico (vs the Brains or similarly strong players)? You shouldn't, and I wouldn't. But I would bet on the opposite.

If you won't put your money where your mouth is, your arguments are only smoke and mirrors.

Facts:
- humans did way better than Claudico in this experiment
- we agree that we can say with at least 90% confidence that the best humans are better than the best AI today.

What would you (or the CMU staff) say if humans beat Claudico at 15bb/100? There is a 1 percent chance that Claudico is better than the humans, so this is a tie?! 99 percent vs 1 percent is not a tie at all. Neither is 95:5 or 90:10.

Why is it so damn unfair?

For example, I could easily write a stupid bot that open-shoves a high percentage of hands. No need for million-dollar development costs, staff, renting mainframe time windows, etc. I have a pretty nice garage, so I can do it.
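
Just to show how trivial it would be - a toy sketch, where the hand-strength formula is deliberately crude and every number is made up:

Code:
RANKS = "23456789TJQKA"

def crude_strength(card1, card2, suited):
    """Very rough preflop strength in [0, 1]; a crude stand-in, not a real ranking."""
    score = RANKS.index(card1) + RANKS.index(card2)
    if card1 == card2:
        score += 10          # pairs get a big bump
    if suited:
        score += 2
    return score / 34.0      # 34 = score of AA, the maximum

def act(card1, card2, suited, shove_fraction=0.6):
    """Open-shove if the hand is (roughly) in the top `shove_fraction` of hands."""
    return "shove" if crude_strength(card1, card2, suited) >= 1 - shove_fraction else "fold"

print(act("A", "4", False))  # "shove"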

Let's play a relatively small (I decide before the match) number of hands against Claudico.

After the "experiment" I can tell the public:
- if I lose, I can say that I made a bot that played Claudico to a statistical tie.
- if I win, I will say I won: I made a bot in my garage, alone, that beat the famous CMU AI!

Would that be fair of me? Not at all.

But this is exactly what the CMU team did.

If a loser can't be a man and admit his loss, that is a bigger shame than anything else.
05-08-2015 , 07:32 PM
When they use this AI for airport security and the airport gets hit by a terrorist attack, I hope they don't call it a statistical tie.
05-08-2015 , 07:34 PM
Quote:
Originally Posted by Nit Bag
The Deep Blue vs Kasparov match was decided by who won the most points, not by a 95% confidence interval.
In that context, that would indeed be complete nonsense.

Quote:
Originally Posted by Nit Bag
95% is arbitrary anyway; it is not 100% proof either.
For 100% proof, you would need an infinite number of samples. Two weeks of play for 40k hands was quite an effort already.
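
To illustrate - a quick sketch using the +/-10.35bb/100 figure reported elsewhere in this thread for the full 80k-hand sample; the interval's half-width shrinks only with the square root of the sample size, so it never reaches zero:

Code:
import math

# 95% CI half-width scales as 1/sqrt(n): quadrupling the sample only halves it.
BASE_N, BASE_HALF_WIDTH = 80_000, 10.35  # hands, bb/100 (as reported)

def ci_half_width(n_hands):
    return BASE_HALF_WIDTH * math.sqrt(BASE_N / n_hands)

for n in (80_000, 320_000, 1_280_000):
    print(f"{n:>9,} hands -> +/- {ci_half_width(n):.2f} bb/100")
# Approaches 0 only as n -> infinity; no finite sample gives "100% proof".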
05-08-2015 , 07:47 PM
Quote:
Originally Posted by Wasp
Would you bet a large amount of money on Claudico (vs the Brains or similarly strong players)? You shouldn't, and I wouldn't. But I would bet on the opposite.

If you won't put your money where your mouth is, your arguments are only smoke and mirrors.

Facts:
- humans did way better than Claudico in this experiment
- we agree that we can say with at least 90% confidence that the best humans are better than the best AI today.

What would you (or the CMU staff) say if humans beat Claudico at 15bb/100? There is a 1 percent chance that Claudico is better than the humans, so this is a tie?! 99 percent vs 1 percent is not a tie at all. Neither is 95:5 or 90:10.

Why is it so damn unfair?

For example, I could easily write a stupid bot that open-shoves a high percentage of hands. No need for million-dollar development costs, staff, renting mainframe time windows, etc. I have a pretty nice garage, so I can do it.

Let's play a relatively small (I decide before the match) number of hands against Claudico.

After the "experiment" I can tell the public:
- if I lose, I can say that I made a bot that played Claudico to a statistical tie.
- if I win, I will say I won: I made a bot in my garage, alone, that beat the famous CMU AI!

Would that be fair of me? Not at all.

But this is exactly what the CMU team did.

If a loser can't be a man and admit his loss, that is a bigger shame than anything else.
The professor's ego is hurt somewhat, for sure. Leaving that aside, I don't find the statistical-significance approach objectionable per se - they are academics, and in whatever papers are written about this, no one would be able to include a sentence like "humans did way better than Claudico in this experiment" even if they were willing to. With their statement, they are trying to use the same standard as in the poker bot competition, although of course the sample sizes cannot be compared at all, which makes it look so ridiculous to many people.

They should have communicated before the challenge how they would determine the success of their project; that much is certain. I am pretty sure another such challenge will happen within a year or two, with an improved bot, and they can learn from some of the mistakes being made here.
05-08-2015 , 07:49 PM
Quote:
KDKA’s Susan Koeppen: “Do you get upset when Claudico loses?”
Tuomas Sandholm: “Oh yes, very much.”
lol.

To be fair, grad students who want to study in a niche area often don't have the luxury of choosing between supervisors.

Quote:
“We knew Claudico was the strongest computer poker program in the world, but we had no idea before this competition how it would fare against four Top 10 poker players,” Sandholm says. “It would have been no shame for Claudico to lose to a set of such talented pros, so even pulling off a statistical tie with them is a tremendous achievement.”
This statement is just clearly ridiculous.

I don't understand at all why they have any vested interest whatsoever in angling like this.
05-08-2015 , 07:57 PM
Actually, scrap that, I do have an idea why.

If they are trying to commercialise what they've been working on, I guess it makes sense that they want to make it look good.

Also, it would've been interesting if they had included a control group of people who hadn't played poker before, or rec players.
05-08-2015 , 07:59 PM
Quote:
Originally Posted by +VLFBERH+T
I am pretty sure another such challenge will happen within a year or two, with an improved bot, and they can learn from some of the mistakes being made here.
I doubt this was a mistake.

If a number had been released prior to the challenge (e.g. 10.3bb is a win and everything else is a draw), the AI would have been held to the same standard.

As these rules weren't mentioned beforehand, a small win by the AI could have been presented to the media as a win (getting huge coverage, grants, etc.).

They are obviously just freerolling the players, who in reality were already being handicapped by the other rules and pressures.

There is no way even a 2bb win by the AI wouldn't have been sold as a huge victory.
05-08-2015 , 08:00 PM
Quote:
Originally Posted by nburch
What is a continuous manner of assessment telling you?

Clearly, the just-finished run was so close to being a statistically significant win for humans that it would suggest that any subsequent test of the same players should probably use a one-tailed test assuming humans are better. That's all Bayes would be saying, no? My posterior beliefs are now strongly biased towards assuming these humans are ahead of this bot (and maybe good humans and bots in general, for now).
You could look at the data and come up with good betting odds.
And with nothing else stated ahead of time, maybe anyone who wrote beforehand that they assumed humans were better would be justified in assuming they're correct...
The idea of looking at the evidence continuously is very simple; it's what humans usually do intuitively.

Suppose you were allowed to stake the humans against the bot for some amount of money over the 80k sample, and suppose you knew the winrate in advance.

If the winrate were 0bb/100, you might not stake the humans at all. If the winrate went up to 5bb/100, you might stake them for more.

Now suppose you had bet $100 on the humans winning after being told that their winrate was 10.350001bb/100, and then someone told you their actual winrate was 10.3499999bb/100; it would be illogical to make a big change to your bet.
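
A quick sketch of how little those two winrates differ in implied betting odds - normal approximation with a flat prior (my assumption), and the standard error backed out of the reported +/-10.35bb/100 interval:

Code:
from math import erf, sqrt

SE = 10.35 / 1.96  # bb/100; standard error implied by the reported 95% interval

def prob_humans_better(observed_winrate):
    """P(true winrate > 0) under a normal approximation with a flat prior."""
    return 0.5 * (1 + erf(observed_winrate / (SE * sqrt(2))))

for wr in (10.3499999, 10.350001):
    print(f"{wr} bb/100 -> P(humans better) = {prob_humans_better(wr):.6f}")
# Both print ~0.975: the hair's breadth that flips "statistical significance"
# moves rational betting odds by almost nothing.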
05-08-2015 , 08:22 PM
The leaderboard has been removed from the Twitch stream.
05-08-2015 , 08:42 PM
Quote:
Originally Posted by mack's
and once again people turn away from their ego problem to a professor's ego problem.

Doug said that he beats the top regs he plays for more than 9bb, and that Claudico is extremely good.
This bot is still better than every single person who posted in this thread besides Bjorn/Doug/Dong.

No one is debating that Claudico is pretty damn good... and maybe world class. It's the gross manipulation by an ego-driven, self-interested dip **** who thinks he's smarter than everyone else (indicated by the thinly veiled false narrative that any halfway serious poker player debunked the moment they read it). Can you say Adelson?

Last edited by Nick_AA; 05-08-2015 at 08:48 PM.
05-08-2015 , 09:09 PM
To me it boils down to this: if Claudico had beaten the humans for over 9bb/100 (or over $9 per hand played), the word "tie" would not be mentioned in any way, shape or form -- it would be described as a slaughter of man by machine.

Don't believe everything you read in the papers, kids; and know that even in academia, people are governed by their egos, biases, and desire for the next paycheck.
05-08-2015 , 09:19 PM
Quote:
Originally Posted by NoamBrown
I thought I'd clear up some confusion about the statistics. We calculated the 95% confidence interval based on the 80,000 mirrored hands that were played and it was +/- 10.35bb/100.
Is the fact that the hands are mirrored part of the calculation? For instance, if Doug won $2,500 on hand #123 and Jason lost $1,000 on the mirrored hand #123, is this considered a $1,500 win for the purposes of calculating the standard deviation?
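
For what it's worth, here is one plausible paired treatment - my guess, not necessarily what CMU actually computed, and the dollar figures below are made-up placeholders:

Code:
import math

# Sum the two results of each mirrored deal into one sample; the pairing
# cancels much of the card luck and so shrinks the variance.
# Placeholder data in $: (player A's result, player B's result on the mirror).
mirrored_pairs = [(2500.0, -1000.0), (-300.0, 450.0), (120.0, 80.0)]

samples = [a + b for a, b in mirrored_pairs]  # e.g. +2500 and -1000 -> +1500
n = len(samples)
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / (n - 1)
half_width = 1.96 * math.sqrt(var / n)        # 95% CI, normal approximation

print(f"mean per mirrored deal: ${mean:.2f} +/- ${half_width:.2f}")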
05-08-2015 , 09:20 PM
GREAT point, restorativejustice.

QUESTION TO BOT CREATORS/ARTICLE WRITER: how would you have worded the article if the bot had won by the exact same margin?
05-08-2015 , 09:35 PM
Quote:
Originally Posted by Nick_AA
No one is debating that Claudico is pretty damn good... and maybe world class. It's the gross manipulation by an ego-driven, self-interested dip **** who thinks he's smarter than everyone else (indicated by the thinly veiled false narrative that any halfway serious poker player debunked the moment they read it). Can you say Adelson?
I disagree. I think Claudico plays well in some spots but plays terribly in other commonly occurring spots. The mistake you are making is the logical fallacy of generalization. You wrongly assume Claudico is strong in ALL aspects of its play because you have observed some hands where it appears to have played well. You would need to systematically test Claudico across the many different "buckets" of similar types of hands: play in single-raised pots, play in 3B pots, play in 4B pots, play in checked pots, play out of position, play in position, short-stacked, deep-stacked, rainbow board textures, flushing textures, straightening textures, play where the board runout creates a river range containing too many bluffs relative to value, play where the river range contains too few bluffs relative to value, play exploiting calling stations, play exploiting overly aggressive players, play exploiting nits, etc, etc, etc.

In actuality, we have observed plenty of evidence to suggest Claudico is terrible at GTO in certain types of spots (= can be easily exploited), and it reputedly, by design, does not algorithmically exploit the different types of players at all (it is trying to play GTO and generalise that to non-poker domains, as opposed to being a great poker player rapidly observing, adapting, and maximally exploiting its opponents):

One refutation hand suggesting Claudico is terrible in 4B/5B-sized pots pre-flop:

Doug shoved 99 and the AI incorrectly folded A4o, getting way better pot odds than needed for an automatic call = beginner mistake.

That generalizes to a whole class of errors = losses in similar spots.

Claudico also had what is almost certainly an error with 11x overbets on rivers, risking far too much into little pots: typically the play had checked down to the river, and then Claudico irrationally shoves. Bluffs should be big enough to do the job and no bigger, due to the risk/reward ratio. Ergo, a GTO strategy should use a range of bet sizes for different hand ranges in that kind of spot. Claudico appears to have overdone the bet-size abstraction and did not show anywhere near the bet-size variation we theoretically expect from a GTO strategy.
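
The arithmetic behind that point - standard risk/reward math, nothing specific to Claudico's internals:

Code:
def breakeven_fold_frequency(bet, pot):
    """A pure bluff of `bet` into `pot` profits only if folds exceed bet/(bet+pot)."""
    return bet / (bet + pot)

# A pot-sized bluff needs folds 50% of the time; an 11x-pot overbet needs ~92%.
for mult in (1, 2, 11):
    print(f"{mult}x pot -> needs {breakeven_fold_frequency(mult, 1):.0%} folds")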

Also, Claudico didn't handle min-donk bets well: again, a beginner error, constantly exploited by the humans in almost every such hand.

Claudico made very bad bluffs on rivers in certain spots that Doug and Bjorn set the bot up for and snap-called over and over and over...

There are probably lots and lots of other Claudico weaknesses, too. For example, its inability to maximise winnings against weak players, evident in the bot-vs-bot competition. Any low stakes human player is very good at maximising winnings versus weak players, so that is a major skill deficit. (In fact, most low stakes players arguably can't win unless they exploit the weak players, so Claudico would have a real problem in low stakes games, failing to maximise winnings against the fish while getting killed by the rake.)

I felt that Doug in particular stopped bothering to play seriously at about the halfway point, so I have little doubt that on his A-game for the whole challenge he could have significantly increased his winrate. Whether he was fatigued or didn't want to publicly let his human competitors in high stakes HU learn too much, I don't know. Only Bjorn seemed to keep the pedal down all the way through the challenge. And while Cheet lost badly early on, he was winning at 21bb/100 over the last half as he dialled in on exploiting the bot's weaknesses, so it would be interesting if he would post in the thread what he thinks Claudico's key weaknesses are.

The Humans have a very real incentive not to explain to the AI team all of the AI's errors! But just listing the ones they talked about means Claudico has serious weaknesses.

Last edited by TimTamBiscuit; 05-08-2015 at 09:52 PM.
05-08-2015 , 09:45 PM
Why should these falsehoods and misstatements bother poker players not involved in the Event?

Simple: Humans beating the "best" computer in the world (a computer that beat all its computer opponents) by a large margin, as they did (again, over $9/hand on average over 80,000 hands), supports the argument that poker is a game of skill, while undercutting people who denigrate it as simply "gambling" or as dependent exclusively on luck.

That is important because the luck/skill argument is used by corrupt/stupid legislators who want to ban, or have banned or limited, the play of the game online. The results of this Event, properly characterized, would help expose those legislators as the morons/puppets that they are. Pity they will not be.
05-08-2015 , 09:51 PM
Quote:
Originally Posted by restorativejustice
Why should these falsehoods and misstatements bother poker players not involved in the Event?

Simple: Humans beating the "best" computer in the world (a computer that beat all its computer opponents) by a large margin, as they did (again, over $9/hand on average over 80,000 hands), supports the argument that poker is a game of skill, while undercutting people who denigrate it as simply "gambling" or as dependent exclusively on luck.

That is important because the luck/skill argument is used by corrupt/stupid legislators who want to ban, or have banned or limited, the play of the game online. The results of this Event, properly characterized, would help expose those legislators as the morons/puppets that they are. Pity they will not be.
+1. I agree that this is very important. The challenge, properly explained, supports the contention that poker is a game of skill. The prof's false claims are misleading and self-serving, as he wants to commercialise his software, but they are not in the best interests of poker.
05-08-2015 , 11:22 PM
All of this talk is nonsense anyway, because Doug at least wasn't taking it seriously at the end. People wouldn't be playing in these tired (and forced) conditions irl, so that's just one more strike against the bot. If it had played against a fresh Doug, or Dong, or whatever HS killer wants to play it, it would've lost even more in ideal conditions.
05-09-2015 , 12:27 AM
So the cliffs are: the professor tilts a lot, says things that are scientifically incorrect, and passes judgement multiple times in retrospect to make the science fit his views?

If I were a pro ever doing this, I would ask for way, way more money next time: 100k minimum for my time, with a big-blinds-won/lost incentive.
05-09-2015 , 01:07 AM
Quote:
Originally Posted by TimTamBiscuit
I disagree. I think Claudico plays well in some spots but plays terribly in other commonly occurring spots. The mistake you are making is the logical fallacy of generalization. You wrongly assume Claudico is strong in ALL aspects of its play because you have observed some hands where it appears to have played well. You would need to systematically test Claudico across the many different "buckets" of similar types of hands: play in single-raised pots, play in 3B pots, play in 4B pots, play in checked pots, play out of position, play in position, short-stacked, deep-stacked, rainbow board textures, flushing textures, straightening textures, play where the board runout creates a river range containing too many bluffs relative to value, play where the river range contains too few bluffs relative to value, play exploiting calling stations, play exploiting overly aggressive players, play exploiting nits, etc, etc, etc.

In actuality, we have observed plenty of evidence to suggest Claudico is terrible at GTO in certain types of spots (= can be easily exploited), and it reputedly, by design, does not algorithmically exploit the different types of players at all (it is trying to play GTO and generalise that to non-poker domains, as opposed to being a great poker player rapidly observing, adapting, and maximally exploiting its opponents):

One refutation hand suggesting Claudico is terrible in 4B/5B-sized pots pre-flop:

Doug shoved 99 and the AI incorrectly folded A4o, getting way better pot odds than needed for an automatic call = beginner mistake.

That generalizes to a whole class of errors = losses in similar spots.

Claudico also had what is almost certainly an error with 11x overbets on rivers, risking far too much into little pots: typically the play had checked down to the river, and then Claudico irrationally shoves. Bluffs should be big enough to do the job and no bigger, due to the risk/reward ratio. Ergo, a GTO strategy should use a range of bet sizes for different hand ranges in that kind of spot. Claudico appears to have overdone the bet-size abstraction and did not show anywhere near the bet-size variation we theoretically expect from a GTO strategy.

Also, Claudico didn't handle min-donk bets well: again, a beginner error, constantly exploited by the humans in almost every such hand.

Claudico made very bad bluffs on rivers in certain spots that Doug and Bjorn set the bot up for and snap-called over and over and over...

There are probably lots and lots of other Claudico weaknesses, too. For example, its inability to maximise winnings against weak players, evident in the bot-vs-bot competition. Any low stakes human player is very good at maximising winnings versus weak players, so that is a major skill deficit. (In fact, most low stakes players arguably can't win unless they exploit the weak players, so Claudico would have a real problem in low stakes games, failing to maximise winnings against the fish while getting killed by the rake.)

I felt that Doug in particular stopped bothering to play seriously at about the halfway point, so I have little doubt that on his A-game for the whole challenge he could have significantly increased his winrate. Whether he was fatigued or didn't want to publicly let his human competitors in high stakes HU learn too much, I don't know. Only Bjorn seemed to keep the pedal down all the way through the challenge. And while Cheet lost badly early on, he was winning at 21bb/100 over the last half as he dialled in on exploiting the bot's weaknesses, so it would be interesting if he would post in the thread what he thinks Claudico's key weaknesses are.

The Humans have a very real incentive not to explain to the AI team all of the AI's errors! But just listing the ones they talked about means Claudico has serious weaknesses.
Did not read it all, for obvious reasons, but... your argument is extremely humorous to me. For having such obvious massive weaknesses, it sure did pretty well against 4 of the top players in the world. Pretty sure Doug would have beaten Bjorn if he could, but once he started downswinging (after the bot seemed to make some adjustments and fix the largest errors), I do believe he wanted to create the perception that he didn't care and was messing around, so fanboys would believe he "could win if he wanted to or really tried". Pretty tough for a huge ego to try your absolute best and come up short in such a public forum.

For what it's worth, I admire and respect Doug. My comments aren't meant to be negative. A massive ego is pretty much required to compete at the highest levels of anything.
05-09-2015 , 01:13 AM
I don't get statistical significance tests anyway.
Why take an arbitrary threshold and make it black/white, significant or not significant, when instead you could just report the p-value and let people decide for themselves what to think of it?
05-09-2015 , 02:04 AM
Quote:
Originally Posted by icoon
I don't get statistical significance tests anyway.
Why take an arbitrary threshold and make it black/white, significant or not significant, when instead you could just report the p-value and let people decide for themselves what to think of it?
It is difficult for anyone not trained in statistics to understand, but the 95% confidence level is commonly used as a trade-off between Type I and Type II statistical errors. Reporting the raw p-value alone is misleading because the untrained don't know how to interpret it correctly in terms of Type I and Type II errors (that is, the risk of false positives and false negatives).

If we could have an infinite sample, none of this would occur. We are dealing with inferential statistics, which infer a belief about the population from a sample. Inferential statistics necessarily mean there is always a Type I error risk and a Type II error risk.

I might add that academic staff outside the mathematical statistics departments, such as computer science staff, get this wrong all the time anyway. They are nearly as bad as lay people, because they do not understand inferential statistics and underestimate its complexity.

Ideally, a statistical power analysis should be used to determine the sample size, based on the expected effect size, before a study begins. In other words, a 95% confidence level and an expected effect of about 10bb/100 should have been used to determine that you need about a 200K sample of hands; 200K is impractical, so 80K ends up as the practical compromise, but then you know going in that a "statistical tie" is almost certain. A decent statistician knows all this before the study begins and alters the study parameters appropriately to get a better chance of having enough statistical power.
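
A rough version of that power calculation, as a sketch: the +/-10.35bb/100 interval at 80K hands implies a standard error of about 5.3bb/100, and 80% power with a two-sided 5% test are conventional assumptions I am plugging in, not anything CMU stated.

Code:
# Rough power analysis sketch (conventional assumptions, not CMU's method).
SE_80K = 10.35 / 1.96  # ~5.28 bb/100, implied by the reported 95% interval
Z_ALPHA = 1.96         # two-sided test at alpha = 0.05
Z_BETA = 0.84          # 80% power

def required_hands(effect_bb100):
    """Hands needed to detect a true effect of `effect_bb100` with ~80% power."""
    target_se = effect_bb100 / (Z_ALPHA + Z_BETA)
    return 80_000 * (SE_80K / target_se) ** 2  # SE scales as 1/sqrt(n)

print(f"{required_hands(10):,.0f} hands")  # ~175,000: ballpark of the 200K above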
05-09-2015 , 04:32 AM
Can we have the data from the match that was used to calculate the variance?

Quote:
It is difficult for anyone not trained in statistics to understand, but the 95% confidence level is commonly used as a trade-off between Type I and Type II statistical errors. Reporting the raw p-value alone is misleading because the untrained don't know how to interpret it correctly in terms of Type I and Type II errors (that is, the risk of false positives and false negatives).
Please... The only non-misleading thing to report is the raw data; then you write:

MY INTERPRETATION OF THE RESULTS:

"Our arbitrarily chosen significance threshold (p < 0.05), as commonly used by our colleagues, indicated this and that."

It is an arbitrary threshold. The term "statistically significant" is very misleading to people who don't understand statistics, because they think of it as an it-matters/it-doesn't-matter thing. As you can see, even some professors think that way, claiming a "statistical tie" or whatnot.

Quote:
I don't get statistical significance tests anyway.
Why take an arbitrary threshold and make it black/white, significant or not significant, when instead you could just report the p-value and let people decide for themselves what to think of it?
Exactly. You had some assumptions before the match. Let's even say you didn't know anything about the playing entities. Now you have 80k hands played, and you can say things like:

"Such an experiment, if repeated, would result in humans winning this percentage of the time." Simple and understandable for anyone without a math background.

Last edited by punter11235; 05-09-2015 at 04:45 AM.
05-09-2015 , 04:38 AM
Quote:
Originally Posted by TimTamBiscuit
The Humans have a very real incentive not to explain to the AI team all of the AI's errors! But just listing the ones they talked about means Claudico has serious weaknesses.
The challenge is over. What would be the "real incentive" not to divulge their opinions on Claudico's weaknesses to its creators?
05-09-2015 , 05:30 AM
Quote:
Originally Posted by TimTamBiscuit
It is difficult for anyone not trained in statistics to understand, but the 95% confidence level is commonly used as a trade-off between Type I and Type II statistical errors. Reporting the raw p-value alone is misleading because the untrained don't know how to interpret it correctly in terms of Type I and Type II errors (that is, the risk of false positives and false negatives).
1. I don't think academics or developers are stupid. They can understand what a p-value is.
2. The 95% confidence level is commonly used. Yes, but it is used ONLY by practicing users of statistics (who of course know what a p-value is).
3. The 95% confidence level is arbitrary.
4. This non-significant result was probably caused only by the combination of an insufficient sample size and a wrongly determined significance threshold.
5. The threshold was determined AFTER the experiment, although we asked them to tell us BEFORE the experiment what would determine "significance".

As far as I see:
- If p=0.05 was determined BEFORE the experiment, they designed an experiment where nobody could win, so they could introduce their AI as a virtually fair competitor to the best human minds, which is a LIE.
- If the threshold was determined AFTER the experiment, then the whole method goes against the scientific way of evaluating an experiment (so it seems somebody is bending the result to favor Claudico), which is scientifically a LIE.

(Two more things:
1. If the report of the event is not for practicing academics but for average joes, it would be much better if they told us the bot's and the human team's estimated Elo scores (treating the 10-20-40-80k games as a sample, and giving, for example, 1000 Elo to the Brains).
2. Anyway, I don't think the sample size was 40k instead of 80k. From the viewpoint of any player, each played 20k independent hands, and there were 4 people who played. It doesn't matter whether it was mirrored or not.)
