Counterfactual Regret Minimization (CFR)

01-19-2017 , 03:58 AM
I'd really like to start using CFR to write my own programs to help me study the game even more.

I was successfully able to use CFR to work through Rock-Paper-Scissors, which seemed trivial. So I decided to take what I learned from that and apply it to a river situation that was worked through in Will Tipton's book, Expert Heads Up No Limit Hold'Em (pg 48).
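
In case it helps to see what I mean, the core of my RPS version is basically regret matching along these lines (a simplified sketch of the idea, not my exact code; the real thing is in the repo linked below):

Code:
import random

ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors
regret_sum = [0.0] * ACTIONS    # cumulative regrets
strategy_sum = [0.0] * ACTIONS  # for the average strategy

def get_strategy():
    # Regret matching: play each action in proportion to its positive regret.
    positive = [max(r, 0.0) for r in regret_sum]
    total = sum(positive)
    if total > 0:
        strategy = [p / total for p in positive]
    else:
        strategy = [1.0 / ACTIONS] * ACTIONS
    for a in range(ACTIONS):
        strategy_sum[a] += strategy[a]
    return strategy

def utility(my_action, opp_action):
    # +1 for a win, 0 for a tie, -1 for a loss.
    if my_action == opp_action:
        return 0
    return 1 if (my_action - opp_action) % 3 == 1 else -1

def train(iterations, opp_strategy):
    for _ in range(iterations):
        strategy = get_strategy()
        my_action = random.choices(range(ACTIONS), weights=strategy)[0]
        opp_action = random.choices(range(ACTIONS), weights=opp_strategy)[0]
        my_util = utility(my_action, opp_action)
        for a in range(ACTIONS):
            # Regret = what action a would have gotten minus what I actually got.
            regret_sum[a] += utility(a, opp_action) - my_util

def get_average_strategy():
    total = sum(strategy_sum)
    if total > 0:
        return [s / total for s in strategy_sum]
    return [1.0 / ACTIONS] * ACTIONS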

The book breaks the spot down by giving both players a range on the river, with a pot of 53bb and effective stacks of 48.5bb.

Hero can check or shove. If hero checks it's assumed that villain always shoves and hero always folds. If hero shoves, villain can either call or fold.
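
To be concrete about the game tree, the terminal payoffs I have in mind look roughly like this (a simplified sketch, not my actual code; I'm measuring everything in bb won by hero relative to the start of the river, which is just one possible convention):

Code:
POT = 53.0     # pot at the start of the river (bb)
STACK = 48.5   # effective stacks behind (bb)

def hero_payoff(hero_action, villain_action, hero_equity):
    # hero_equity is 1.0 if hero wins at showdown, 0.5 on a chop, 0.0 if he loses.
    if hero_action == "check":
        # Villain is assumed to always shove and hero to always fold.
        return 0.0
    if villain_action == "fold":
        # Hero shoves and picks up the pot uncontested.
        return POT
    # Hero shoves, villain calls: showdown for the pot plus both stacks.
    return hero_equity * (POT + 2 * STACK) - STACK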

I feel like I used an identical algorithmic approach to this problem as I did with Rock-Paper-Scissors; however, for some reason I'm coming up with a different answer than the one given in the book. The book says hero shoves all hands and villain calls with 98 combos (the NE). I'm getting that hero shoves everything, but villain only calls with 68 of his combos (both the average strategy and the overall strategy come to this conclusion).

I also created a test of my own where hero has 2 combos, one a value hand and the other a bluff, and villain has 4 hands, all of which beat the bluff and one of which also beats the value hand. For some reason, with hero shoving both combos, the average strategy has villain calling only with the hand that beats both the bluff and the value hand. The overall strategy, not the average, has villain calling with 3 combos, which makes sense.

I've looked it over multiple times trying to pin down where the flaw could be, but at this point I'm lost and was hoping someone here could help me out.

GitHub Link
01-19-2017 , 05:53 AM
I changed a few things after emailing Michael Johanson (University of Alberta). Instead of iterating over each hand in the players' ranges, I've switched to a Monte Carlo-style approach.

I also realized that I wasn't taking card removal effects into account. These changes not only sped up my code by a decent amount, but accounting for card removal also increased how often the program thinks villain should call, to around 80 combos. That's still less than the 98 the book says is correct, though.

If anyone has any suggestions, or is aware of something I've missed, I could really use the help.
01-19-2017 , 05:56 AM
There are a few hands that seem like definite calls to me but that the program disagrees with. For instance, villain has three combos each of Aces, Kings, and Queens. I've tested this multiple times by running hero's range against just those 9 holdings, together and one by one, and every time it says to call with the Aces but fold the Kings and Queens. That seems bizarre to me; they have to have nearly the same value on this board.
01-19-2017 , 09:17 AM
I just quickly looked over your code and my python is really bad, so nothing in-depth.

Two observations:
*) You calculate regret for action x as (utility_x - utility_chosen_action); it's usually (utility_x - utility_strategy). I suspect this shouldn't cause wrong strategies, but it may cause slower convergence.
*) Re: performance, you probably shouldn't use the 2+2 hand eval for Monte Carlo calculations if you care about speed. It's about an order of magnitude slower than other evaluators in your use case. May not matter in python, idk.
01-19-2017 , 09:39 AM
This looks like a bug in your card-removal check:
Code:
        while unique == False:
            hand = tuple(random.choice(hero_range))
            for c in hand:
                if c in hand2: 
                    unique = False
                else:
                    unique = True
Again, my python is bad, but it looks like this will evaluate to unique = True as long as the last card in hand is not in hand2.

Something like this should work:
Code:
        while unique == False:
            hand = tuple(random.choice(hero_range))
            unique = True
            for c in hand:
                if c in hand2: 
                    unique = False
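
Or, a bit more compactly (again, not sure what counts as idiomatic python, so treat this as a sketch):

Code:
        while True:
            hand = tuple(random.choice(hero_range))
            if set(hand).isdisjoint(hand2):  # no card shared with the other hand
                break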
01-19-2017 , 10:10 AM
Quote:
Originally Posted by plexiq
I just quickly looked over your code and my python is really bad, so nothing in-depth.

Two observations:
*) You calculate regret for action x as (utility_x - utility_chosen_action); it's usually (utility_x - utility_strategy). I suspect this shouldn't cause wrong strategies, but it may cause slower convergence.
*) Re: performance, you probably shouldn't use the 2+2 hand eval for Monte Carlo calculations if you care about speed. It's about an order of magnitude slower than other evaluators in your use case. May not matter in python, idk.
Regarding the first, I haven't seen it calculated that way, but most of my knowledge of CFR comes from the paper "An Introduction to Counterfactual Regret Minimization", where they explain CFR and use it to solve Rock-Paper-Scissors.

Regarding the second, I honestly didn't even think about this. Thank you for the suggestion; I'll look into some other hand evaluators.


Quote:
Originally Posted by plexiq
This looks like a bug in your card-removal check:
Code:
        while unique == False:
            hand = tuple(random.choice(hero_range))
            for c in hand:
                if c in hand2: 
                    unique = False
                else:
                    unique = True
Again, my python is bad, but it looks like this will evaluate to unique = True as long as the last card in hand is not in hand2.

Something like this should work:
Code:
        while unique == False:
            hand = tuple(random.choice(hero_range))
            unique = True
            for c in hand:
                if c in hand2: 
                    unique = False
Omg, thank you! I can't believe I messed something so simple up. After this change I'm getting the right answer. Seriously, thank you, I was starting to lose my mind.
01-19-2017 , 10:39 AM
Quote:
Originally Posted by 4bet20x
Regarding the first, I haven't seen it calculated that way, but most of my knowledge of CFR comes from the paper "An Introduction to Counterfactual Regret Minimization", where they explain CFR and use it to solve Rock-Paper-Scissors.
Ah ok, that's actually how they do it in the RPS example of section "Regret Matching and Minimization".

For CFR they consistently use the other version though, e.g. in section 3.3 of that same paper, "Counterfactual Regret Minimization", page 11:
Quote:
The difference between the value of always choosing action a and the expected value when the players use [the strategy profile] σ is an action's regret, which is then weighted by the probability that other player(s) (including chance) will play to reach the node.
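In symbols that's roughly r(I, a) = pi_{-i}(I) * ( v(I, a) - v(I, sigma) ), i.e. the gain from always picking a over playing the current strategy at that node, weighted by the reach probability of the other player(s) and chance. (I'm glossing over the exact notation, but that's the idea.)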
01-19-2017 , 11:02 AM
Quote:
Originally Posted by plexiq
Ah ok, that's actually how they do it in the RPS example of section "Regret Matching and Minimization".

For CFR they consistently use the other version though, e.g. in section 3.3 of that same paper, "Counterfactual Regret Minimization", page 11:
Isn't my code doing that? I mean, it's choosing a random action, but the random action is chosen according to the current strategy profile. Unless I'm misunderstanding, that's what the following function does, isn't it?

Code:
    def getAction(self,hand):
        r = random.random()
        a = 0
        cumulativeProbability = 0.0
        while a < (self.actions):
            cumulativeProbability += self.strategy[hand]["strategy"][a]
            if r < cumulativeProbability:
                return a
            a+=1
It makes it more likely to choose the action that's currently best according to the strategy profile at this time step; it just doesn't always pick it. Somewhat like a greedy function?
01-19-2017 , 11:22 AM
Your version subtracts the right quantity on average, so it won't affect correctness. But using the strategy utility instead of the sampled action's utility reduces variance and should help slightly with convergence speed, at basically no additional runtime cost, since you already have all the utilities needed to compute the strategy utility anyway.

Random example: we're at the first iteration and you have two actions (a, b). Both cumulative regrets are still zero, and your strategy profile for that iteration is (0.5, 0.5).

You calculate the utilities as utility_a=1, utility_b=0.

Your version results in the following if you randomly picked action a:
Regrets: (0, -1)
Next profile: (0.5, 0.5) (!)

Your version results in the following if you randomly picked action b:
Regrets: (1, 0)
Next profile: (1, 0)

The alternative version always results in the following:
Regrets: (0.5, -0.5)
Next profile: (1, 0)
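
If you want to check those numbers quickly, here's a throwaway python snippet (toy values from above, nothing taken from your code):

Code:
def regret_match(regrets):
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)

utils = [1.0, 0.0]      # utility_a, utility_b
strategy = [0.5, 0.5]   # current profile

# your version, having randomly picked action a:
print(regret_match([u - utils[0] for u in utils]))   # -> [0.5, 0.5]

# your version, having randomly picked action b:
print(regret_match([u - utils[1] for u in utils]))   # -> [1.0, 0.0]

# alternative version, subtracting the strategy utility:
node_util = sum(s * u for s, u in zip(strategy, utils))
print(regret_match([u - node_util for u in utils]))  # -> [1.0, 0.0]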

Check the example on page 16 of that paper; they use "nodeUtil" there.

Code:
for (int a = 0; a < NUM_ACTIONS; a++) {
    ...
    nodeUtil += strategy[a] * util[a];
}

for (int a = 0; a < NUM_ACTIONS; a++) {
    double regret = util[a] - nodeUtil;
    node.regretSum[a] += (player == 0 ? p1 : p0) * regret;
}

01-19-2017 , 12:28 PM
Quote:
Originally Posted by plexiq
Your version subtracts the right quantity on average, so it won't affect correctness. But using the strategy utility instead of the sampled action's utility reduces variance and should help slightly with convergence speed, at basically no additional runtime cost, since you already have all the utilities needed to compute the strategy utility anyway.

Random example: we're at the first iteration and you have two actions (a, b). Both cumulative regrets are still zero, and your strategy profile for that iteration is (0.5, 0.5).

You calculate the utilities as utility_a=1, utility_b=0.

Your version results in the following if you randomly picked action a:
Regrets: (0, -1)
Next profile: (0.5, 0.5) (!)

Your version results in the following if you randomly picked action b:
Regrets: (1, 0)
Next profile: (1, 0)

The alternative version always results in the following:
Regrets: (0.5, -0.5)
Next profile: (1, 0)

Check the example on page 16 of that paper; they use "nodeUtil" there.

Code:
for (int a = 0; a < NUM_ACTIONS; a++) {
    ...
    nodeUtil += strategy[a] * util[a];
}

for (int a = 0; a < NUM_ACTIONS; a++) {
    double regret = util[a] - nodeUtil;
    node.regretSum[a] += (player == 0 ? p1 : p0) * regret;
}
Alright, sorry, I had to think about that for a while before I understood what was actually happening. It's still a little fuzzy, but I think I'm getting it.

Is this right? Or closer to being right?

Code:
        # Accumulate action regrets (inside the training loop)
        p1nodeUtil = 0
        p2nodeUtil = 0
        for a in range(2):
            p1nodeUtil += p1.strategy[hand]["strategy"][a] * p1_utilities[a][p2Action]
            p2nodeUtil += p2.strategy[hand2]["strategy"][a] * p2_utilities[p1Action][a]
        for a in range(2):
            p1.strategy[hand]["regretSum"][a] += (p1_utilities[a][p2Action] - p1nodeUtil)
            p2.strategy[hand2]["regretSum"][a] += (p2_utilities[p1Action][a] - p2nodeUtil)
It's a bit like the update in Q-learning, if I'm not mistaken.
01-19-2017 , 01:02 PM
Yup, that looks correct.
01-19-2017 , 01:13 PM
Thanks for the help.