CFR RPS code

05-15-2014 , 11:48 PM
Hi,

I'm new to programming and to CFR. Could you guys check this code? I'm getting a different result than I expect.

The game is RPS where the opponent has an imbalanced strategy: he throws Rock 40% of the time (oppStrategy = {0.4, 0.3, 0.3}).

What should happen: the program learns to throw Paper ~100%, since Paper is the best response against that mix (EV = 0.4 - 0.3 = +0.1, vs 0 for Rock and -0.1 for Scissors).
What I expect to see numerically: the regret for Paper builds up much faster than for the other two actions, so over a large sample the other two frequencies tend toward zero but never quite reach it.
What actually happens: the output is roughly {1/3, 1/3, 1/3}, which is the NE of the symmetric game but not what should be happening against this opponent.

Code:
import java.util.Arrays;
import java.util.Random;

public class RPSTrainer
{
    // set values to each move and for total number of actions
    // the actual numbers assigned to these moves don't matter in this case since the game is symmetric
    public static final int ROCK = 0, PAPER = 1, SCISSORS = 2, NUM_ACTIONS = 3;
    // initiate randomizer
    public static final Random random = new Random();
    // create array to store regrets
    double[] regretSum = new double[NUM_ACTIONS],
    // create array to store temporary strategies
    strategy = new double[NUM_ACTIONS],
    // create array to store total average strategy
    strategySum = new double[NUM_ACTIONS],
    // define the opponent's fixed (non-GTO) strategy
    oppStrategy = { 0.4, 0.3, 0.3 };
    
    // Main method to run the whole training session
    public static void main(String args[]) {
        RPSTrainer trainer = new RPSTrainer();
        trainer.train(1000000);
        System.out.println(Arrays.toString(trainer.getAverageStrategy()));
    }
    
    
    // returns a strategy array computed from the accumulated regrets (regret matching)
    // each action's positive regret is divided by the total positive regret to get its frequency
    // if there are no positive regrets, each action is given an equal likelihood
    private double[] getStrategy() {
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++) {
            strategy[a] = regretSum[a] > 0 ? regretSum[a] : 0;
            normalizingSum += strategy[a];
        }
        for (int a = 0; a < NUM_ACTIONS; a++) {
            if (normalizingSum > 0) {
                strategy[a] /= normalizingSum;
            }
            else {
                strategy[a] = 1.0 / NUM_ACTIONS;
                strategySum[a] += strategy[a];
            }
        }
        return strategy;
    }
    
    // returns a number representing the action chosen by Hero, as defined at the top of the code
    // draws a random number, then runs through each option, stopping once the cumulative prob
    // exceeds the random number
    public int getAction(double[] strategy) {
        double r = random.nextDouble();
        int a = 0;
        double cumulativeProbability = 0;
        while (a < NUM_ACTIONS - 1) {
            cumulativeProbability += strategy[a];
            if (r < cumulativeProbability) {
                break;
            }
            a++;
        }
        return a;
    }
    
    // runs a given number of training iterations
    // on each iteration, adds each action's regret to its running regret sum
    public void train(int iterations) {
        double[] actionUtility = new double[NUM_ACTIONS];
        for (int i = 0; i < iterations; i++) {
            //Get regret-matched mixed-strategy actions
            double[] strategy = getStrategy();
            int myAction = getAction(strategy);
            int otherAction = getAction(oppStrategy);
            //Compute action utilities: each action loses to the next one (mod 3), so the action
            //after the opponent's scores +1 and the action before it scores -1 (tie = 0)
            actionUtility[otherAction] = 0;
            actionUtility[otherAction == NUM_ACTIONS - 1 ? 0 : otherAction + 1] = 1;
            actionUtility[otherAction == 0 ? NUM_ACTIONS - 1 : otherAction - 1] = -1;
            //Accumulate action regrets
            for (int a = 0; a < NUM_ACTIONS; a++)
            {
                regretSum[a] += actionUtility[a] - actionUtility[myAction];
                
            }
        }
    }
    
    // returns the average strategy over all the iterations so far
    // normalizes the accumulated strategy sums so they add up to 1
    public double[] getAverageStrategy() {
        double[] avgStrategy = new double[NUM_ACTIONS];
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++)
            normalizingSum += strategySum[a];
        for (int a = 0; a < NUM_ACTIONS; a++)
        if (normalizingSum > 0)
            avgStrategy[a] = strategySum[a] / normalizingSum;
        else {
            avgStrategy[a] = 1.0 / NUM_ACTIONS;
        }
        return avgStrategy;
    }
    
}
Can you figure out the mistake?
05-16-2014 , 05:17 AM
Is this the year where everyone switches from "poker pro" to "computer programmer"?
05-16-2014 , 08:37 AM
Quote:
Originally Posted by ibavly
Hi,

I'm new to programming and to CFR. Could you guys check this code? I'm getting a different result than I expect.

The game is RPS where the opponent has an imbalanced strategy: he throws Rock 40% of the time (oppStrategy = {0.4, 0.3, 0.3}).

What should happen: the program learns to throw Paper ~100%, since Paper is the best response against that mix (EV = 0.4 - 0.3 = +0.1, vs 0 for Rock and -0.1 for Scissors).
What I expect to see numerically: the regret for Paper builds up much faster than for the other two actions, so over a large sample the other two frequencies tend toward zero but never quite reach it.
What actually happens: the output is roughly {1/3, 1/3, 1/3}, which is the NE of the symmetric game but not what should be happening against this opponent.

[code snipped -- same as the first post]
Can you figure out the mistake?

I think the error is in getStrategy: strategySum is only updated inside the else branch, i.e. on iterations where there is no positive regret, so the average strategy barely accumulates anything. Move the accumulation outside the if/else so every iteration's strategy gets counted:

Code:
private double[] getStrategy() {
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++) {
            strategy[a] = regretSum[a] > 0 ? regretSum[a] : 0;
            normalizingSum += strategy[a];
        }
        for (int a = 0; a < NUM_ACTIONS; a++) {
            if (normalizingSum > 0) {
                strategy[a] /= normalizingSum;
            }
            else {
                strategy[a] = 1.0 / NUM_ACTIONS;
  //              strategySum[a] += strategy[a];
            }
          strategySum[a] += strategy[a];   
        }
        return strategy;
    }
05-16-2014 , 10:42 AM
That worked, thanks!

As expected we didn't get a pure strategy, but it's really close: 99.99% paper.
05-17-2014 , 01:27 PM
Now I tried to make the RPS work where both players are using regret matching.

I got the right result, but since I'm such a noob I was just guessing on a bunch of stuff, so it's possible I got the right result despite my mistakes. I'm not sure if the way I changed getAverageStrategy() to take an input array is correct, and I also don't know exactly what I did to make the println work. How does trainer.getAverageStrategy(trainer.strategySum) reference the definition properly? I don't understand why it doesn't work like getAction(strategy), which is much cleaner.

I'm also not sure if there are spots where I could simplify the code. Specifically, in the getStrategy function, I chose to make a separate getOppStrategy function because I needed to update the opponent's regrets and strategySums too, and couldn't think of any good way of differentiating them within one function.

Code:
import java.util.Arrays;
import java.util.Random;

public class RPSTrainer
{
    // set values to each move and for total number of actions
    // the actual numbers assigned to these moves don't matter in this case since the game is symmetric
    public static final int ROCK = 0, PAPER = 1, SCISSORS = 2, NUM_ACTIONS = 3;
    // initiate randomizer
    public static final Random random = new Random();
    // create array to store regrets
    double[] regretSum = new double[NUM_ACTIONS],
    // create array to store opponent's regrets
    oppRegretSum = new double[NUM_ACTIONS],
    // create array to store temporary strategies
    strategy = new double[NUM_ACTIONS],
    // create array to store total average strategy
    strategySum = new double[NUM_ACTIONS],
    // create array to store temporary opponent strategies
    oppStrategy = new double[NUM_ACTIONS],
    //create array to store total average opponent strategy
    oppStrategySum = new double[NUM_ACTIONS];
    
    // Main method to run the whole training session
    public static void main(String args[]) {
        RPSTrainer trainer = new RPSTrainer();
        trainer.train(1000000);
        System.out.println("Hero Strat:");
        System.out.println(Arrays.toString(trainer.getAverageStrategy(trainer.strategySum)));
        System.out.println("Villain Strat");
        System.out.println(Arrays.toString(trainer.getAverageStrategy(trainer.oppStrategySum)));
    }
    
    
    // returns a strategy array computed from the accumulated regrets (regret matching)
    // each action's positive regret is divided by the total positive regret to get its frequency
    // if there are no positive regrets, each action is given an equal likelihood
    private double[] getStrategy() {
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++) {
            strategy[a] = regretSum[a] > 0 ? regretSum[a] : 0;
            normalizingSum += strategy[a];
        }
        for (int a = 0; a < NUM_ACTIONS; a++) {
            if (normalizingSum > 0)
                strategy[a] /= normalizingSum;
            else
                strategy[a] = 1.0 / NUM_ACTIONS;
            strategySum[a] += strategy[a];
        }
        return strategy;
    }
    // returns the opponent's strategy array computed from the opponent's accumulated regrets (regret matching)
    // each action's positive regret is divided by the total positive regret to get its frequency
    // if there are no positive regrets, each action is given an equal likelihood
    private double[] getOppStrategy() {
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++) {
            oppStrategy[a] = oppRegretSum[a] > 0 ? oppRegretSum[a] : 0;
            normalizingSum += oppStrategy[a];
        }
        for (int a = 0; a < NUM_ACTIONS; a++) {
            if (normalizingSum > 0)
                oppStrategy[a] /= normalizingSum;
            else
                oppStrategy[a] = 1.0 / NUM_ACTIONS;
            oppStrategySum[a] += oppStrategy[a];
        }
        return oppStrategy;
    }
    // returns a number representing the action chosen by Hero, as defined at the top of the code
    // draws a random number, then runs through each option, stopping once the cumulative prob
    // exceeds the random number
    public int getAction(double[] strategy) {
        double r = random.nextDouble();
        int a = 0;
        double cumulativeProbability = 0;
        while (a < NUM_ACTIONS - 1) {
            cumulativeProbability += strategy[a];
            if (r < cumulativeProbability) {
                break;
            }
            a++;
        }
        return a;
    }
    
    // runs a given number of training iterations
    // on each iteration, adds each action's regret to its running regret sum
    public void train(int iterations) {
        double[] actionUtility = new double[NUM_ACTIONS];
        for (int i = 0; i < iterations; i++) {
            //Get regret-matched mixed-strategy actions
            double[] strat = getStrategy();
            double[] oppStrat = getOppStrategy();
            int myAction = getAction(strat);
            int otherAction = getAction(oppStrat);
            //Compute action utilities for Hero
            actionUtility[otherAction] = 0;
            actionUtility[otherAction == NUM_ACTIONS - 1 ? 0 : otherAction + 1] = 1;
            actionUtility[otherAction == 0 ? NUM_ACTIONS - 1 : otherAction - 1] = -1;
            //Accumulate action regrets for Hero
            for (int a = 0; a < NUM_ACTIONS; a++)
            {
                regretSum[a] += actionUtility[a] - actionUtility[myAction];
                
            }
            //Compute action utilities for Villain
            actionUtility[myAction] = 0;
            actionUtility[myAction == NUM_ACTIONS - 1 ? 0 : myAction + 1] = 1;
            actionUtility[myAction == 0 ? NUM_ACTIONS - 1 : myAction - 1] = -1;
            //Accumulate action regrets for Villain
            for (int a = 0; a < NUM_ACTIONS; a++)
            {
                oppRegretSum[a] += actionUtility[a] - actionUtility[otherAction];
                
            }
        }
    }
    
    // returns the average strategy over all the iterations so far
    // normalizes the accumulated strategy sums so they add up to 1
    public double[] getAverageStrategy(double[] strategySum) {
        double[] avgStrategy = new double[NUM_ACTIONS];
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++)
            normalizingSum += strategySum[a];
        for (int a = 0; a < NUM_ACTIONS; a++)
        if (normalizingSum > 0)
            avgStrategy[a] = strategySum[a] / normalizingSum;
        else {
            avgStrategy[a] = 1.0 / NUM_ACTIONS;
        }
        return avgStrategy;
    }
    
}
05-17-2014 , 05:10 PM
I don't know Java, but I'm pretty sure this stuff is universal, so bear with me.

Quote:
I'm not sure if the way I changed getAverageStrategy() to include an inputted array is correct
It works correctly, but here are a few thoughts. Since your strategySum is also a class-level variable you could just access it directly without having to pass it as a parameter. You use the constant NUM_ACTIONS inside the method, so it won't work anyway unless your argument is an array of length NUM_ACTIONS; if you wanted the method to work on any array, you'd need to find out its length (e.g. from the array's length property) or take it as an additional parameter. getAverageStrategy also seems like a bad name -- it should just be called normalizeStrategy, because that's all it's doing, and "average strategy" means something specific in CFR, so it's confusing. I'm pretty sure you want curly braces around that for loop. Also, hoisting the if outside the loop (one loop per branch) would be free efficiency -- more important in getStrategy, since that one runs inside the big training loop. Something like the sketch below.
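Roughly what I mean (I don't know Java, so treat this as an untested sketch; the name normalizeStrategy and the length-based loop are just the suggestions above, not anything from the book):

Code:
    // normalize any array of non-negative sums into a probability distribution;
    // the length check and the hoisted if are the two changes suggested above
    public double[] normalizeStrategy(double[] sums) {
        double[] normalized = new double[sums.length];
        double normalizingSum = 0;
        for (int a = 0; a < sums.length; a++) {
            normalizingSum += sums[a];
        }
        if (normalizingSum > 0) {
            for (int a = 0; a < sums.length; a++) {
                normalized[a] = sums[a] / normalizingSum;
            }
        } else {
            for (int a = 0; a < sums.length; a++) {
                normalized[a] = 1.0 / sums.length;  // nothing accumulated yet: uniform
            }
        }
        return normalized;
    }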

Quote:
How does trainer.getAverageStrategy(trainer.strategySum) reference the definition properly? I don't understand why it doesn't work like getAction(strategy) which is much cleaner.
It does work like getAction, if you're talking about the parameters and class variables: in both you're passing an array as an argument, using its contents to compute something, and returning the result. If you're referring to the fact that you write trainer.getAverageStrategy but just getAction on its own, it seems to be because in the first case you've made an instance of your class and are calling that instance's method from the static main, while in the second case you're calling the method from within the class's own code, and I guess in Java you don't have to reference the instance again to do that.

In my humble opinion, I would suggest doing away with the class/global variables completely, as they just make it harder to understand what's going on and generally don't help matters much. In getAction(strategy) and getAverageStrategy(strategySum), for instance, the parameter names plainly conflict with the class variables. Create the arrays in the methods where they are used and pass them around as needed.

For instance, getStrategy and getOppStrategy are exactly the same. One method could just take the relevant regret array (and strategy-sum array) as parameters, and you pass in the opponent's or the hero's as necessary, just like you did in getAction -- see the sketch below.
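Something like this rough sketch (again untested; the parameter names are made up, and it also takes the strategy-sum array since your version updates strategySum as a side effect):

Code:
    // one regret-matching method for both players: pass in whichever player's
    // regret array and strategy-sum array you want to use/update
    private double[] getStrategy(double[] regrets, double[] sums) {
        double[] strat = new double[NUM_ACTIONS];
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++) {
            strat[a] = regrets[a] > 0 ? regrets[a] : 0;  // only positive regrets contribute
            normalizingSum += strat[a];
        }
        for (int a = 0; a < NUM_ACTIONS; a++) {
            if (normalizingSum > 0) {
                strat[a] /= normalizingSum;
            } else {
                strat[a] = 1.0 / NUM_ACTIONS;  // no positive regret: play uniformly
            }
            sums[a] += strat[a];  // accumulate for the average strategy
        }
        return strat;
    }

    // then in train():
    //     double[] strat    = getStrategy(regretSum, strategySum);
    //     double[] oppStrat = getStrategy(oppRegretSum, oppStrategySum);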
05-17-2014 , 07:45 PM
Quote:
Originally Posted by ibavly
I'm not sure if the way I changed getAverageStrategy() to include an inputted array is correct, and also don't know exactly what I did to make the println work. How does trainer.getAverageStrategy(trainer.strategySum) reference the definition properly? I don't understand why it doesn't work like getAction(strategy) which is much cleaner.
main is a static method of the class RPSTrainer; strategySum is a variable of the object trainer. If you introduced a new instance method, you could reference the variables more intuitively. E.g. this should work when you call trainer.print() after trainer.train() in main (untested):

Code:
public void print() {
  System.out.println("Hero Strat:");
  System.out.println(Arrays.toString(getAverageStrategy(strategySum)));
  System.out.println("Villain Strat");
  System.out.println(Arrays.toString(getAverageStrategy(oppStrategySum)));
}
Quote:
Originally Posted by ibavly
I'm also not sure if there are spots I could simplify the code. Specifically in the getStrategy function, I chose to make another getoppStrategy function because I needed to update the regrets and strategySums, and couldn't think of any good way of differentiating them within the function.
Your functions modify variables outside their own scope and should be named accordingly -- or stop doing that. You can choose between a more procedural and a more OO style.

I hope you don't mind an additional remark: your code would be more readable if you followed the "always type, always block" principle -- always declare explicit types, and always put braces around if/else and loop bodies, even one-liners.
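For instance, your getAverageStrategy written in that style (untested; same logic, just with every body in braces):

Code:
    public double[] getAverageStrategy(double[] strategySum) {
        double[] avgStrategy = new double[NUM_ACTIONS];
        double normalizingSum = 0;
        for (int a = 0; a < NUM_ACTIONS; a++) {
            normalizingSum += strategySum[a];
        }
        for (int a = 0; a < NUM_ACTIONS; a++) {
            if (normalizingSum > 0) {
                avgStrategy[a] = strategySum[a] / normalizingSum;
            } else {
                avgStrategy[a] = 1.0 / NUM_ACTIONS;
            }
        }
        return avgStrategy;
    }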
06-23-2017 , 10:57 AM
I know this is old, but I'm working on the same problem and came across this. I took a look at your code and I think the only reason that you get the right answer is a coincidence. The strategies initialise as 0,0,0 and the getStrategy method changes them to 0.33, 0.33, 0.33 as a starting point. It only runs one iteration so it just stops there. I think you need to initialise each competitor with a specific opponent strategy and run iterations from that point. I've got it running but it doesn't converge for me yet. I'm not sure how long it should take.

06-30-2017 , 05:39 PM
Quote:
Originally Posted by CaptnCanary
I know this is old, but I'm working on the same problem and came across this. I took a look at your code and I think the only reason that you get the right answer is a coincidence. The strategies initialise as 0,0,0 and the getStrategy method changes them to 0.33, 0.33, 0.33 as a starting point. It only runs one iteration so it just stops there. I think you need to initialise each competitor with a specific opponent strategy and run iterations from that point. I've got it running but it doesn't converge for me yet. I'm not sure how long it should take.
It's safe to say that OP has either fixed the problem or moved on from it entirely in the 3 years since he made this thread
06-30-2017 , 09:55 PM
Quote:
Originally Posted by CaptnCanary
I know this is old, but I'm working on the same problem and came across this. I took a look at your code and I think the only reason that you get the right answer is a coincidence. The strategies initialise as 0,0,0 and the getStrategy method changes them to 0.33, 0.33, 0.33 as a starting point. It only runs one iteration so it just stops there. I think you need to initialise each competitor with a specific opponent strategy and run iterations from that point. I've got it running but it doesn't converge for me yet. I'm not sure how long it should take.


That did the trick, thanks!
10-31-2017 , 07:18 PM
Quote:
Originally Posted by Craggoo
Is this the year where everyone switches from "poker pro" to "computer programmer"?
Yes, yes it is. And the code is all in a pdf.

http://modelai.gettysburg.edu/2013/cfr/cfr.pdf

in b4 lol JAVA. Because, you know, everyone in groundbreaking research is like, omg lets use JAVA.

      