Introducing my hand evaluator & equity calculator

07-29-2016 , 11:36 PM
I know I'm kinda late to the party writing a hand evaluator in 2016, but I actually didn't find that many good open source libraries for poker evaluation. There's SKPokerEval, which is reasonably fast, and there's the 2+2 evaluator, which is pretty much optimal at sequential evaluation but really sucks at Monte Carlo because of its 130MB lookup table. There's also Pokerstove, which I didn't try, but the code looked slow.

Anyway I wrote my own C++ evaluator and equity calculator, and I felt like I managed to come up with some cool ideas so I thought I'd share the code. It's available on GitHub. The evaluator is based on the same type of lookup method as SKPokerEval, but I'm also using perfect hashing, a faster flush check, and a 128-bit structure for storing partial hand data. I really wanted to get it down to 64 bits, but couldn't find a way to fit the flush data. And then I found out I can use a single SSE4 op to combine the structs, which gives it a small speed boost, so it's kinda okay using a 128-bit struct.

I'm currently getting 720M evals/s (with MSVC it's slower though) when enumerating, which is 3x faster than SKPE, and half the speed of 2+2 evaluator, but honestly it's getting to the point where it hardly matters anymore because for any practical purpose the overhead from the loop and processing the results is going to dominate. For example my equity calculator is only getting 75M showdowns/s for two players (so it's like 150M evals/s, i.e. evaluation is maybe 1/4 of the total cost).

In random evaluation I'm getting 235M/s so it beats both SKPE (146M/s) and 2+2 (19M/s). The biggest difference to 2+2 of course comes from caching. After perfect hashing my lookup is only 400kB. I could get it to 200kB if I dropped the support for evaluating fewer than 7 cards, but there was negligible performance difference.

For the equity calculator I created both Monte Carlo and enumeration algorithms, and both support multithreading. The Monte Carlo simulation is based on a random walk algorithm which changes only one of the hands on each iteration. It's much faster than a naive Monte Carlo and yet in practice it has the same variance. But the biggest benefit is that it avoids the problem of having to do a full resample on holecard collisions. There are scenarios where the naive method fails 99.9% of the time and slows down to 0.1M hands/s, but the worst case I've seen for the random walker is around 10M/s. With too many dead cards it gets worse, but that's fixable. Overall it works amazingly well and beats Equilab to the ground! (Btw, does anyone know if Power-Equilab is any better?)
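For the curious, the resampling structure looks roughly like this (a toy sketch of the idea only, not the actual OMPEval code; the names here are made up). Each step redeals only the chosen player's cards, so a hole-card collision never forces redealing everyone:

```cpp
#include <array>
#include <random>
#include <set>

// Toy sketch of the random-walk resampling idea: two players, two hole
// cards each, cards numbered 0..51. Collisions are resolved by redrawing
// only the affected cards.
struct Walker {
    std::mt19937_64 rng{12345};
    std::array<std::array<int, 2>, 2> holes{};  // two players, two hole cards each
    std::set<int> used;                         // cards currently dealt out

    Walker() {
        // initial deal: four distinct cards
        for (auto& h : holes)
            for (int& c : h)
                do { c = draw(); } while (!used.insert(c).second);  // redraw on collision
    }

    int draw() { return std::uniform_int_distribution<int>(0, 51)(rng); }

    // One random-walk step: put player p's cards back and deal two fresh ones.
    void step(int p) {
        for (int c : holes[p]) used.erase(c);
        for (int& c : holes[p])
            do { c = draw(); } while (!used.insert(c).second);
    }
};
```

A naive sampler would redraw the whole deal whenever any two hands collide; here only the stepped player's two cards ever need redrawing.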

The enumeration algo is maybe not as good but still much faster than Equilab. I use a lookup table for detecting preflop isomorphism, and it works really well as long as the lookup table doesn't get full. After that it kinda suffers. The lookup also cannot be used if the postflop is too fast to iterate (4+ fixed board cards) because of the overhead. In postflop my algorithm also merges the suits that cannot make a flush anymore which gives it a nice 3x speed boost.
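To give an idea of what preflop isomorphism buys in the simplest single-hand case: two hole cards collapse into one of 169 classes (13 pairs + 78 suited + 78 offsuit), so work is shared across suit permutations. A minimal sketch (the indexing below is my own illustration, not the actual lookup from the code; the real table keys whole multi-player holecard combinations, which is why it can get full):

```cpp
#include <algorithm>

// Map two hole cards to one of 169 preflop classes.
// Cards are rank*4 + suit, ranks 0 (deuce) .. 12 (ace).
static int preflop_class(int c1, int c2) {
    int r1 = c1 / 4, r2 = c2 / 4;
    bool suited = (c1 % 4) == (c2 % 4);
    int hi = std::max(r1, r2), lo = std::min(r1, r2);
    if (hi == lo) return hi;              // 0..12: pocket pairs
    int idx = hi * (hi - 1) / 2 + lo;     // 0..77: unordered rank pairs
    return suited ? 13 + idx : 91 + idx;  // 13..90 suited, 91..168 offsuit
}
```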

I'd be happy to read any comments/suggestions and answer questions.
07-30-2016 , 07:56 AM
Many thanks, cool beans, I'll have a look.
07-31-2016 , 12:46 PM
I'm also coding an equity calculator! The main requirement is to estimate equities in full-ring games, so it should have fairly strong Monte Carlo capabilities.

(I don't think writing an evaluator & equity calculator in 2016 is out of order; I've looked extensively and my conclusion is that the open source supply of such tools is lacking.)

Note: all timings that follow refer to a single Xeon core @2.2GHz.

The first stage was to find a good PRNG. I tested a bunch, and the fastest was this PCG snippet:

Code:
#include <stdint.h>

typedef struct {
    uint64_t state;
    uint64_t inc;  // stream selector; the |1u below keeps it odd
} pcg32_t;

static inline uint32_t pcg32(pcg32_t* rng) {
    uint64_t oldstate = rng->state;
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1u);  // advance internal state
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18u) ^ oldstate) >> 27u);
    uint32_t rot = (uint32_t)(oldstate >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31u));  // 32-bit rotate right
}
It can run about 2 billion times each second, generating 32 random bits each time.

(Quick question. I've noticed you use XorShiftPlus and MT19937. Which do you use for what kind of stuff?)

I'm currently in the second stage, which (for me) is to find a decent evaluator. Since the idea is to get Monte Carlo equities, we need random evals, so any LUT that doesn't fit in L3 cache (say, around 30MB) is out of the question. (I'm ruling out 2+2 and RayW.)

The fastest such evaluator I've found so far is ACE_eval, which you can also get on GitHub. It has a few versions, but the fastest one (ace_eval_decompress.c) uses no LUTs at all. I get about 90M evals per second.

Each card is a 32-bit int. The 4 least-significant bits encode the suit, the next 2 bits are always zero, and the rest contain the rank. The whole deck looks as follows:

Code:
static const uint32_t CARD_KEYS[52] = {
    0x41, 0x101, 0x401, 0x1001, 0x4001, 0x10001, 0x40001, 0x100001, 0x400001, 0x1000001, 0x4000001, 0x10000001, 0x40000001,
    0x42, 0x102, 0x402, 0x1002, 0x4002, 0x10002, 0x40002, 0x100002, 0x400002, 0x1000002, 0x4000002, 0x10000002, 0x40000002,
    0x44, 0x104, 0x404, 0x1004, 0x4004, 0x10004, 0x40004, 0x100004, 0x400004, 0x1000004, 0x4000004, 0x10000004, 0x40000004,
    0x48, 0x108, 0x408, 0x1008, 0x4008, 0x10008, 0x40008, 0x100008, 0x400008, 0x1000008, 0x4000008, 0x10000008, 0x40000008};
The cards, in human-readable form, can be recovered from the following table:

Code:
static const char* CARD_VALS[52] = {
    "2c", "3c", "4c", "5c", "6c", "7c", "8c", "9c", "Tc", "Jc", "Qc", "Kc", "Ac",
    "2d", "3d", "4d", "5d", "6d", "7d", "8d", "9d", "Td", "Jd", "Qd", "Kd", "Ad",
    "2h", "3h", "4h", "5h", "6h", "7h", "8h", "9h", "Th", "Jh", "Qh", "Kh", "Ah",
    "2s", "3s", "4s", "5s", "6s", "7s", "8s", "9s", "Ts", "Js", "Qs", "Ks", "As"};
So, for example, 2c is 0x41, 3c is 0x101, etc.
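From the table, the encoding appears to follow a simple pattern: one suit bit in the low nibble, and a rank bit starting at bit 6 with two bits per rank (presumably the spacing leaves room for rank multiplicities to accumulate by addition). The helper below is just my illustration of that pattern, not part of ACE_eval:

```cpp
#include <stdint.h>

// suit: 0=clubs, 1=diamonds, 2=hearts, 3=spades; rank: 0=deuce .. 12=ace
static inline uint32_t make_card(int rank, int suit) {
    return (1u << suit) | (1u << (6 + 2 * rank));  // e.g. rank 0, suit 0 -> 0x41 (2c)
}
```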

Each hand is then an array of five 32-bit ints, but it's built rather non-intuitively, so I omit the construction.

The evaluation algorithm itself has a lot of branching, but I can't tell how much this affects performance. (Isn't a branch misprediction supposed to cost like 5ns or something? Just a bit over an L1 cache hit?)

At a high level, the branching structure of the algorithm looks as follows:

Code:
if (something)
    return 4 of a kind
else if (something)
    return full house
else if (something)
    return full house
else
    if (something)
        we have a flush, continue
    else if (something)
        we have a flush, continue
    else if (something)
        we have a flush, continue
    else if (something)
        we have a flush, continue

    if (something)
        return straight (or straight flush, if we also had a flush)
    else if (something)
        return flush
    else if (something)
        return 3 of a kind
    else if (something)
        if (something)
            return 2 pair
        if (something)
            return 1 pair

    return high card
I've tried computing equities with Rayeval (GitHub here), based on RayW's legendary LUT, but I only get about 10M simulations per second.

Anyways, I can't tell you how glad I am to see that people are still interested in this topic, since I think it's not yet quite solved! My intuition is that, with current hardware and an algorithm with little/no branching and tiny LUTs (or none at all), it'd be possible to run a few billion Monte Carlo simulations per second on massively parallel architectures. It may be wishful thinking, but hey...

Anyways, to cut this short, I'm gonna do a deep dive of your code, but it's already looking better than the best stuff I could find.
07-31-2016 , 02:46 PM
Quote:
Originally Posted by etale.cohomology
(Quick question. I've noticed you use XorShiftPlus and MT19937. Which do you use for what kind of stuff?)
I was just always under the assumption that Mersenne Twister was one of the highest quality PRNGs, although not the fastest. So my plan was to use XorShift128+ for performance-critical stuff and MT19937 elsewhere. But apparently MT19937 actually fails more statistical tests, so it's probably never worth using. And XorShift128+ is already good enough for anything poker related, especially if you only need something like a 0.01% error margin in Monte Carlo.

There's actually an improved version of it, Xoroshiro128+, which seems strictly better, and I was about to switch to that. (Although it requires that the compiler optimizes rotl to a single instruction, otherwise there's no speed gain over XorShift128+.)
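For reference, the xoroshiro128+ core is tiny; the rotl here is the piece that needs to compile down to a single rotate instruction. This is a minimal sketch with the constants from the published reference implementation (the state must be seeded to something other than all zeros):

```cpp
#include <stdint.h>

// Rotate left; modern compilers turn this idiom into one rol instruction.
static inline uint64_t rotl(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

struct Xoroshiro128Plus {
    uint64_t s[2];  // must not be all zeros

    uint64_t next() {
        uint64_t s0 = s[0], s1 = s[1];
        uint64_t result = s0 + s1;  // the "+" in xoroshiro128+
        s1 ^= s0;
        s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);
        s[1] = rotl(s1, 36);
        return result;
    }
};
```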

Quote:
The fastest such evaluator I've found so far is ACE_eval, which you can also get on Github. It has a few versions, but the fastest one (ace_eval_decompress.c) uses no LUTs at all. I can get about 90M evals per sec.
I was looking at ACE when I started my project, but I just kinda ignored it, because the evaluation function seemed very long. But since it uses no lookups and the hand representation is pretty efficient it might still be competitive. I should benchmark it too.

BTW, one thing worth considering about lookup tables is how sparse they are. In my evaluator the original LUT had a total size of 36MB, but it only touched 1.7MB of cachelines and the rest was empty. So even though perfect hashing reduced it to 0.4MB, that didn't have much performance impact, because the touched part of the original already fit in L3 cache.

Quote:
The evaluation algorithm itself has a lot of branching, but I can't tell how much this affects performance. (Isn't a branch misprediction supposed to cost like 5ns or something? Just a bit over an L1 cache hit?)
These things are a complete mystery to me too, I just make my decisions mostly based on benchmarks.

But I have noticed situations where branch prediction makes a difference. I was experimenting with generating uniform random integers using rejection sampling. E.g. if you need a number from 0 to 6, you generate 3 random bits until the result falls within the range. But what really kills it is branch misprediction (or at least I couldn't come up with another explanation). Theoretically, if the rejection sampling has a 20% failure rate it should be only 25% slower than without failures, but because of branch misprediction it ends up being almost twice as slow.
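A minimal version of that rejection loop looks something like this (a sketch, not production code; the rejection branch depends on random data, which is exactly why the predictor can't learn it):

```cpp
#include <stdint.h>
#include <random>

// Uniform integer in [0, n) via rejection sampling (n >= 2): draw just
// enough random bits to cover the range and retry on out-of-range values.
static uint32_t uniform_below(std::mt19937_64& rng, uint32_t n) {
    uint32_t mask = 1;
    while (mask < n - 1) mask = (mask << 1) | 1;  // smallest all-ones mask covering n-1
    for (;;) {
        uint32_t v = (uint32_t)rng() & mask;  // e.g. 3 random bits for n = 7
        if (v < n) return v;                  // rejects (mask + 1 - n) / (mask + 1) of draws
    }
}
```

For n = 7 the loop keeps 3 random bits and rejects only the value 7, a 1/8 failure rate, yet the mispredicted retry branch costs more than that ratio suggests.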

Quote:
My intuition is that, with current hardware and an algorithm with little/no branching and tiny LUTS (or none at all), it'd be possible to run a few billions of Monte Carlo simulations per second on massively parallel architectures. It may be wishful thinking, but hey...
Probably not billions/s on CPU. But maybe there's a way of doing parallel simulation on GPU with shader programs.
07-31-2016 , 07:49 PM
On using parallel architectures, what functionality could run concurrently?
07-31-2016 , 08:48 PM
Quote:
Originally Posted by adios
On using parallel architectures, what functionality could run concurrently?
Well, generally, you're going to do N independent simulations. I suppose those could be done nearly entirely in parallel. My implementation used multiple machines via an RPC architecture.
08-19-2016 , 11:44 AM
I just finished a very demanding project for a client, and I've finally had the chance to take OMPEval for a spin.

TLDR preliminary tests: it's 2-10 times faster at (random) evaluations than the fastest one I had found so far (ACE).

---------------------
Evaluation tests (one thread of a Xeon Haswell @2.2GHz, g++ 6.1)

Test 1: SSE4 flag OFF; Default flags/code
Flags: -O3 -pthread -Wall
Timings (average over 300 runs)
Random order evaluation (card arrays):
500,000,000 evals 105 MH/s
Random order evaluation (precalculated Hand objects):
500,000,000 evals 305 MH/s

Test 2: SSE4 flag ON; Default flags/code
Flags: -O3 -pthread -Wall -msse4.2
Timings (average over 300 runs)
Random order evaluation (card arrays):
500,000,000 evals 112 MH/s
Random order evaluation (precalculated Hand objects):
500,000,000 evals 262 MH/s

Test 3: SSE4 flag OFF; changed sum += ... to sum = ..., i.e. removed the running sum (why do you keep it anyway?); changed XoroShiro128Plus's seed to non-deterministic.
Flags: -Ofast -pthread -mfpmath=both -march=native
Timings (over 300 runs)
Random order evaluation (card arrays):
500,000,000 evals 108 MH/s
Random order evaluation (precalculated Hand objects):
500,000,000 evals 348 MH/s

Test 4: SSE4 flag ON; rest same as test 3.
Flags: -Ofast -pthread -mfpmath=both -march=native -msse4.2
Timings (average over 300 runs)
Random order evaluation (card arrays):
500,000,000 evals 117 MH/s
Random order evaluation (precalculated Hand objects):
500,000,000 evals 301 MH/s

Test 5: AVX2 flag ON; rest same as test 3.
Flags: -pthread -Ofast -mfpmath=both -march=native -m64 -funroll-loops -mavx2
Timings (average over 300 runs)
Random order evaluation (card arrays):
500,000,000 evals 130 MH/s
Random order evaluation (precalculated Hand objects):
500,000,000 evals 545 MH/s


Notes:
I also tried g++ 4.8, and it was ~5% slower. I think. Benchmarking is hard.
Intel's guide on GCC performance flags

---------------------
Equity tests

14 physical cores, 28 logical

Using 1 thread I get average r.speed of 10
Using 14 threads I get average r.speed of 135
Using 26 threads I get average r.speed of 200

---------------------
Subsequent comments are coming later this week.
08-20-2016 , 05:36 PM
Nice benchmarks. Are tests 1-4 on 32-bit? I didn't think it would even run with SSE without crashing, although it should work now after some improvements I made (it's such a pain to get compilers to align __m128i correctly on x86).

Quote:
removed the running sum (why do you keep it anyway?)
That's one thing I've learned about benchmarking. If you simply discard the result of a calculation, the compiler is allowed to optimize the entire code away. Another thing you can do is assign the result to a volatile variable, but that affects some optimizations.
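To illustrate the two options (a contrived sketch; the multiplier is just a stand-in for an evaluation result):

```cpp
#include <stdint.h>

// Option 1: accumulate into a running sum and actually use it afterwards.
uint64_t sum_sink(int iters) {
    uint64_t sum = 0;
    for (int i = 0; i < iters; ++i)
        sum += (uint64_t)i * 2654435761u;  // stand-in for the evaluated result
    return sum;  // returning/printing the sum prevents dead-code elimination
}

// Option 2: write each result to a volatile sink. The store can't be
// removed, but it forces a memory write per iteration and can inhibit
// other optimizations.
volatile uint64_t g_sink;
void volatile_sink(int iters) {
    for (int i = 0; i < iters; ++i)
        g_sink = (uint64_t)i * 2654435761u;
}
```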
08-20-2016 , 05:45 PM
Quote:
Originally Posted by Zekl
Are the tests 1-4 on 32bit?
No, they're on 64-bit.

Quote:
Originally Posted by Zekl
If you simply discard the result of a calculation then compiler is allowed to optimize the entire code away. Another thing you can do is assign the result to volatile variable but that affects some optimizations.
Ah, to avoid those pesky optimizations I use the volatile trick. I didn't know it had undesirable side effects. I'll keep that in mind. Thanks!
09-02-2016 , 03:06 PM
Looks nice! I didn't check it in detail yet - what info do you pre-calculate in the hand objects to get that huge speed increase?

I posted a similar eval a while back with a 512kB lookup, using 64-bit ints to represent hand info. I'm getting around 150 MH/s for random evals in Java, which seems roughly in line with your card-array version.

Source can be found here:
http://www.holdemresources.net/misc/handeval/

The C version is just a very basic port from Java, and I haven't written C in 15 years - I'm sure you guys could improve it *lol*. The only requirement is the popcnt instruction (-mpopcnt, I guess).
09-05-2016 , 11:18 AM
Oh hey plexiq, I did stumble upon your thread a month ago. I probably spent an hour then trying to figure out your code, and I still don't know where all those shifts and xors and reduce constants come from. It's magic. But I adapted your C code to my benchmarks and got 290M for sequential and 156M/232M for random. Significantly beating SKPokerEval.

Quote:
what info do you pre-calculate in the hand objects to get that huge speed increase?
They're 128-bit objects with the following structure (and I can add them together with _mm_add_epi32):

// Bits 0-31: Key to non-flush lookup table (linear combination of the rank constants).
// Bits 32-35: Card counter.
// Bits 48-63: Suit counters.
// Bits 64-127: Bit mask for cards (suits are in 16-bit groups).

The card counter isn't even needed by the evaluator, but it's useful in the equity calculator. It's a bit faster than doing popcount on the mask. I guess the only real "innovation" here was the suit counters. I initialize them to 3 so that when there are five of the same suit the 4th bit gets flipped, and I can test for any flush by masking with 0x8888.
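In scalar form the suit-counter trick looks like this (a simplified sketch with a plain 16-bit accumulator; in the actual code the counters live in bits 48-63 of the 128-bit struct and hands are combined with _mm_add_epi32):

```cpp
#include <stdint.h>

// Four 4-bit suit counters packed into 16 bits, each starting at 3.
// Adding a card increments one nibble; the fifth card of a suit pushes
// its counter from 7 to 8, setting the nibble's top bit. With at most
// 7 cards a counter never exceeds 10, so nibbles can't overflow.
static inline uint16_t init_suit_counters() { return 0x3333; }

static inline uint16_t add_card(uint16_t counters, int suit) {
    return counters + (uint16_t)(1u << (4 * suit));  // suit: 0..3
}

static inline bool has_flush(uint16_t counters) {
    return (counters & 0x8888) != 0;  // any top bit set => 5+ of one suit
}
```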
09-05-2016 , 12:26 PM
The suit counter is indeed a great idea, thanks for sharing! I think that may actually work in my eval as well, I'll do some testing. I've tried working through your code a bit, but my C++ is awfully rusty.

(As for the cryptic reduce/flush constants, the "manual" part gets all non-quad hands down to <=26 bits, and with a bit of manual tweaking that goes down to 24 bits. These constants are then basically just brute-forced to map down to 18 bits without rank collisions. I'm sure there are more elegant ways to achieve that.)
09-05-2016 , 03:34 PM
I linked this thread in the classic 7 card evaluator thread, seems appropriate
09-05-2016 , 04:52 PM
I played around with the suit counter a bit and get a 5-10% speedup for my eval; it gets rid of the popcnt too. I'm still limited to a 64-bit representation and can't "pre-process" the rank structure like you do, so the improvement isn't quite as drastic. (I only had 12 bits spare and couldn't use your 4th-bit trick; I resolved that with a 4kB lookup though.)

      