I'm also coding an equity calculator! The main requirement is to estimate equities in full-ring games, so it should have fairly strong Monte Carlo capabilities.
(I don't think writing an evaluator & equity calculator in 2016 is out of order; I've looked extensively and my conclusion is that the open source supply of such tools is lacking.)
Note: all timings that follow refer to a single Xeon core @2.2GHz.
The first stage was to find a good PRNG. I tested a bunch, and the fastest was this PCG snippet:
Code:
typedef struct{
    uint64_t state;
    uint64_t inc;  // stream selector; forced odd below
}pcg32_t;
static inline uint32_t pcg32(pcg32_t* rng){
    uint64_t oldstate = rng->state;
    rng->state = oldstate*6364136223846793005ULL + (rng->inc|1u); // advance internal state (LCG step)
    uint32_t xorshifted = (uint32_t)(((oldstate>>18u)^oldstate) >> 27u); // must truncate to 32 bits
    uint32_t rot = (uint32_t)(oldstate>>59u);
    return (xorshifted>>rot) | (xorshifted << ((-rot)&31u)); // XSH-RR output permutation
}
It can run about 2 billion times each second, generating 32 random bits each time.
(Quick question. I've noticed you use XorShiftPlus and MT19937. Which do you use for what kind of stuff?)
I'm currently in the second stage, which (for me) is to find a decent evaluator. Since the idea is to get Monte Carlo equities, the evals hit memory at random addresses, so any LUT that doesn't fit in L3 cache (say, around 30MB) is out of the question. (I'm ruling out 2+2 and RayW.)
The fastest such evaluator I've found so far is ACE_eval, which you can also get on Github. It has a few versions, but the fastest one (ace_eval_decompress.c) uses no LUTs at all. I can get about 90M evals per second.
Each card is a 32-bit int. The 4 least-significant bits encode the suit, the next 2 bits are always zero, and the rest contain the rank. The whole deck looks as follows:
Code:
static const uint32 CARD_KEYS[52] = {
0x41, 0x101, 0x401, 0x1001, 0x4001, 0x10001, 0x40001, 0x100001, 0x400001, 0x1000001, 0x4000001, 0x10000001, 0x40000001,
0x42, 0x102, 0x402, 0x1002, 0x4002, 0x10002, 0x40002, 0x100002, 0x400002, 0x1000002, 0x4000002, 0x10000002, 0x40000002,
0x44, 0x104, 0x404, 0x1004, 0x4004, 0x10004, 0x40004, 0x100004, 0x400004, 0x1000004, 0x4000004, 0x10000004, 0x40000004,
0x48, 0x108, 0x408, 0x1008, 0x4008, 0x10008, 0x40008, 0x100008, 0x400008, 0x1000008, 0x4000008, 0x10000008, 0x40000008};
The cards, in human-readable form, can be recovered from the following table:
Code:
static const char* CARD_VALS[52] = {
"2c", "3c", "4c", "5c", "6c", "7c", "8c", "9c", "Tc", "Jc", "Qc", "Kc", "Ac",
"2d", "3d", "4d", "5d", "6d", "7d", "8d", "9d", "Td", "Jd", "Qd", "Kd", "Ad",
"2h", "3h", "4h", "5h", "6h", "7h", "8h", "9h", "Th", "Jh", "Qh", "Kh", "Ah",
"2s", "3s", "4s", "5s", "6s", "7s", "8s", "9s", "Ts", "Js", "Qs", "Ks", "As"};
So, for example, 2c is 0x41, 3c is 0x101, etc.
Each hand is then an array of five 32-bit ints, but it's built rather non-intuitively, so I'll omit the construction.
The evaluation algorithm itself has a lot of branching, but I can't tell how much this affects performance. (Isn't a branch misprediction supposed to cost like 5ns or something? Just a bit over an L1 cache hit?)
At a high level, the branching structure of the algorithm looks as follows:
Code:
if (something)
    return 4 of a kind
else if (something)
    return full house
else if (something)
    return full house
else
    if (something)
        we have a flush, continue
    else if (something)
        we have a flush, continue
    else if (something)
        we have a flush, continue
    else if (something)
        we have a flush, continue
if (something)
    return straight (or straight flush, if we also had a flush)
else if (something)
    return flush
else if (something)
    return 3 of a kind
else if (something)
    if (something)
        return 2 pair
if (something)
    return 1 pair
return high card
I've tried computing equities with Rayeval (Github here), based on RayW's legendary LUT, but I only get about 10M simulations per second.
Anyways, I can't tell you how glad I am to see that people are still interested in this topic, since I think it's not quite solved yet! My intuition is that, with current hardware and an algorithm with little or no branching and tiny LUTs (or none at all), it should be possible to run a few billion Monte Carlo simulations per second on massively parallel architectures. It may be wishful thinking, but hey...
Anyway, to cut this short: I'm going to do a deep dive into your code, but it already looks better than the best stuff I could find.