Java 7 card evaluator performance - Computer Technical Help

Two Plus Two Forums Other Topics Computer and Technical Help

01-27-2021 , 08:47 PM

DadBot

stranger

Join Date: Jan 2021 Posts: 3

Hi,

I'm having some difficulty finding performance measurements for current Java 7-card evaluators. Judging by posts, it seems it should be more than 100 MH/s; on the poker-ai forums it's been measured at 144 MH/s, but I guess that was on ancient hardware. Does anybody know what today's gold standard is?

I'm wondering because I coded one myself in Java; on my i7 @ 2.69 GHz it measures 469.4 MH/s after the JIT does its magic. It does not do much, it only gives equivalence values (the same as Cactus Kev's, which makes it easier to test when there's something to compare against).

The algorithm requires one lookup and the table size is 15 MB, although most of it is unused: only 54K actual values. Does it sound original, or is there something similar?

01-30-2021 , 10:16 AM

plexiq

veteran

Join Date: Apr 2007 Posts: 2,554

Quote:

Originally Posted by DadBot

I'm wondering because I coded one myself in Java; on my i7 @ 2.69 GHz it measures 469.4 MH/s after the JIT does its magic. It does not do much, it only gives equivalence values (the same as Cactus Kev's, which makes it easier to test when there's something to compare against).

Is that for a single thread? If so, that seems to be in the correct ballpark for modern hardware and actually quite nice given the relatively low clock speed.

(Getting approx 420MH/s for ordered and 300MH/s for random order on a Ryzen 5900x, for a Java 7-card eval with a 512kb lookup table.)

02-01-2021 , 07:12 PM

DadBot

stranger

Join Date: Jan 2021 Posts: 3

Yes, that's on a single thread, running a 7-deep nested loop to generate 133784560 combinations. I reviewed my initial measurements on the same machine and was surprised how much they differ when small changes are made.
For instance (times in ms for the full loop):

when the 'evaluate' method is inherited by the same class that runs the loop: AVG = 305, MIN = 298
when the 'evaluate' method lives in a different class from the one that runs the loop: AVG = 331, MIN = 327
- but when passing an int (0-52 denoting each card) rather than a 'Card' object it runs much faster: AVG = 270, MIN = 235
when the 'evaluate' method is in-lined (copy/pasted) into the loop: AVG = 229.5, MIN = 223

So although it's possible to achieve nearly 600 MH/s (223 ms makes 599.93 MH/s), it depends not only on the hardware but also on small details such as how the function is called, if it isn't in-lined. I guess in the worst case it might be as low as 400 MH/s.
How should one measure random access? It sounds like generating random values would itself decrease performance..?

All in all, it's hard to tell whether I'm comparing apples to oranges, but if someone would like to take a look or even compare real-world performance, I'd be happy to share the code.
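
For readers who want to reproduce the numbers above, here is a minimal sketch of the 7-deep nested loop and of the ms to MH/s conversion. It is not DadBot's code: the class and method names are made up for illustration, and the loop only counts hands where a real benchmark would call the evaluator.

```java
public class EnumerateSevenCardHands {
    // Counts every 7-card combination from a 52-card deck with a 7-deep nested loop.
    // Cards are plain ints 0..51; each inner index starts above the outer one,
    // so every combination is visited exactly once.
    static long countHands() {
        long count = 0;
        for (int a = 0; a < 46; a++)
         for (int b = a + 1; b < 47; b++)
          for (int c = b + 1; c < 48; c++)
           for (int d = c + 1; d < 49; d++)
            for (int e = d + 1; e < 50; e++)
             for (int f = e + 1; f < 51; f++)
              for (int g = f + 1; g < 52; g++)
                  count++; // a real benchmark would evaluate (a, b, c, d, e, f, g) here
        return count;
    }

    // Converts a hand count and an elapsed time in milliseconds
    // into millions of hands per second (MH/s).
    static double megaHandsPerSecond(long hands, double millis) {
        return hands / (millis / 1000.0) / 1_000_000.0;
    }

    public static void main(String[] args) {
        long hands = countHands();
        System.out.println("hands: " + hands);
        System.out.println("MH/s at 223 ms: " + megaHandsPerSecond(hands, 223.0));
        System.out.println("MH/s at 331 ms: " + megaHandsPerSecond(hands, 331.0));
    }
}
```

With 133784560 hands, 223 ms works out to roughly 600 MH/s and 331 ms to roughly 404 MH/s, which matches the range quoted in the post.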

02-02-2021 , 04:38 AM

plexiq

veteran

Join Date: Apr 2007 Posts: 2,554

You probably know this, but benchmarking in Java can be a bit tricky.

The JIT compiler will only optimize "hot" code parts after a while, so it's important to "warm up" the relevant parts before doing any measurements. If you didn't do that then this alone can easily account for the differences you observed.

On top of that, keep in mind that most modern CPUs have some sort of boost option for single-threaded work, so the performance will depend on how much thermal headroom you have when you start the task. (I.e., a short 1-threaded task started when your CPU sits at 45°C will perform differently than the same task started at 65°C, because the first case can boost harder/longer.)

The work-around to both of these is to make the benchmarks longer and disregard the early performance.
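
A minimal sketch of that idea, assuming nothing about anyone's actual benchmark code: run the same workload for several rounds, time each round, and average only the later ones. `WarmupBenchmark`, `timeRounds`, `averageAfterWarmup` and `placeholderWorkload` are hypothetical names, and the workload is a stand-in for a full evaluation pass.

```java
import java.util.function.LongSupplier;

public class WarmupBenchmark {
    // Runs the workload `rounds` times and returns each round's time in milliseconds.
    // The result of every round is compared with `expected`, so the work is
    // actually used and cannot be treated as dead code by the JIT.
    static double[] timeRounds(LongSupplier workload, int rounds, long expected) {
        double[] millis = new double[rounds];
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            long result = workload.getAsLong();
            millis[r] = (System.nanoTime() - start) / 1e6;
            if (result != expected) {
                throw new IllegalStateException("unexpected result: " + result);
            }
        }
        return millis;
    }

    // Averages only the rounds after the first `skip` ones, which count as warm-up.
    static double averageAfterWarmup(double[] millis, int skip) {
        double sum = 0;
        for (int i = skip; i < millis.length; i++) {
            sum += millis[i];
        }
        return sum / (millis.length - skip);
    }

    // Hypothetical stand-in for one full evaluation pass; replace with the real loop.
    static long placeholderWorkload() {
        long sum = 0;
        for (int i = 0; i < 5_000_000; i++) {
            sum += Integer.bitCount(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        long expected = placeholderWorkload();
        double[] millis = timeRounds(WarmupBenchmark::placeholderWorkload, 10, expected);
        for (double ms : millis) {
            System.out.println(ms + " ms");
        }
        System.out.println("avg after warm-up: " + averageAfterWarmup(millis, 5) + " ms");
    }
}
```

How many rounds to discard is a judgment call; printing every round makes it easy to see where the timings settle.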

Quote:

How should one measure random access? It sounds like generating random values would itself decrease performance..?

Personally, I generate all 133784560 hands and store them in an array. Ordered test simply evaluates in the same order they were generated by the loop. Second test first shuffles that array and then evaluates it. So any random number generation is outside the benchmark timing.
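
Roughly, that setup could look like the sketch below. This is not the code from the jar: the bitmask hand encoding, the reduced deck size and the `evaluate` placeholder are all assumptions made to keep the example small (a full 52-card deck would need about 1 GB for the `long[]`).

```java
import java.util.Random;

public class OrderedVsShuffled {
    // Binomial coefficient C(n, k), used to size the hand array up front.
    static int choose(int n, int k) {
        long r = 1;
        for (int i = 1; i <= k; i++) {
            r = r * (n - k + i) / i;
        }
        return (int) r;
    }

    // Generates every 7-card hand from a deck of `deckSize` cards (at most 52),
    // each stored as a bitmask where bit i set means card i is in the hand.
    static long[] generateHands(int deckSize) {
        long[] hands = new long[choose(deckSize, 7)];
        int n = 0;
        for (int a = 0; a < deckSize; a++)
         for (int b = a + 1; b < deckSize; b++)
          for (int c = b + 1; c < deckSize; c++)
           for (int d = c + 1; d < deckSize; d++)
            for (int e = d + 1; e < deckSize; e++)
             for (int f = e + 1; f < deckSize; f++)
              for (int g = f + 1; g < deckSize; g++)
                  hands[n++] = (1L << a) | (1L << b) | (1L << c) | (1L << d)
                             | (1L << e) | (1L << f) | (1L << g);
        return hands;
    }

    // Fisher-Yates shuffle with a fixed seed, done once, outside the timed section.
    static void shuffle(long[] hands, long seed) {
        Random rnd = new Random(seed);
        for (int i = hands.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            long tmp = hands[i];
            hands[i] = hands[j];
            hands[j] = tmp;
        }
    }

    // Hypothetical placeholder, NOT a real hand ranking: just something cheap to sum.
    static int evaluate(long hand) {
        return Long.numberOfTrailingZeros(hand) + 1;
    }

    // The timed part: only array reads and evaluation, no generation and no RNG calls.
    static long checksum(long[] hands) {
        long sum = 0;
        for (long h : hands) {
            sum += evaluate(h);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] hands = generateHands(30); // reduced deck keeps memory use small
        long t0 = System.nanoTime();
        long ordered = checksum(hands);
        long t1 = System.nanoTime();
        shuffle(hands, 42L);
        long t2 = System.nanoTime();
        long shuffled = checksum(hands);
        long t3 = System.nanoTime();
        System.out.println("ordered:  " + (t1 - t0) / 1e6 + " ms, chksum " + ordered);
        System.out.println("shuffled: " + (t3 - t2) / 1e6 + " ms, chksum " + shuffled);
    }
}
```

Both passes must print the same checksum, since shuffling changes only the order of the hands. A single pass like this is still subject to the warm-up effects described above, so in practice each test would be repeated several times.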

This one is a bit outdated, but feel free to take a look at how I measure the performance there:
https://www.holdemresources.net/misc...eval-0.5.0.jar

Code:

java -cp holdemresources-7cardeval-0.5.0.jar net.holdemresources.sevencardeval.SevenCardPerformanceTest 
Enumerated: 
	4.65E8 hands/s	288ms	chksum: 215706684548
	4.32E8 hands/s	310ms	chksum: 215706684548
	4.49E8 hands/s	298ms	chksum: 215706684548
	4.47E8 hands/s	299ms	chksum: 215706684548
	4.49E8 hands/s	298ms	chksum: 215706684548
Random order: 
	3.44E8 hands/s	389ms	chksum: 215706684548
	3.22E8 hands/s	415ms	chksum: 215706684548
	3.29E8 hands/s	407ms	chksum: 215706684548
	3.29E8 hands/s	407ms	chksum: 215706684548
	3.30E8 hands/s	406ms	chksum: 215706684548

See how the first few iterations for each test are fairly unstable?

02-02-2021 , 06:39 AM

plexiq

veteran

Join Date: Apr 2007 Posts: 2,554

Oh, another important thing to keep in mind wrt Java benchmarking: You need to keep track of the combined result in some way. In my example above, the "chksum" would be the sum of all evaluated ranks.

If you just call "evaluate(hand)" but never actually use the result, then the JRE is sometimes smart enough to recognize that it can skip some of the "evaluate(..)" work without any side effects.

02-02-2021 , 05:03 PM

totalsoccer

newbie

Join Date: Nov 2015 Posts: 19

Great question. What I found in my own internet "research" is that the vast majority of claims about poker hand evaluators are vague and/or false. There's only one correct and final answer to your question:

Quote:

Originally Posted by Andrew Prock

If people want to make comparisons, it's useful to agree on what you're comparing.

- Andrew

Some yes/no questions you need to consider:

1. Is the input to your function an array of length 7 of ints between 0 and 52 (exclusive)?
2. Can you see from the return value what is the type of hand made? (high-card, pair, etc.).
3. Does your algorithm use a single thread?
4. Is every subsequent input random?
5. Does your algorithm break ties and keep actual ties?
6. Does your algorithm return which cards are the 5 of the "made hand"?

And less important:
A. How much memory does your program use?
B. How much time does it take to set up the lookup tables?
C. Which programming language does your program use?

The hardest case is when all of the 6 answers are "yes". But that doesn't need to be the most useful case: that depends on the application that uses it.

There's a lot of information scattered on the internet. The famous "THE 2+2 thread" where the "seven array lookups algorithm" was born is where the above quote was posted. It is included in a benchmark here: https://github.com/christophschmalho...ter/XPokerEval. On my laptop it does 600 MH/s, but that only makes sense when including the answers to my questions:
1. Yes
2. Yes
3. Yes
4. No (Subsequent hands of the benchmark are sequential with usually only 1 card difference and that reduces the time immensely. Computations are literally reused).
5. Yes
6. No (It can not be derived which are the 5 cards)

And furthermore:
A. ~130 MB
B. Don't know, the lookup table is supplied as a file that's read into memory.
C. C++

I'm writing a GTO solver and found that the 7-card evaluation is not the performance bottleneck. I think that's why attention to 7-card evaluators decreased the past decade. My Java 7-card evaluator (with answers 1:No, 2:Yes, 3:Yes, 4:No, 5:Yes, 6:No; A: 100MB, B: 2s, C: Java) does 60 MH/s (which is already an achievement) on my laptop when run with Java 17: https://github.com/alberthendriks/lpokerbot

02-08-2021 , 05:56 PM

DadBot

stranger

Join Date: Jan 2021 Posts: 3

1. Yes. I've noticed that passing 7 Card instances to a function incurs quite a performance penalty. Although, to make a fair comparison, maybe not all evaluators use integers..
2. Yes and no. It returns the same equivalence values as Cactus Kev's 5-card evaluator. So, for example, the range for King high is known and the hand type can be deduced.
3. Yes
4. I've tried shuffling an array of length 133784560 and iterating over it rather than nesting loops, and performance dropped 10-fold. I assume that was more due to cache misses when reading from the array, after which lookup table records might be evicted from L1/L2. Besides, the algorithm requires only one lookup rather than 7, and similar hands have different hashes (the lookup table index differs), so I don't think the CPU preloads memory somehow. In any case that's my speculation; I'll have to figure out how to isolate randomness performance from lookup performance.
5. Could you explain that one?
6. Yes and no. Face values can be deduced from the equivalence value, but that would require extra processing. On the other hand, the exact suit is determined when calculating the flush equivalence value.

A. ~15MB
B. ~2.4s
C. Java
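
On point 2, deducing the hand type from a Cactus Kev style equivalence value is a matter of range checks. A minimal sketch, assuming the evaluator uses the same numbering as Cactus Kev's 5-card evaluator (1 is the best hand, 7462 the worst); `HandCategory` and `category` are illustrative names, not part of any evaluator discussed here:

```java
public class HandCategory {
    // Maps an equivalence value in Cactus Kev's numbering to its hand type.
    // 1 is a royal flush, 7462 is the worst high-card hand (7-5-4-3-2).
    static String category(int value) {
        if (value < 1 || value > 7462) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        if (value > 6185) return "high card";        // 1277 values
        if (value > 3325) return "one pair";         // 2860 values
        if (value > 2467) return "two pair";         //  858 values
        if (value > 1609) return "three of a kind";  //  858 values
        if (value > 1599) return "straight";         //   10 values
        if (value > 322)  return "flush";            // 1277 values
        if (value > 166)  return "full house";       //  156 values
        if (value > 10)   return "four of a kind";   //  156 values
        return "straight flush";                     //   10 values
    }

    public static void main(String[] args) {
        int[] samples = {1, 11, 167, 323, 1600, 1610, 2468, 3326, 6186, 7462};
        for (int v : samples) {
            System.out.println(v + " -> " + category(v));
        }
    }
}
```

Finer distinctions, such as King high within the high-card range, work the same way with narrower ranges.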

The best result I've managed to get on my laptop was near 600 MH/s, but since it depends on room temperature and boost it's difficult to compare. That's with the checksum.
A note regarding the checksum: I can't comment on other languages or different Java versions, but in my case adding or removing the checksum calculation does not impact performance dramatically, so I assume the compiler does not optimize out the lookup itself or related computations.

A GTO solver sounds very interesting; are there any open-source ones? What's the biggest bottleneck, then?

02-09-2021 , 05:22 AM

plexiq

veteran

Join Date: Apr 2007 Posts: 2,554

Quote:

Originally Posted by DadBot

4. I've tried shuffling an array of length 133784560 and iterating over it rather than nesting loops, and performance dropped 10-fold. I assume that was more due to cache misses when reading from the array, after which lookup table records might be evicted from L1/L2. Besides, the algorithm requires only one lookup rather than 7, and similar hands have different hashes (the lookup table index differs), so I don't think the CPU preloads memory somehow. In any case that's my speculation; I'll have to figure out how to isolate randomness performance from lookup performance.

Aside from additional cache misses, branch prediction will also perform considerably worse with random order. If you are using multiple "if" statements in your evaluator then you can expect some performance degradation for random order. It certainly shouldn't cause a 10x drop though.

Do you get the same drop if you don't shuffle the array?

Quote:

I'm writing a GTO solver and found that the 7-card evaluation is not the performance bottleneck. I think that's why attention to 7-card evaluators decreased the past decade.

Very much this btw. It's an interesting coding challenge, that's why people keep re-inventing this particular wheel. In practice it's honestly not going to matter for most applications/solvers whether you do 100 MH/s or 1000 MH/s per thread; hand evaluation makes up a negligible part of the total runtime anyway.
