Hand History Database for Research (Beta)

04-04-2009 , 02:35 PM
Note: This is a pre-announcement, so if you decide to go for this at this point, you would partially be a "beta-tester" for the whole process, software, and data involved.

In a nutshell

We are happy to announce that we are giving away for free, for research purposes, nearly one billion real-money poker hand histories, played on some major poker sites this year and last, plus supporting software to read them.

In details

We want to provide these hands strictly for research. Therefore, to be eligible to obtain these hands, one has to be a published author of at least one Poker AI conference paper, and it should be possible for us to verify that. Please PM me here, or e-mail me at findbg@gmail.com for further details.

The billion-hand database contains Limit, Pot Limit and No Limit Texas Hold'em and Omaha cash game hand histories, at limits from NL2 to NL100000. The inflated size of these hands is in the range of one terabyte. Therefore the hands are offered in a proprietary parsed format, together with a Java library for reading the hands (the possibility to export them as plain text will come later). The source code of the Java library is offered for free as well (under GPL v3).

This format, together with our own tools, makes it possible to run analyses over hundreds of millions of hands on a mainstream PC, where otherwise one might need to spend up to hundreds of thousands of bucks on a setup able to handle the same task via a conventional RDBMS.

We support a no-opponent-profiling policy. Therefore we are taking measures to prevent usage of this database for opponent profiling. We are obfuscating the names of the poker sites from which these hands were obtained, table names, player IDs and hand IDs. We are also randomly modifying the time each hand was played (the time is shifted by some seconds, so that extraction of time-dependent player patterns is still possible).

Note: The end user licence agreements (EULAs) of some of the poker sites from which these hands were obtained are against datamining. We have asked these poker sites for permission to redistribute these hands in the described manner. We have not yet had a response from all of the sites, but we have not been rejected either. Although we never formally and technically agreed to the EULAs of these poker sites, we still believe we did what is possible to comply with the intention and spirit of these EULAs, and we will further cooperate with the poker sites if they require us to do so, to eliminate doubts, if any, that the hands will be used exclusively for research purposes and not to augment real money play.

If you choose to apply, please fill out the form below and send it back by PM or e-mail.

Application process

Please provide the following information.
  • Name:
  • University:
  • Position:
  • E-mail /university email/:
  • Paper published /you need to respond to a verification sent to the author's e-mail given in that paper/:
  • Home page:
  • Purpose of request:
Alternative ways to verify your academic credibility would work as well, but will not be lighter than the above.

Please, also indicate that you agree with the following terms:
  • All hand histories are provided to you for personal use. You must not redistribute them to third parties under any circumstances. You may use them personally without restrictions or obligations (we would be happy if you cite pokerai.org/pf3 as the source of these hands in your academic work).
  • All software for parsing the hand history database is provided under the GPL v3 license. Any redistribution is bound by this license.
I agree to these conditions/Type yes, and your first name as signature/:

*****

Regards,
Indiana && PeppaPig
04-04-2009 , 05:09 PM
Indiana, are you actually affiliated with a research institution? Here I thought you were just a botter.
04-05-2009 , 04:31 AM
I am, but this has no relation to the above initiative, or to poker in general.
04-08-2009 , 04:04 PM
Here is an example of what kind of things you can easily do (it took us just 10 minutes to implement this example).
This and other examples are just part of the software distribution.

04-09-2009 , 01:28 PM
Interesting.

1. Why must one be published? What if you're a retired computer programmer who's interested in poker and likes playing with large amounts of data?

2. Do the hand histories include player chat? (Observer chat is not as interesting.)

3. Is there any API other than Java? (perhaps C or, dare I say, Fortran?)

4. Is your canonical HH format available? Do you have a transform into XML? (Not the proprietary compressed format, but what it expands into.)

5. Does the database of player names include country?

6. Do you also have HHs from tournaments? If so, will that be forthcoming?
04-09-2009 , 03:57 PM
Quote:
Originally Posted by sapientia
Interesting.
1. Why must one be published? What if you're a retired computer programmer who's interested in poker and likes playing with large amounts of data?
2. Do the hand histories include player chat? (Observer chat is not as interesting.)
3. Is there any API other than Java? (perhaps C or, dare I say, Fortran?)
4. Is your canonical HH format available? Do you have a transform into XML? (Not the proprietary compressed format, but what it expands into.)
5. Does the database of player names include country?
6. Do you also have HHs from tournaments? If so, will that be forthcoming?
1- We want to make sure that these hands are used only for research, and not to augment real money poker play. Being a published author sufficiently satisfies this. We might come up with alternative ways to verify this, but they have to make it reasonably sure that this is the case. Being a retired programmer isn't good enough for that, apologies for this.

2- No. Sometime in the future we can include that separately, but for now we don't see a good reason to do it.

3- It is only Java for now. I will invite anyone who gets the software to work on a C# port.

4- It will expand to the format used by popular sites, or to a proprietary one that looks like the one used by popular sites. Anything else (XML, etc.) is an option as well.

5- No. It does not even include player names. Player names are obfuscated to numbers from 1 to 1000000+.

6- No. Might be forthcoming, but for the next few weeks/months it is cash games only.
04-09-2009 , 07:22 PM
Quote:
Originally Posted by indianaV8
1- We want to make sure that these hands are used only for research, and not to augment real money poker play. Being a published author sufficiently satisfies this. We might come up with alternative ways to verify this, but they have to make it reasonably sure that this is the case. Being a retired programmer isn't good enough for that, apologies for this.
Well, first of all, I don't play cash games. Besides, given all the obfuscation, I'm not sure how the HHs can be used to augment real money play. Perhaps those doing research using this data, currently published authors or not, could provide results for publication on pokerai or some other poker research site -- maybe the Alberta group would be interested in hosting results.

Quote:
Originally Posted by indianaV8
2- No. Somewhere in the future we can include that [chat] separately, but for now we don't see good reason to do it.
But this is an ideal research topic. How does chat, presence and quantity, correlate with chatter's results? How about with chattee's results? How much does quantity of chat correlate with aggression? Does a prolific chatter spur chat from the rest of the table? And if so, does that cause a player who's not prone to chat to lose? Etc. One thing I've learned from 20+ years of research -- never throw data away or make it difficult to get at, always have it at hand.

Quote:
Originally Posted by indianaV8
3- It is only Java for now. I will invite anyone who gets the software to work on a C# port.
ok.

Quote:
Originally Posted by indianaV8
4- It will expand to format used by popular sites, or just proprietary one that looks like one used by popular sites. Anything else (XML, etc.) is an option as well.
ok.

Quote:
Originally Posted by indianaV8
5- No. It does not even include player names. Player names are obfuscated to numbers from 1 to 1000000+.
Right. But is there a table with player_id (number from 1..1000000+) and known player info -- country, first date played, last date played, number of sessions, number of hands, each of the previous by blind level, number of sites played on, etc.

Quote:
Originally Posted by indianaV8
6- No. Might be forthcoming, but for the next few weeks/months it is cash games only.
ok. I hope you have tourney results, so as to map player_id to standings and winnings.
04-09-2009 , 08:30 PM
1- The problem here is not to cluster the types of research (or researchers) but to come up with reliable ways to verify the usage of these hands. I will think about how to open this up to more people in the future (e.g. provide people with the software and a small sample database, and if they want to develop examples that run over the full database, they have to submit them, and we'll send the results back to them).

2- I agree, this is a good point. There are however further issues with distributing the chat (privacy, and the encoding of hands becomes much bigger; currently we encode one hand in 75 bytes on average). Maybe I can think of providing summary information about the chat per player.
04-21-2009 , 09:38 AM
Quote:
Originally Posted by indianaV8
Here is an example of what kind of things you can easily do (it took us just 10 minutes to implement this example).
This and other examples are just part of the software distribution.

Indiana,

Great cause, and a fine solution to present data without (IMO) breaching the EULAs.
Could you please post the same chart for profitable players.
To define profitability you can simply choose [monthly p&l]>zero or [yearly p&l]>zero, the yearly one is much more reliable of course...
I have been reading 2+2 but never posted before, perhaps it is time I start to...

Thank you
SJ
04-21-2009 , 02:52 PM
Quote:
Originally Posted by sloppyJohn
Indiana,

Great cause, and a fine solution to present data without (IMO) breaching the EULAs.
Could you please post the same chart for profitable players.
To define profitability you can simply choose [monthly p&l]>zero or [yearly p&l]>zero, the yearly one is much more reliable of course...
I have been reading 2+2 but never posted before, perhaps it is time I start to...

Thank you
SJ
Hi,

Yes, I can. In fact I had this in mind, but I am delayed for various reasons. I still plan to do this.

Defining "profitable" players is not that easy. Many (most) people play very few hands, so their results are close to 50/50. If you take just the long-term players (e.g. those who played over 100k hands), you can argue that players will only play that many hands if they are winners, and you again get buggy statistics.

So what can you do? I'm sure this was discussed already on 2+2, but I don't have the time to dig it out. If someone points me to it, or summarizes the best way to calculate the % of winning players, I can do that.

Otherwise, what I came up with is the following:
1) Graph of players clustered by how many hands they played (this has all the issues discussed above)
2) Number of hands played by winning players as a % of all hands played. I'm not sure if this is an improvement that solves the above issues, as it does not take into account that among players with few hands there are many small winners and approximately the same number of "slightly bigger" losers.
3) % of players that won 95% of all money won. For example, if we know that 10% of the players won 90% of the money, that's something, although I don't believe this would be the figure. You still have the "long tail" of winning players that played few hands.

Finally, some combination of the above. E.g. the above approaches, but on players that have over 2k, or 5k, or 10k hands, to ensure at least some statistical significance.
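Approaches 2) and 3), with the minimum-hands cutoff from the last paragraph, could be sketched roughly as follows. This is only an illustration in Python, not code from the Java library: the input layout (player id mapped to hands played and net result) and the function names are assumptions made up for this example.

```python
from typing import Dict, Tuple

# Hypothetical input: player_id -> (hands_played, net_result)
PlayerStats = Dict[int, Tuple[int, float]]


def winner_hand_share(stats: PlayerStats, min_hands: int = 0) -> float:
    """Approach 2: hands played by winning players as a fraction of all hands."""
    eligible = [(h, r) for h, r in stats.values() if h >= min_hands]
    total_hands = sum(h for h, _ in eligible)
    if total_hands == 0:
        return 0.0
    return sum(h for h, r in eligible if r > 0) / total_hands


def top_winner_fraction(stats: PlayerStats, money_share: float = 0.95,
                        min_hands: int = 0) -> float:
    """Approach 3: smallest fraction of players who together won
    `money_share` of all money won (losers count in the denominator)."""
    results = [r for h, r in stats.values() if h >= min_hands]
    wins = sorted((r for r in results if r > 0), reverse=True)
    total_won = sum(wins)
    if not results or total_won == 0:
        return 0.0
    running = 0.0
    for count, win in enumerate(wins, start=1):
        running += win
        if running >= money_share * total_won:
            return count / len(results)
    return len(wins) / len(results)


# Toy data, invented for illustration only
toy = {1: (100, 90.0), 2: (50, 10.0), 3: (200, -60.0), 4: (10, -40.0)}
print(winner_hand_share(toy), top_winner_fraction(toy, money_share=0.5))
```

Raising `min_hands` to 2k, 5k or 10k gives the "combination" variant; the survivorship problem discussed above still applies to whatever cutoff is picked.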
04-22-2009 , 01:54 AM
What are you researching exactly?
04-22-2009 , 03:51 AM
How about the following definition:
If someone has been successful in maintaining a profit for over 50k hands, they are considered profitable (this includes breakeven players, who apparently end up earning rakeback). This is a soft definition and will include errors, for example someone with $1 profit after 50k hands, which he got after earning $1,500 in one hand. But I think that for the purpose of understanding the behaviour of profitable players it suffices.
A more precise solution could include relative BB profit. For this we need to calculate the avg. BB for each player ((∑ #hands x each BB ever played) / total #hands). Then we can factor in common rakeback (~27%) per avg. BB and define a losing/breakeven/profitable player by BB/10,000 hands in categories of avg. BB sizes.

I really feel that the first option is good enough for this goal.
What do you think?
John

P.S.
If you accept the soft definition of profitable players, this will also give an estimate of the % of profitable players in the player population.
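For the relative-BB variant above, here is a minimal sketch of the hand-weighted average big blind and a win rate in average BBs per 10,000 hands. The per-stake tuple layout and the function names are assumptions for illustration. The rakeback adjustment (~27%) is left out because it needs the rake paid per player, which the post does not specify.

```python
from typing import List, Tuple

# Hypothetical per-stake summary for one player:
# (hands_played, big_blind_size, profit_in_currency)
StakeSummary = Tuple[int, float, float]


def average_big_blind(stakes: List[StakeSummary]) -> float:
    """(sum of #hands x BB at each stake) / total #hands."""
    total_hands = sum(h for h, _, _ in stakes)
    if total_hands == 0:
        raise ValueError("player has no hands")
    return sum(h * bb for h, bb, _ in stakes) / total_hands


def winrate_bb_per_10k(stakes: List[StakeSummary]) -> float:
    """Total profit expressed in average big blinds per 10,000 hands."""
    total_hands = sum(h for h, _, _ in stakes)
    total_profit = sum(p for _, _, p in stakes)
    return total_profit / average_big_blind(stakes) / total_hands * 10_000


# Toy player, invented for illustration: 30k hands at BB=1, 20k hands at BB=2
player = [(30_000, 1.0, 500.0), (20_000, 2.0, -100.0)]
print(average_big_blind(player), winrate_bb_per_10k(player))
```

The soft definition then reduces to checking that total hands exceed 50k and total profit is at least zero; the win rate above is what would be bucketed by average BB size.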
04-22-2009 , 02:29 PM
Just to understand, you want the % of profitable players that played over 50k hands out of all players with over 50k hands? Or is it rather the profitable players with over 50k hands as a share of all players (no matter how many hands they played)?

Keep in mind that the number of players that played fewer than 50k hands is much larger than the number that played more than that (for the samples that I checked so far).

@Lego05 - Why do you ask?
04-22-2009 , 03:03 PM
this is extremely interesting. however, on the sites you're tracking, what percentage of hands are you tracking? what sites are you tracking? since when have you been tracking?

thanks
04-23-2009 , 04:59 AM
Indiana,

these are two different searches:
1) # hands profitable players play on avg. every month
2) something that always intrigues the industry, % profitable players out of the poker players population

I think the beginning of both of these is in defining a profitable player, statistically speaking.

John
04-23-2009 , 02:03 PM
Quote:
Originally Posted by indianaV8

@Lego05 - Why do you ask?

Curious.
04-23-2009 , 02:32 PM
Quote:
Originally Posted by sloppyJohn
Indiana,

these are two different searches:
1) # hands profitable players play on avg. every month
2) something that always intrigues the industry, % profitable players out of the poker players population

I think the beginning of both of these is in defining a profitable player, statistically speaking.

John
OK, so 1 and 2 are the questions. And how do you define a 'profitable' player: one with EV>0 over more than 50k hands? And non-profitable is everyone else, no matter how many hands (so an EV+ player with 40000 hands will be non-profitable)?
04-23-2009 , 03:18 PM
Hi,
nice idea!
Random thoughts on that:
- Making data available is a Good Thing(tm). With sites like sharkscope or officialpokerrankings having tons of data available, I think there's nothing bad in making this data available to everyone.
- "your privacy is an illusion, get over it" (Sun CEO some years ago, don't have name ready atm) - anonymizing data is incredibly hard, just google "AOL query logs" for the scandal involving (admittedly poorly) anonymized search logs that got AOL's CTO fired, as well as "query log anonymization" for a number of follow-up papers showing how even more sophisticated anonymization approaches can easily (with varying definitions of "easily", of course) be defeated. This is not to discourage you or anyone; just be prepared that your anonymization approaches will be valuable for showing your good intentions, but probably not for keeping posts like "player x is really Blabla on FTP" from appearing.
- Information tends to diffuse. I'd be surprised to see the data staying with whom you send it to (of course that chance increases the less people you send it to)
- Having said all that, I agree with some of the previous posts that your distribution policy seems a bit restrictive. In fact, a large part of your target audience might already have such a corpus; I know at least UAlberta have a large corpus of played hands (once saw a talk by their boss Schaeffer; he said it was largely useless, but that was relative to their goal of beating world class players). I think many people here might benefit from this and be able to create interesting results to share, without being full-time poker researchers (I'm being a bit partial here of course, being a PhD student in non-poker-related machine learning).

Just my two cents, in any case I think it's a great idea and a great thing to have accomplished already...

Tom
04-23-2009 , 04:04 PM
Hi Tom,

I know it is hard to anonymise things. If one is determined enough, he could eventually manage to find out who a particular player is, but I want to make it
1) Hard enough for a single player to be identified
2) Impossible for the database as a whole to be used together with HUDs, etc.

I know that point 1) alone might be concerning, and this is one of the reasons why I decided to stick to offering this just to published authors for now. That massively increases the probability that the DB will be used for research on poker and poker AI, and not for real money play.

I don't want to remove this condition for now (or to have nothing close to it). Anyway, any comments or suggestions on the anonymization side are welcome.

I believe this database and the software to work with it are a massive improvement over anything that already exists. I encode single hands in under 80 bytes. So 1 billion hands (or any number of hands) are encoded in very little space, and already today, or very soon, you can even fit that into main memory for extremely fast queries.

I don't think any tool, be it PT3, HM or anything else, can handle such a big number of hands and query them all in a matter of minutes (or under a minute). That's my goal, and to offer that to researchers.

I was already contacted by several universities, but all in all this is picking up slowly. I ain't in a hurry anyway (and have limited time for this), but I hope that long term this will generate interesting things.
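As a back-of-envelope check of the size figures quoted in this thread (under 80 bytes per hand here, 75 bytes on average in an earlier post, about one terabyte inflated), a few lines of Python are enough. The function name is made up for this sketch.

```python
def encoded_size_gb(hands: int, bytes_per_hand: float) -> float:
    """Total encoded size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return hands * bytes_per_hand / 1e9


HANDS = 1_000_000_000  # "1 billion" hands, as quoted in the thread

print(encoded_size_gb(HANDS, 80))  # upper bound: "under 80 bytes" per hand
print(encoded_size_gb(HANDS, 75))  # the 75-byte average quoted earlier
```

So the whole database would come to roughly 75 to 80 GB in the encoded format, against about one terabyte as plain text, which is the basis for the "fits in main memory" claim.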
04-23-2009 , 06:53 PM
Indianna, it is people like you who will be the end of online poker

How have you got 1 billion hands to distribute, did you play 1,000,000,000 hands all by yourself at all those limits?

As with botting, I'm sure your obtaining of this information was against all poker sites rules.

The only thing gained from this database would be info on other players so you could know them inside out.

I mean, what are you researching? What can be gained here, there is no genuine reason why...well I don't even know because I fail to see anything interesting to be gained from this 'research'...

I mean what is gained if you use this info to find out the average % of players who 3 bet 5% or more? What is gained from these stats? Nothing.

Other than knowing a hell of a lot about a lot of players you have never even played against before.

Last edited by TPTK22; 04-23-2009 at 06:55 PM. Reason: This guy should be banned already
04-23-2009 , 07:40 PM
@TPTK22 - If we wanted to do what you describe, we could have done that without coming to 2+2, and offering all this to other researchers only, and taking all these efforts to obfuscate the hands, and so on? Isn't that simple logic to make?

And it is completely irrelevant if I use this myself and for what. I don't use it to profile opponents (that's the truth) - but this is even completely irrelevant.

As of what kind of research - e.g. research on how to identify collusion amongst players, how that sounds? I don't do this research myself, I just give it as an example.

Last edited by indianaV8; 04-23-2009 at 07:55 PM.
04-23-2009 , 08:20 PM
Quote:
Originally Posted by indianaV8
Defining "profitable" players is not that easy. Many (most) people play very few hands, so their results are close to 50/50. If you take just the long-term players (e.g. those who played over 100k hands), you can argue that players will only play that many hands if they are winners, and you again get buggy statistics.
The usual method is to ignore the question of which of the low-frequency players are winners, and concentrate on estimating the number.

Say you had 10,000 players who played only one hand each. You compute over your entire sample for each type of game (game, limit structure and any other variable that seems to matter) the standard deviation of player outcome per hand. For example, suppose $5/$10 limit Hold'em has a standard deviation of $15 per hand (an individual player would have a lot of zeros, a lot of $5 and $10 losses, a bunch of small losses and gains and a few big losses or gains).

You standardize each of the 10,000 results by dividing by the standard deviation estimated for that game. You assume the expected profit (in standardized units) by player has a Normal distribution with unknown mean and standard deviation, and that the actual outcome has a Normal distribution with known standard deviation one (since you already standardized). Now you can easily estimate the parameters for the prior Normal, and that tells you the fraction of profitable players.

If each player has more hands, you just standardize their results into a number with standard deviation one.
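Since pseudo-code usually helps with this kind of recipe, here is one way the estimate could be sketched in Python. The post does not say how to fit the prior, so the method-of-moments fit below is an assumption: if each standardized result z is Normal(m, 1) around the player's true mean m, and m is Normal(mu, tau^2) across players, then the z values overall have mean mu and variance tau^2 + 1. The function name is hypothetical.

```python
from statistics import NormalDist, fmean, variance
from typing import Sequence


def fraction_profitable(z: Sequence[float]) -> float:
    """Estimate the share of players with positive true expectation.

    `z` holds one standardized result per player (outcome divided by the
    standard deviation for that game), so the noise has standard deviation 1.
    Needs at least two players.
    """
    mu = fmean(z)                         # estimate of the prior mean
    tau_sq = max(variance(z) - 1.0, 0.0)  # prior variance = total variance - noise
    if tau_sq == 0.0:
        # No detectable spread between players: everyone shares the same sign.
        return 1.0 if mu > 0 else 0.0
    # P(m > 0) under the fitted Normal(mu, tau^2) prior
    return 1.0 - NormalDist(mu, tau_sq ** 0.5).cdf(0.0)


# Toy standardized results, invented for illustration
print(fraction_profitable([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

The point of the standardizing step is that it fixes the noise variance at 1, so whatever variance is left over in the observed results can be attributed to real differences between players.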
04-24-2009 , 10:35 AM
Quote:
Originally Posted by indianaV8
@TPTK22 - If we wanted to do what you describe, we could have done that without coming to 2+2, and offering all this to other researchers only, and taking all these efforts to obfuscate the hands, and so on? Isn't that simple logic to make?

And it is completely irrelevant if I use this myself and for what. I don't use it to profile opponents (that's the truth) - but this is even completely irrelevant.

As of what kind of research - e.g. research on how to identify collusion amongst players, how that sounds? I don't do this research myself, I just give it as an example.
I just base it on the fact you believe botting is acceptable and you openly admit that you use bots so when you say you have a 1 billion hand database (in beta) which you want to pass on for 'research purposes' then what do you expect?

What field of researcher are you looking for? I wasn't aware that there are many fields of research into the very detailed statistics of online poker players...but what do I know?
04-24-2009 , 01:54 PM
@AaronBrown, I didn't get how you do that.

You make the assumption that the EV of all players follows a normal distribution (that's the first thing I don't get: why?).

Then you produce this normalized array of EVs (divided by STD) to get to std 1 and some mean X. This operation does not change the number of EV+ / EV- players, so why bother to work with this new array of normalized values?

Finally, on the "getting back" step, what exactly has changed? The EV, the std, and the distribution of the EVs of the original data you could have already determined in the first place.

So all in all, I'm afraid I didn't get your idea. If you are coding, I would appreciate it if you just post it in pseudo-code.
04-24-2009 , 10:05 PM
Indiana,

there's a thread on this forum about which cards (if any) flop more often than others. I was hoping you could give us a quick breakdown of how often you see each card on the flop (just the rank, suits not necessary). Maybe just for 6max? You're the only one I know with access to enough hands to settle it.

      