Continuous Decision Prisoners Dilemma - Science, Math and Philosophy Forum


Continuous Decision Prisoners Dilemma


05-25-2015 , 08:48 PM

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

With the passing of John Nash the idea came to me today for a different kind of Prisoner's Dilemma. Suppose that instead of the Prisoners making a discrete decision in secret, they make variable continuous decisions in full view of each other. In fact, not only in full view of opponent's variable decision up to the present time, but with an infinitesimal look ahead to an imminent upcoming change in opponent's variable decision. The payoffs would accumulate based on the standard PD payoff table weighted by the time intervals for the states produced by their variable continuous decisions.

It seems to me the stable solution is for both prisoners to continuously cooperate. Each prisoner knows if he switches to a time interval of defecting the other prisoner will instantly switch along with him and they will both accumulate the worst payoff-time. Any attempt by either to accumulate defect-cooperate max payoff-time for himself will only result in more min payoff defect-defect payoff-time for both of them.
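A minimal sketch of the time-weighted payoff accumulation described above, assuming piecewise-constant decision paths. The payoff numbers (T=5, R=3, P=1, S=0) and the helper names `action_at` and `accumulate` are illustrative assumptions, not part of the original post. Zero reaction time is modeled crudely by giving the opponent an exact mirror of the deviator's path.

```python
from bisect import bisect_right

# Assumed payoff rates per unit time: (my payoff, opponent's payoff).
# The thread fixes no numbers; these are the common T=5, R=3, P=1, S=0.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def action_at(path, t):
    """Return the action in force at time t.

    path is a list of (start_time, action) pairs sorted by start_time,
    with the first pair starting at time 0.
    """
    starts = [start for start, _ in path]
    return path[bisect_right(starts, t) - 1][1]


def accumulate(path_a, path_b, horizon):
    """Integrate both players' payoff rates over [0, horizon]."""
    switch_times = {s for s, _ in path_a + path_b if 0 < s < horizon}
    cuts = sorted({0.0, horizon} | switch_times)
    total_a = total_b = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        rate_a, rate_b = PAYOFF[(action_at(path_a, lo), action_at(path_b, lo))]
        total_a += rate_a * (hi - lo)
        total_b += rate_b * (hi - lo)
    return total_a, total_b


always_cooperate = [(0.0, "C")]
# A player who defects on [0.4, 0.6). With zero reaction time the opponent's
# path is modeled as an exact mirror of it (an assumption of this sketch).
deviator = [(0.0, "C"), (0.4, "D"), (0.6, "C")]

print(accumulate(always_cooperate, always_cooperate, 1.0))
print(accumulate(deviator, deviator, 1.0))
```

Under these assumptions the mirrored deviation only converts cooperate-cooperate time into defect-defect time, so the deviator ends with roughly 2.6 instead of 3.0 over a unit horizon.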

It also seems to me that this models the 3 average girls and 1 beautiful girl for 3 guys movie scenario. If at any time anyone defects and looks like he's going for the beautiful girl, they all defect and start the min payoff-time of all girls losing interest. They all know this, so they all stay with the average girls.

I would also like to argue that this amounts to a kind of limiting case for an iterated prisoners dilemma where the time intervals between iterations tends toward 0. Trouble is, the iterated prisoners dilemma doesn't have the infinitesimal look ahead feature. But what if, with decisions up to the present still in full view, the infinitesimal look ahead feature is dropped from the continuous decision model? Would the stable solution then be some kind of tit-for-tat Brownian motion type pair of variable continuous decisions? Or might it still be to always cooperate?

PairTheBoard


05-26-2015 , 12:48 AM

TomCowley

Pooh-Bah

Join Date: Sep 2004 Posts: 5,649

Without the look-ahead, is that any different from a normal repeated PD? (Cooperation persists if neither player knows when the game is going to end.) Other scenarios could depend on exactly how you define the reaction time, etc.


05-26-2015 , 06:54 AM

BrianTheMick2

Long way to go and a short time to get there.

Join Date: May 2012 Posts: 19,410

Quote:

Originally Posted by TomCowley

This scenario is effectively zero reaction time, right?


05-26-2015 , 10:44 AM

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

Of course this is all assuming the mathematics for this idea actually works, which I'm afraid might be problematic. The existence of continuous time Brownian Motion requires the application of powerful existence theorems for a stochastic limit of a sequence of finer and finer random-walks as I vaguely recall. It's not an easy thing to do at any rate. I'm thinking the model for this would look like some kind of 2 dimensional filtration or 2 jointly measurable filtrations or something like that. I might be asking the impossible. It's beyond my expertise anyway.

But assuming it can be done mathematically, it might not matter if there's a time limit. The problem with the iterated PD with a known time limit is the last iteration. That becomes a one-off PD with defect-defect and then you logically step back from there. But with this variable continuous decision model you don't really have a "last moment" to grab hold of to start that logical chain.

PairTheBoard


05-26-2015 , 10:45 AM

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by BrianTheMick2

This scenario is effectively zero reaction time, right?

Yea, that's the idea.

PairTheBoard


05-26-2015 , 11:39 AM

Zeno

Le Misanthrope

Join Date: Sep 2002 Posts: 22,370

Quote:

Originally Posted by PairTheBoard

Of course this is all assuming the mathematics for this idea actually works, which I'm afraid might be problematic. The existence of continuous time Brownian Motion requires the application of powerful existence theorems for a stochastic limit of a sequence of finer and finer random-walks as I vaguely recall. It's not an easy thing to do at any rate. I'm thinking the model for this would look like some kind of 2 dimensional filtration or 2 jointly measurable filtrations or something like that. I might be asking the impossible. It's beyond my expertise anyway.

But assuming it can be done mathematically, it might not matter if there's a time limit. The problem with the iterated PD with a known time limit is the last iteration. That becomes a one-off PD with defect-defect and then you logically step back from there. But with this variable continuous decision model you don't really have a "last moment" to grab hold of to start that logical chain.

PairTheBoard

Back in the Ordovician, I bumped my head against the lamppost of Random Walk Theory. It was important to try and analyze some process we were trying to model.

These so-called "self-avoiding random walks" are used in numerous physical models including polymer chains, protein folding and Brownian motion. See link below:

Self-Avoiding Random Walks

Someone here should be smart enough to figure out if this is all nonsense with regard to PD or if there really is something to it.

Also, does god have a stop watch?

Last edited by Zeno; 05-26-2015 at 11:51 AM.


05-26-2015 , 02:09 PM

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by Zeno

Someone here should be smart enough to figure out if this is all nonsense with regard to PD or if there really is something to it.

Assuming it's mathematically viable, if it works with a known fixed time like the iterated PD does without a fixed time, then I'm thinking that, in action, it would "look" very much like a regular one-off PD except the stable solution would be cooperate-cooperate instead of defect-defect because the zero reaction time keeps the players honest with each other. It seems to me that could have interesting and maybe significant implications.

It's a common theme in mathematics to take a discrete model and pass to a continuous limiting version where the calculations are easier and can be proven to approximate the discrete version. I believe this is what they did with Black-Scholes, taking a trading scheme and looking at the continuous time stochastic limit under greater and greater trading frequency.

I've always thought there was something fishy with the "who's got the blue eyes on the island" type logic that sort of ruins the known fixed time iterated PD. It just doesn't ring true to life for me. Maybe it's because it's the wrong model. Maybe the variable continuous decision model is better.

I know there's been people who have posted here who could handle this. They know who they are.

PairTheBoard


05-26-2015 , 04:03 PM

TomCowley

Pooh-Bah

Join Date: Sep 2004 Posts: 5,649

Yeah, it's not immediately obvious to me how to pass to the limit here in a meaningful way. It doesn't matter how many iterations or how small the intervals are, with an actual number of iterations of actual value and a known ending, the standard backwards induction works. In the continuous model with a known ending, there's no induction, and with instant reaction time, I don't see how to exploit the strategy of F(0)=cooperate and F(t)=if opponent has ever defected on [0,t), defect, else cooperate. You can pick up measure-zero "equity" by defecting at a point, but unless it's the endpoint, you're costing yourself in the future, and if it is the endpoint, it's meaningless because it's measure-zero (and finitely negative payoff per unit time).
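A rough sketch of why that trigger strategy is hard to exploit, assuming payoff rates T=5, R=3, P=1 per unit time and a hypothetical `reaction_lag` parameter for how long the opponent takes to respond (neither is specified in the thread). It computes the net gain from first defecting at time `t0` compared with cooperating throughout.

```python
# Assumed payoff rates per unit time: temptation, reward, punishment.
T, R, P = 5.0, 3.0, 1.0


def deviation_gain(t0, horizon, reaction_lag):
    """Net gain from first defecting at t0 against the trigger strategy,
    compared with cooperating over the whole of [0, horizon].

    For reaction_lag time units the deviator earns T instead of R; after
    that both players defect and he earns P instead of R.
    """
    remaining = horizon - t0
    exploited = min(reaction_lag, remaining)
    punished = remaining - exploited
    return (T - R) * exploited - (R - P) * punished


# Columns: lag, gain from defecting at the midpoint, gain at the endpoint.
for lag in (0.1, 0.01, 0.001, 0.0):
    print(lag, deviation_gain(0.5, 1.0, lag), deviation_gain(1.0, 1.0, lag))
```

With zero lag, an interior defection is strictly costly and a defection at the endpoint gains exactly nothing, which matches the measure-zero remark. With a lag that is large relative to the time remaining the sign can flip, so the conclusion does depend on how reaction time is defined.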


05-26-2015 , 04:17 PM

TomCowley

Pooh-Bah

Join Date: Sep 2004 Posts: 5,649

I think the issues with using the model in real life are a combination of people not having perfectly malleable strategies in all situations, the ability to choose future partners to some degree based on reputation/past experience, and not knowing in most cases how long the interactions are going to last, or the overall game. The strategy for a game like that, in handwaving terms, is for most everybody to cooperate at least most of the time. When you start screwing with the constraints like not allowing people to shun defectors and allowing reputation to be masked or rebought too cheaply, it can all go to hell in a hurry.


05-26-2015 , 05:12 PM

#10

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

Yeah, it's not immediately obvious to me how to pass to the limit here in a meaningful way. It doesn't matter how many iterations or how small the intervals are, with an actual number of iterations of actual value and a known ending, the standard backwards induction works. In the continuous model with a known ending, there's no induction, and with instant reaction time, I don't see how to exploit the strategy of F(0)=cooperate and F(t)=if opponent has ever defected on [0,t), defect, else cooperate. You can pick up measure-zero "equity" by defecting at a point, but unless it's the endpoint, you're costing yourself in the future, and if it is the endpoint, it's meaningless because it's measure-zero (and finitely negative payoff per unit time).

Right. Running ideas up the flagpole here, but sometimes when passing to a limit it produces a discontinuity. For example, when passing to a limit at the probability measure induced at infinity for fair coin flips with a bet-it-all betting scheme, the constant 0 EV at every finite time jerks down to a -1 EV at infinity. So maybe just accept that the strategy changes at the limit. Not sure how that would work though.

Another idea is that maybe taking the limit of known fixed time finer and finer iterated PD's is the wrong approach. You might instead take the limit by somehow "compressing" infinite time iterated PD's down to a finite fixed time continuous decision PD. A kind of compactification maybe. idk.

Or maybe some more imaginative approach working on the space of finite random time iterated PD's.

PairTheBoard


05-26-2015 , 05:21 PM

#11

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

In real life I find it hard to believe in a 2 player fixed time iterated PD with an extremely high frequency and large number of iterations that people are really going to buy into thinking about the last iteration and the chain of logic it implies when treating it like a random time IPD for what's right in front of them has the potential to make so much more payoff if their opponent sees it the same way.

PairTheBoard


05-26-2015 , 08:51 PM

#12

BrianTheMick2

Long way to go and a short time to get there.

Join Date: May 2012 Posts: 19,410

Quote:

Originally Posted by Zeno

Someone here should be smart enough to figure out if this is all nonsense with regard to PD or if there really is something to it.

There isn't. It effectively becomes a cooperative game or something about equivalent local lows being equivalent lows.

Quote:

Also, does god have a stop watch?

We all have access to one.


05-26-2015 , 09:57 PM

#13

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

I don't think the game having a cooperate-cooperate stable solution makes it inherently less interesting than the PD with its defect-defect solution. Especially with the two having at least an appearance of being closely related.

There's been considerable interest in how to turn a dog eat dog law of the jungle world into one of mutual cooperation. You might say the variable continuous decision PD is the civilized version of the prisoner's dilemma.

PairTheBoard


05-26-2015 , 10:22 PM

#14

TomCowley

Pooh-Bah

Join Date: Sep 2004 Posts: 5,649

Quote:

Originally Posted by PairTheBoard

Sure, over a large sample like that, the general goal isn't to be unexploitable, it's to make money (assuming getting pwned once by a defector isn't some hugely bad outcome), and there's the assumption that some decent percentage of the population is going to cooperate with you, so it's reasonable to try to cooperate all the way and it's probably actually advantageous in the aggregate (you get small losses relative to GTO when you find a defector and big wins when you don't). So let's even pretend that we run this enough that all the initial defectors get weeded out, and somebody randomly evolves the strategy of cooperate until the last, then defect. Clearly that guy has an edge, and it may eventually become common knowledge that you punt the last one, but everybody cooperates to N-1.. and then somebody evolves the defect at N-2, etc. We're clearly not at the endgame where everybody assumes they're going to get screwed at every step, and a related 1-shot game was tested AMONG PEOPLE WHO KNEW THE GTO STRATEGY and they still mostly "cooperated" even though they knew that wasn't GTO (and crushed the defectors).

It's clear that your perception of the right strategy is a function of your perception of what other people will do, and that there's no inherent reason to believe that it will be GTO even if you know they know GTO.

I also don't think the martingale example is relevant- that's just a terrible model for the actual situation because it ignores (by design, not ad hoc) that the negative payoff increases as fast or faster than the probability of it decreases. I think a better way to take the limit, is asking what the maximum value lost compared to GTO of Cooperate first, then cooperate until opponent defects is as the intervals get smaller and smaller, and THAT limit actually does approach 0 (because you can't get pwned for more than 1 decreasing iteration) and matches the continuous case in the limit. But then you have a discontinuity in the value of always defect, so wtf knows.
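One possible reading of that limit, sketched in code. The assumptions are mine: payoffs T=5, R=3, P=1, S=0, a total payoff normalized by 1/N, and "value lost compared to GTO" interpreted as the shortfall relative to P, the normalized payoff Always-Defect guarantees itself. The opponent is assumed to keep defecting after his first defection, which is the worst case for the trigger player since any later cooperation would hand it T.

```python
from fractions import Fraction

# Assumed one-shot payoffs (temptation, reward, punishment, sucker).
T, R, P, S = 5, 3, 1, 0


def trigger_vs_first_defection(n, j):
    """Normalized payoff to 'cooperate until opponent defects' in an
    n-round game when the opponent first defects in round j (1-based)
    and keeps defecting afterwards."""
    total = (j - 1) * R + S + (n - j) * P
    return Fraction(total, n)


def max_shortfall(n):
    """Largest amount by which the trigger strategy falls short of P."""
    worst = min(trigger_vs_first_defection(n, j) for j in range(1, n + 1))
    return P - worst


for n in (1, 10, 100, 1000):
    print(n, max_shortfall(n))
```

Under these assumptions the shortfall is (P - S)/N, a single exploited iteration whose weight shrinks as the intervals do.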


05-26-2015 , 10:50 PM

#15

Olangotang

banned

Join Date: Oct 2009 Posts: 1,482

With the passing of Nash the idea came to me that his equilibrium will and can be broken, as you explained. I am sorry if I'm Delusional


05-26-2015 , 10:58 PM

#16

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

I also don't think the martingale example is relevant- that's just a terrible model for the actual situation because it ignores (by design, not ad hoc) that the negative payoff increases as fast or faster than the probability of it decreases.

Not finding any disagreements with your post except I didn't understand what you were referring to here. What "martingale example"? I gave a "bet it all" example for fair coin flips but that had nothing to do with the PD other than motivation. You always bet your entire current bankroll. With a starting Bankroll of 1 the EV after any number of finite plays is 0 whereas in the limit at infinity the EV jumps down to -1. Of course in practice the EV isn't what's important. What matters is the probability of being bankrupt after n plays which tends to 1 and is continuous at infinity. That's the whole point of Kelly betting.
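The bet-it-all numbers can be checked exactly. This is a small sketch with a hypothetical helper name `bet_it_all`. After n fair flips the bankroll is 2^n with probability 2^-n and 0 otherwise.

```python
from fractions import Fraction


def bet_it_all(n):
    """Return (expected net result, probability of being bankrupt) after
    n fair coin flips, starting from a bankroll of 1 and always betting
    the entire current bankroll."""
    p_alive = Fraction(1, 2**n)      # won every single flip
    bankroll_if_alive = 2**n         # doubled n times
    ev_net = p_alive * bankroll_if_alive - 1
    p_bust = 1 - p_alive
    return ev_net, p_bust


for n in (1, 5, 10, 20):
    print(n, bet_it_all(n))
```

The expected net result is 0 for every finite n, while the bankruptcy probability 1 - 2^-n tends to 1. In the limit the bankroll is 0 with probability 1, which is where the -1 comes from.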

So with that motivation in mind my thought was that a discontinuity in the limit for the last-next to last-... defect-defect logic chain solution of the fixed iterations IPD may not be that big a deal.

PairTheBoard


05-26-2015 , 11:22 PM

#17

TomCowley

Pooh-Bah

Join Date: Sep 2004 Posts: 5,649

Quote:

Originally Posted by PairTheBoard

Yeah I just skimmed and assumed it was the usual martingale but it has basically the same property (an arbitrarily large positive payoff that grows as fast as the chance of receiving it decreases which the measure-theory analysis isn't a good model of IMO). In the PD passing you're literally making the value of defecting equal to 0 in the limit, and at least for some class of strategies that can be expressed discretely and in the limit, the relative values seem to be continuous and it's the GTO strategy designation that changes wildly. It's not strange for THAT to happen, but it is kind of odd, at least as far as I've seen, for the GTO VALUE to change that much.


05-27-2015 , 09:18 AM

#18

river_tilt

old hand

Join Date: Apr 2006 Posts: 1,969

http://robboyd.abcs.asu.edu/LeBoydContPDJTB.pdf

From the abstract, it is difficult for cooperative strategies to invade a non-collaborative equilibrium; and this effect is greater than in a discrete time version.


05-27-2015 , 01:47 PM

#19

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by river_tilt

This looks like a completely different idea and different model from what I propose.

From the abstract:
-------------------------
The iterated prisoner’s dilemma (IPD) has been widely
used in the biological and social sciences to model dyadic
reciprocity. In the discrete version of the IPD, during each
interaction players engage in a standard prisoner’s dilemma
game in which they have only two options in each iteration:
cooperate or defect. In continuous IPDs, in each iteration
players’ contributions vary along a continuum ranging
from pure defection to pure cooperation. When a player
increases her contribution level, her payoff decreases, but
the average payoff of the pair increases.
------------------------

So they are still doing discrete iterations of PD's. But when making his decision in one of the discrete iterations, instead of an either-or decision (cooperate or defect) a player can choose anything in between on a sliding scale. For example he can choose 0.23 cooperate - 0.77 defect. I think it's quite interesting and it looks like it has some illuminating implications but it's not what I'm proposing.

PairTheBoard


05-28-2015 , 03:41 PM

#20

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

I've been puzzling over your comments for several days trying to clarify my thoughts about what's going on here.

First I'd like to rename the model I proposed as The Continuous Time Prisoner's Dilemma (CTPD) so it won't be confused with already established models like the one linked to by river_tilt.

I'd like to denote the Iterated Prisoner's Dilemma with known fixed number of iterations by IPD(N) where N is the number of iterations. Throughout this discussion the IPD(N) will always be normalized. The two players will play iterations I(1),...,I(N) of the PD with payoffs for I(k) determined by the PD matrix multiplied by the normalization factor 1/N. A player's decision on I(k+1) may depend on the decisions made by both players on I(1), ... , I(k). In other words, player strategies can be dynamic.
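A minimal sketch of the normalized IPD(N) as defined above. The payoff matrix (T=5, R=3, P=1, S=0), the function name `play_ipd`, and the strategy signature are assumptions for illustration. A strategy sees both histories, so its move on I(k+1) can depend on everything played on I(1), ..., I(k).

```python
from fractions import Fraction

# Assumed one-shot payoffs: (row player's payoff, column player's payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def play_ipd(strategy_a, strategy_b, n):
    """Play IPD(n) and return both players' normalized totals.

    A strategy is a function (own_history, opponent_history, n) that
    returns "C" or "D".
    """
    hist_a, hist_b = [], []
    total_a = total_b = Fraction(0)
    for _ in range(n):
        move_a = strategy_a(hist_a, hist_b, n)
        move_b = strategy_b(hist_b, hist_a, n)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        total_a += Fraction(pay_a, n)  # normalization factor 1/N
        total_b += Fraction(pay_b, n)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b


def always_defect(own, opp, n):
    return "D"


def tit_for_tat(own, opp, n):
    # Cooperate first, then copy the opponent's previous move.
    return opp[-1] if opp else "C"


print(play_ipd(tit_for_tat, tit_for_tat, 10))
print(play_ipd(always_defect, tit_for_tat, 10))
```

Because of the 1/N factor, totals for different N are directly comparable: two Tit-for-Tat players always score R each, whatever N is.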

For the GTO I'm going by this explanation of it in post #1 by The Bryce here:
http://forumserver.twoplustwo.com/94...holdem-245479/
------------------
Game Theory Optimal (GTO): A strategy that yields the highest possible EV (or: “is optimal”) if your opponent always chooses the best possible counter-strategy. In a game of rock-paper-scissors the GTO strategy is to choose randomly from an equal distribution of paper, scissors, and rocks. If you play rock less often than paper, you will have less than ½ equity against an all scissors strategy. Similarly, you must play paper at least as often as you play scissors, and scissors at least as often as you play rock. As a result, you must play paper, scissors, and rocks with equal frequency to guarantee ½ equity against all strategies. So long as your opponent always chooses the optimal counter-strategy to whatever strategy you choose no strategy on your part can have a higher EV than this.
------------------
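The rock-paper-scissors claim in the quoted definition can be checked directly. In this sketch equity is taken as 1 for a win, 1/2 for a tie and 0 for a loss (an assumption consistent with the "½ equity" wording), and `worst_case_equity` is a hypothetical helper name.

```python
from fractions import Fraction


def worst_case_equity(rock, paper, scissors):
    """Lowest equity a mixed strategy gets against any pure counter.

    Equity counts a win as 1, a tie as 1/2 and a loss as 0.
    """
    half = Fraction(1, 2)
    vs_rock = paper + half * rock          # paper beats rock, rock ties
    vs_paper = scissors + half * paper     # scissors beats paper
    vs_scissors = rock + half * scissors   # rock beats scissors
    return min(vs_rock, vs_paper, vs_scissors)


uniform = (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))
rock_light = (Fraction(1, 5), Fraction(2, 5), Fraction(2, 5))

print(worst_case_equity(*uniform))
print(worst_case_equity(*rock_light))
```

The uniform mix guarantees 1/2, while the rock-light mix drops to 2/5 against all scissors, as the definition says it should.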

Not the easiest concept, but if I understand it "Always Defect" is not GTO for IPD(N) in a general population of opponents. It's non-exploitable and any other strategy does worse against it than another Always-Defect does, but that doesn't make it GTO. You have to compare its EV against another Always-Defect to other candidates' EVs against their optimal counter strategies.

For example, look at Tit-for-Tat as an alternate candidate. While Always-Defect is an exploitive counter strategy to Tit-for-Tat it is not the optimal counter strategy. Such a counter strategy must do at least as well as the TFT-Mod modified Tit-for-Tat which varies from Tit-for-Tat only by always defecting on I(N). TFT-Mod also exploits Tit-for-Tat and does so gaining a much higher EV than Always-Defect gains against another Always-Defect.
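The comparison above can be made concrete with a small simulation. This is a sketch under assumed payoffs (T=5, R=3, P=1, S=0) and N=10. The names `play`, `tft_mod` and the strategy signature are illustrative, not from the thread.

```python
from fractions import Fraction

# Assumed one-shot payoffs: (row player's payoff, column player's payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def play(strategy_a, strategy_b, n):
    """Normalized IPD(n) totals for two history-dependent strategies."""
    hist_a, hist_b = [], []
    total_a = total_b = Fraction(0)
    for _ in range(n):
        move_a = strategy_a(hist_a, hist_b, n)
        move_b = strategy_b(hist_b, hist_a, n)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        total_a += Fraction(pay_a, n)
        total_b += Fraction(pay_b, n)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b


def always_defect(own, opp, n):
    return "D"


def tit_for_tat(own, opp, n):
    return opp[-1] if opp else "C"


def tft_mod(own, opp, n):
    """Tit-for-Tat, except it always defects on the final iteration I(N)."""
    if len(own) == n - 1:
        return "D"
    return opp[-1] if opp else "C"


N = 10
print("TFT-Mod vs TFT:", play(tft_mod, tit_for_tat, N))
print("Always-Defect vs TFT:", play(always_defect, tit_for_tat, N))
print("Always-Defect vs Always-Defect:", play(always_defect, always_defect, N))
```

With these numbers TFT-Mod earns 16/5 against Tit-for-Tat versus 7/5 for Always-Defect, and Tit-for-Tat still earns 27/10 against TFT-Mod, well above the 1 that Always-Defect earns against itself.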

Always-Defect is only GTO in the highly restricted population of opponents who always defect on I(2), ... , I(N). Of course that population consists of only 2 strategies, Cooperate or Defect on I(1). In other words, when the game has been reduced to a one-off PD.

So where does the idea that Always-Defect is somehow GTO for IPD(N) come from? It comes from the fact that under adequate freedom for a population of IPD(N) strategies to evolve over multiple runs of strategies playing against each other, there is evolutionary pressure for the population to evolve toward ones looking progressively more and more like Always-Defect, starting at I(N) and working backwards in I(N-1), I(N-2), ... , I(1).

As this post is getting too long I'll continue in the next post.

PairTheBoard


05-28-2015 , 08:48 PM

#21

TomCowley

Pooh-Bah

Join Date: Sep 2004 Posts: 5,649

I'm not sure that definition is necessarily right. I was under the impression, although I'm not having any luck finding any references to back either position up, that the condition was no matter WHAT strategy the opponent chose. In a zero-sum 2-player game, they're the same thing, because for any fixed strategy you have, his maximum is always your minimum, but here they're not the same because he can pwn both of you with always-defect. Basically yours seems like "Given that I announce whatever strategy in advance, I get the maximum EV assuming my opponent maximizes his EV against that strategy" (cooperating until the last round as we both observed). That probably has a name somewhere.


05-28-2015 , 10:42 PM

#22

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

When I google GTO I only get references to GTO in poker. As far as I can tell it's a term that was invented for poker probably right here at 2+2. And from what I can tell, everybody points to that post by The Bryce for the best explanation of it. I think the possibly ambiguous phrase is "optimal counter strategy". If "optimal" means you want to minimize your opponent's EV against you then Always-Defect does that against Tit-for-Tat. But if optimal means you want to maximize your own EV against that opponent then TFT-mod does a good job of that against Tit-for-Tat. The latter makes more sense to me.

To be clear, I haven't proved TFT-mod is the optimal counter strategy for Tit-for-Tat. It is better than Always-Defect. What I've shown is that if TFT-mod is the optimal counter strategy for Tit-for-Tat then Always-Defect cannot be GTO under The Bryce definition not because TFT-mod scores the high EV but because Tit-for-Tat scores a higher EV against its optimal counter strategy than Always-Defect does against its optimal counter strategy. That makes Tit-for-Tat a better candidate for GTO than Always-Defect is.

However it's conceivable there's another counter to Tit-for-Tat which scores a higher EV against it than TFT-mod does, and which screws Tit-for-Tat into a worse EV in their matchup than Always-Defect scores against Always-Defect. I don't think there is such a counter strategy for Tit-for-Tat, but if there were then Tit-for-Tat would not be a better candidate than Always-Defect for GTO.

If the concept of GTO was invented for Poker or more generally for zero sum games where strategies must be win-lose, lose-win, or even-even then it may not be the best concept for a non-zero sum game like IPD(N) where strategies include lose-lose and win-win along with win-lose, lose-win.

I'm thinking the more interesting thing to look at is the Evolutionary pressure that drives a population of strategies toward Always-Defect in a back tracking step by step way for the IPD(N). That pressure is zero in the CTPD limit and I think it declines toward zero in the IPD(N) as N gets large.
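One crude way to put a number on that pressure, offered as a sketch rather than a model of real evolution. It restricts attention to "threshold" strategies that cooperate for the first k rounds and defect afterwards (switching to permanent defection if the opponent defects first), assumes payoffs T=5, R=3, P=1, S=0 with 1/N normalization, and uses the gain from undercutting an opponent by one round as a proxy for pressure. All of these choices are assumptions.

```python
from fractions import Fraction

# Assumed one-shot payoffs (temptation, reward, punishment, sucker).
T, R, P, S = 5, 3, 1, 0


def threshold_payoff(n, k, m):
    """Normalized IPD(n) payoff to a player with threshold k against an
    opponent with threshold m (0 <= k, m <= n)."""
    if k < m:
        # k rounds of mutual cooperation, one round of exploiting, rest D-D.
        total = k * R + T + (n - k - 1) * P
    elif k == m:
        total = k * R + (n - k) * P
    else:
        # The opponent defects first, in round m + 1.
        total = m * R + S + (n - m - 1) * P
    return Fraction(total, n)


def best_response(n, m):
    """Threshold that maximizes the payoff against threshold m."""
    return max(range(n + 1), key=lambda k: threshold_payoff(n, k, m))


def undercut_gain(n, m):
    """Gain from defecting one round earlier than an opponent with threshold m."""
    return threshold_payoff(n, m - 1, m) - threshold_payoff(n, m, m)


n = 10
print([best_response(n, m) for m in range(n, 0, -1)])
for size in (10, 100, 1000):
    print(size, undercut_gain(size, size))
```

Within this restricted family the best reply to threshold m is always m - 1, which reproduces the step-by-step backtracking, while the gain from each step is (T - R)/N and so shrinks toward zero as N grows.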

PairTheBoard


05-28-2015 , 11:33 PM

#23

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

So you're thinking that the GTO strategy is the strategy that gets the best minimal result across all strategies? That would be compatible with the 1/3,1/3,1/3 GTO mixed strategy for rock-paper-scissors. And it would make Always-Defect GTO for the IPD(N). Yea, a reference would be nice. I'll have to think about it. In non-zero sum it seems oriented more toward being best at screwing everybody even if it means cutting off your own nose. A strange kind of "optimization".

PairTheBoard


05-29-2015 , 12:00 AM

#24

PairTheBoard

Carpal \'Tunnel

Join Date: Dec 2003 Posts: 10,039

Quote:

Originally Posted by TomCowley

Quote:

Originally Posted by PairTheBoard

Suppose we define a game with 100 strategies. Strategies 2-100 all score 100 points when they play against another 2-100 strategy. But they all score 0 points against strategy 1. Strategy 1 scores 1 point against all strategies including itself. Does it make sense to say strategy 1 is GTO?
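The 100-strategy example can be worked through under both readings discussed in the thread. This is a sketch with hypothetical helper names. `guaranteed` is the "best worst case against every opponent" reading, and `score_vs_selfish_counter` is the reading where the opponent replies so as to maximize his own score (ties broken against the first player, an assumption of the sketch).

```python
STRATEGIES = range(1, 101)


def payoff(i, j):
    """Points scored by strategy i against strategy j."""
    if i == 1:
        return 1
    return 0 if j == 1 else 100


def guaranteed(i):
    """Worst-case score of strategy i over every possible opponent."""
    return min(payoff(i, j) for j in STRATEGIES)


def score_vs_selfish_counter(i):
    """Score of strategy i when the opponent picks a reply that maximizes
    the opponent's own score against i."""
    best = max(payoff(j, i) for j in STRATEGIES)
    replies = [j for j in STRATEGIES if payoff(j, i) == best]
    return min(payoff(i, j) for j in replies)


print("guaranteed:", guaranteed(1), guaranteed(2))
print("vs selfish counter:", score_vs_selfish_counter(1), score_vs_selfish_counter(2))
```

Strategy 1 wins on the worst-case reading (1 versus 0), while strategies 2-100 win on the selfish-counter reading (100 versus 1), so the two definitions pick different strategies in this game.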

The place where "against all other strategies" definitely makes sense is the original one-off PD. The reason Defect is so obviously good is that no matter what the other strategy is you can't do better against it than Defect. Defect is the best strategy against all other strategies. That doesn't work for the IPD(N). For that matter, it doesn't work for rock-paper-scissors.

PairTheBoard


05-29-2015 , 12:08 AM

#25

just_grindin

Pooh-Bah

Join Date: Dec 2007 Posts: 5,263

An equilibrium solution or optimal solution is simply a set of strategies where no player has an incentive to deviate from his or her strategy. Each is maximally exploiting the others at equilibrium (i.e. it's the best exploitation achievable against villain's equilibrium strategy).
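That "no incentive to deviate" condition is easy to check by brute force for the one-off PD. A short sketch, assuming the usual payoff numbers (T=5, R=3, P=1, S=0) and a hypothetical helper `is_equilibrium`:

```python
from itertools import product

# Assumed one-shot payoffs: (row player's payoff, column player's payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def is_equilibrium(a, b):
    """True if neither player can gain by unilaterally changing action."""
    pay_a, pay_b = PAYOFF[(a, b)]
    a_ok = all(PAYOFF[(alt, b)][0] <= pay_a for alt in ACTIONS)
    b_ok = all(PAYOFF[(a, alt)][1] <= pay_b for alt in ACTIONS)
    return a_ok and b_ok


equilibria = [pair for pair in product(ACTIONS, repeat=2) if is_equilibrium(*pair)]
print(equilibria)
```

With these payoffs the only pure-strategy profile that passes the check is defect-defect.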

GTO is unique in the poker world and as far as I understand it not used in the academic studies of game theory.
