Zero-sum Stochastic Games: Limit Optimal Trajectories
Sylvain Sorin (IMJ-PRG), Guillaume Vigeral (CEREMADE)

TL;DR
This paper investigates the behavior of zero-sum stochastic games as the discount factor approaches zero, focusing on the convergence and properties of limit optimal trajectories of payoffs and occupation measures.
Contribution
It introduces the concept of limit optimal trajectories in zero-sum stochastic games and analyzes their existence, uniqueness, and characterization for absorbing games.
Findings
Existence of limit optimal trajectories as discount factor tends to zero
Uniqueness conditions for these trajectories in absorbing games
Characterization of the structure of limit trajectories
Abstract
We consider zero-sum stochastic games. For every discount factor λ, a time normalization allows one to represent the game as being played on the interval [0, 1]. We introduce the trajectories of cumulated expected payoff and of cumulated occupation measure up to time t ∈ [0, 1], under ε-optimal strategies. A limit optimal trajectory is defined as an accumulation point as the discount factor tends to 0. We study existence, uniqueness and characterization of these limit optimal trajectories for absorbing games.
Sylvain Sorin
Sylvain Sorin, Sorbonne Université, UPMC Paris 06, Institut de Mathématiques de Jussieu-Paris Rive Gauche, UMR 7586, CNRS, F-75005, Paris, France
https://webusers.imj-prg.fr/sylvain.sorin
and
Guillaume Vigeral (corresponding author)
Université Paris-Dauphine, PSL Research University, CNRS, CEREMADE, Place du Maréchal De Lattre de Tassigny, 75775 Paris cedex 16, France
http://www.ceremade.dauphine.fr/ vigeral/indexenglish.html
Some of the results of this paper were presented in “Atelier Franco-Chilien: Dynamiques, optimisation et apprentissage” Valparaiso, November 2010 and a preliminary version of this paper was given at the Game Theory Conference in Stony Brook, July 2012. This research was supported by grant PGMO 0294-01 (France)
1. Introduction
The analysis of two person zero sum repeated games in discrete time may be performed along two lines:
- asymptotic approach: to each probability distribution on the set of stages () one associates the game where the evaluation of the stream of stage payoffs is , and one denotes its value by . Given a preordered family of probability distributions, one studies whether converges as “goes to ” according to . Typical examples correspond to -stage games (), -discounted games (), or more generally decreasing evaluations () with . The game has an asymptotic value when these limits exist and coincide.
- uniform approach: for each strategy of player 1, one evaluates the amount that can be obtained against any strategy of the opponent in any sufficiently long game for . This allows one to define a and the game has a uniform value when .
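As an illustration of the asymptotic approach, the following minimal sketch builds two standard families of evaluations and applies them to a stream of stage payoffs. The explicit weights are assumptions: they are the usual textbook choices (weight 1/n on each of the first n stages, and weight λ(1-λ)^(m-1) on stage m), and the function names, the truncation horizon and the alternating payoff stream are hypothetical.

```python
import numpy as np

def n_stage_weights(n, horizon):
    """Uniform weight 1/n on the first n stages, 0 afterwards (assumed form)."""
    theta = np.zeros(horizon)
    theta[:n] = 1.0 / n
    return theta

def discounted_weights(lam, horizon):
    """Weight lam * (1 - lam)**(m - 1) on stage m = 1, 2, ..., truncated at `horizon`."""
    m = np.arange(1, horizon + 1)
    return lam * (1.0 - lam) ** (m - 1)

def evaluate(theta, payoffs):
    """Evaluation of a stream of stage payoffs: the theta-weighted sum."""
    return float(np.dot(theta, payoffs))

horizon = 5000
# Hypothetical stream of stage payoffs: 1, 0, 1, 0, ...
payoffs = np.where(np.arange(horizon) % 2 == 0, 1.0, 0.0)

n_stage = evaluate(n_stage_weights(1000, horizon), payoffs)
discounted = evaluate(discounted_weights(0.01, horizon), payoffs)
print(n_stage, discounted)
```

For this particular stream both evaluations are close to 1/2, which is the kind of agreement that the notion of asymptotic value asks for at the level of game values.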
The second approach is stronger than the first one (the existence of implies the existence of and their equality), but there are games with asymptotic value and no uniform value (incomplete information on both sides, Aumann and Maschler [1], Mertens and Zamir [7]; stochastic games with signals on the moves). The first approach deals only with families of values while the second explicitly considers strategies. The main difference is that in the first case the ()-optimal strategies of the players may depend on the evaluation represented by .
We focus here on a class of games where this dependence has a smooth representation. Basically in addition to the asymptotic properties of the value one studies the asymptotic behavior along the play induced by ()-optimal strategies.
The first step is to normalize the duration of the game using the evaluation . We consider each game as being played on , stage lasting from time to time (with ).
Note that here time corresponds to the fraction of the total duration of the game, as evaluated through . In particular, given ()-optimal strategies in , the stream of expected stage payoffs generates a bounded measurable trajectory on and one will consider its asymptotic behavior.
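The time normalization can be made concrete in the discounted case. The sketch below assumes that stage n occupies the interval between the cumulative weights of the first n-1 and the first n stages, with stage weights λ(1-λ)^(m-1); under that assumption the end of stage n is 1-(1-λ)^n. The function name is hypothetical.

```python
import numpy as np

def stage_end_times(theta):
    """Cumulative weights: stage n occupies the interval [t_{n-1}, t_n], with t_0 = 0."""
    return np.concatenate(([0.0], np.cumsum(theta)))

lam = 0.1
m = np.arange(1, 201)
theta = lam * (1.0 - lam) ** (m - 1)      # discounted weights (assumed form)

t = stage_end_times(theta)
closed_form = 1.0 - (1.0 - lam) ** np.arange(0, 201)

print(t[:4])            # the first stage alone covers a fraction lam of the game
print(closed_form[:4])
```

With λ = 0.1 the first stage covers 10% of the normalized duration, and the stages become shorter geometrically as the play proceeds.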
The next section introduces the basic definitions and concepts that allow us to describe our results. The main proofs are in Section 3. Further examples are in Sections 4 and 5.
To end this quick overview let us recall that there are games which do not have an asymptotic value: stochastic games with compact action spaces, Vigeral [15]; finite stochastic games with signals on the state, Ziliotto [16]; or more generally Sorin and Vigeral, [13].
2. Limit optimal trajectories
Let be a two-person zero-sum stochastic game with state space , action spaces and , stage payoff and transition from to ℝ (resp. ). We assume that is finite, and are compact metric, and continuous. We keep the same notation for the multilinear extensions to and , where as usual denotes the probabilities on .
For any pair of stationary strategies , any state and any stage , denote by the expected payoff at stage under these stationary strategies, given the initial state , and by the corresponding distribution of the state at stage . Hence where stands for the vector payoff with component in state given by .
Definition 1.
For any , any discount factor , and any starting state , define the function by
[TABLE]
for and a linear interpolation between these dates .
Thus corresponds to the expectation of the accumulated payoff for the first stages in the discounted game, or up to time and to the same at the fraction of the game, both under and in the -discounted game starting from .
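The construction of Definition 1 can be sketched numerically as follows. The code takes a given stream of expected stage payoffs (rather than deriving it from strategies), assumes the discounted weights λ(1-λ)^(m-1) and the stage end times 1-(1-λ)^n, and interpolates linearly between these dates. The function name and the two payoff streams are hypothetical.

```python
import numpy as np

def payoff_trajectory(stage_payoffs, lam):
    """Return a function t -> cumulated discounted expected payoff up to time t."""
    stage_payoffs = np.asarray(stage_payoffs, dtype=float)
    n = np.arange(len(stage_payoffs) + 1)
    times = 1.0 - (1.0 - lam) ** n             # t_0 = 0, t_1, t_2, ...
    weights = lam * (1.0 - lam) ** n[:-1]      # weight of stage m = 1, 2, ...
    values = np.concatenate(([0.0], np.cumsum(weights * stage_payoffs)))
    # Linear interpolation between the dates t_n.
    return lambda t: np.interp(t, times, values)

lam = 0.01
# Constant expected stage payoff 2: the trajectory is linear, equal to 2 * t.
constant = payoff_trajectory(np.full(3000, 2.0), lam)
# Payoff 1 during the first 100 stages, 0 afterwards: the trajectory flattens.
front_loaded = payoff_trajectory(np.r_[np.ones(100), np.zeros(2900)], lam)

print(float(constant(0.3)), float(front_loaded(1.0)))
```

A constant expected stage payoff gives a linear trajectory, which is the shape that the results on absorbing games below establish for the limit.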
Let denote the set of positive measures on . We introduce similarly the expected accumulated occupation measure at time under and in the -discounted game starting from as follows:
Definition 2.
For any , any discount factor , and any starting state , define the function by:
[TABLE]
and by a linear interpolation between these dates .
Note that for any , .
Denote by and the -vectors of functions and respectively.
Limit trajectories for the payoff and occupation measures will be defined as accumulation points of and under dependent -optimal strategies and as tends to 0.
More precisely, denote by (resp. ) the set of -optimal stationary strategies in the -discounted game (with value ) for Player 1 (resp. for Player 2).
Then we introduce:
Definition 3.
is a limit optimal trajectory for the expected accumulated payoff (LOTP) if:
[TABLE]
is a limit optimal trajectory for the expected accumulated occupation measure (LOTM) if:
[TABLE]
Alternate weaker and stronger definitions are as follows: in both cases, if “” is replaced by “for some going to 0”, we will speak of a weak LOTP (resp. LOTM). If “” is replaced by “”, we will speak of a strong LOTP (resp. LOTM).
Remark 4.
- a weak LOTP (resp. LOTM) always exists by standard arguments of equicontinuity.
- if a LOTP exists, converges to .
- if a strong LOTP (resp. LOTM) exists, it is unique.
- no strong LOTM exists in general (just consider a game where the payoff is always 0).
- if the game has a uniform value and both players use -optimal strategies, the average expected payoff is essentially constant along the play.
A first approach to this topic concerns one player games (or games where one player controls the transitions), where there is no finiteness assumption on . Assume that converges uniformly; then there exists a strong LOTP and it is linear w.r.t. , which means that the expected payoff is constant along the trajectory (Sorin, Venel and Vigeral [14]).
The same article provides an example of a two player game with finite action and countable state spaces, where LOTP is not unique.
The main contributions of the current paper are:
- For absorbing games, existence of a linear LOTP, and existence of a “geometric” algebraic LOTM.
- For finite absorbing games, existence of a strong LOTP.
- An example of a finite game where LOTM is not semialgebraic.
- An example of a compact absorbing game with non uniqueness of LOTP.
Let us mention recent results of Oliu-Barton and Ziliotto [8] establishing the existence of linear strong LOTP for finite stochastic games and optimal strategies: the class of games is larger and they allow for any kind of optimal strategies. Our results deal with compact action spaces and -optimal strategies.
As a final comment, let us underline the fact that the previous concepts and definitions can be extended to any repeated game, for any evaluation and any type of strategies.
3. Absorbing games
An absorbing game is defined by two sets of actions and , two stage payoff functions , from to and a probability of absorption from to
and are compact metric sets, and are (jointly) continuous.
The repeated game is played in discrete time as follows. At stage (if absorption has not yet occurred) player 1 chooses and, simultaneously, player 2 chooses :
(i) the payoff at stage is ;
(ii) with probability absorption is reached and the payoff in all future stages is ;
(iii) with probability the situation is repeated at stage .
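Rules (i) to (iii) determine the discounted payoff of a fixed pair of actions played at every stage: it solves r = λ g + (1-λ) [p* g* + (1-p*) r]. The sketch below, with hypothetical function names and hypothetical numerical values, compares the resulting closed form with a direct stage-by-stage summation. It assumes a single pair of pure actions, so that the absorbing payoff g* is a fixed number.

```python
def discounted_payoff(lam, g, p_abs, g_abs):
    """Closed form of r = lam*g + (1-lam)*(p_abs*g_abs + (1-p_abs)*r)."""
    return (lam * g + (1 - lam) * p_abs * g_abs) / (lam + (1 - lam) * p_abs)

def discounted_payoff_by_summation(lam, g, p_abs, g_abs, stages=20000):
    """Same quantity, summing the discounted expected stage payoffs one by one."""
    total = 0.0
    alive = 1.0   # probability that absorption has not occurred before the current stage
    for m in range(stages):
        expected_stage_payoff = alive * g + (1 - alive) * g_abs
        total += lam * (1 - lam) ** m * expected_stage_payoff
        alive *= 1 - p_abs
    return total

# Hypothetical data: stage payoff 1, absorption probability 0.2, absorbing payoff -1.
closed = discounted_payoff(0.05, 1.0, 0.2, -1.0)
summed = discounted_payoff_by_summation(0.05, 1.0, 0.2, -1.0)
print(closed, summed)
```

As λ goes to 0 with a fixed positive absorption probability, the closed form tends to the absorbing payoff, since absorption then occurs within a vanishing fraction of the game.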
Recall that the asymptotic analysis for these games is due to Kohlberg [3] in the case where and are finite and Rosenberg and Sorin [9] in the current framework. In either case the value of the discounted game converges to some as goes to 0. This does not require any assumption on the information of the players. In case of full observation of the actions, or of the stage payoff, a uniform value exists; see Mertens and Neyman [5] in the finite case and Mertens, Neyman and Rosenberg [6] for compact actions.
Recall that and are the sets of probabilities on and . The functions , and are bilinearly extended to . Let
[TABLE]
is thus the expected absorbing payoff conditional on absorption (and is thus only defined for ).
3.1. An auxiliary game
Consider the two-person zero-sum game , defined for any and , by the payoff function
[TABLE]
3.1.1. General properties
The following proposition extends to the compact case results due to Laraki [4] in the finite case (later simplified by Cardaliaguet, Laraki and Sorin [2]).
Proposition 5.
1) The game has a value, which is .
More precisely
[TABLE]
2) Moreover, if is -optimal in the game then for any small enough the stationary strategy is -optimal in .
Proof.
1) Consider an accumulation point of the family and let such that converges to .
We will show that
[TABLE]
A dual argument proves at the same time that the family converges and that the auxiliary game has a value.
Let be the payoff in the game , induced by a pair of stationary strategies . It satisfies
[TABLE]
hence
[TABLE]
In particular for any optimal for Player 1 in , one obtains
[TABLE]
that one can write
[TABLE]
Let be an accumulation point of and given let in the sequence such that
[TABLE]
(we use the fact that is uniformly continuous on ) and
[TABLE]
Then with and , (6) implies
[TABLE]
On the other hand, going to the limit in (5) leads to
[TABLE]
We multiply (7) by the denominator and we add to (8) multiplied by to obtain the property:
and such that
[TABLE]
which implies (2). Note moreover that is independent of , hence the result.
2) Let be -optimal in the game and . Using (4) one obtains
[TABLE]
Note that
[TABLE]
Thus
[TABLE]
where is a bound on the payoffs. Hence, for any
[TABLE]
for small enough. ∎
is an auxiliary limit game in the sense that:
i) There is a map from to (that associates to a strategy of player 1 in and a discount factor a stationary strategy of player 1 in ).
ii) There is a map from to (that associates to a stationary strategy of player 2 in and a discount factor a stationary strategy of player 2 in ).
iii)
[TABLE]
iv) A dual property holds.
These properties imply: exists and equals .
We then recover Corollary 3.2 in Sorin and Vigeral [12], with a new proof that will be useful in the sequel.
Corollary 6.
[TABLE]
where is the median of three numbers, and with the usual convention that ; . Moreover if (resp ) is -optimal in then (resp. ) is -optimal in (10).
Proof.
For any fix a triplet of the second player -optimal in , where we can assume that does not depend on by the previous result. Then for any such that , one has
[TABLE]
thus . Denote similarly .
On the other hand, for any
[TABLE]
Now if , , hence in all cases
[TABLE]
Thus for any
[TABLE]
Letting go to 0 and using the dual inequality establish the results. ∎
3.1.2. Further properties of optimal strategies
We establish here more precise results concerning the decomposition of the payoff induced by -optimal strategies in the game .
Proposition 7.
Let and be -optimal in the game .
a) If then
b)
c) If then .
Proof.
a) This is exactly equation (12) and its dual.
b) From we get
[TABLE]
On the other hand, hence . Combining both inequalities yields
[TABLE]
and the dual inequality is similar.
c) Since , one has
[TABLE]
hence
[TABLE]
and the dual inequality is similar. ∎
3.2. Asymptotic properties in
Since the game is absorbing, we write simply for , where is the nonabsorbing state.
Lemma 8.
Let and be two families of (not necessarily optimal) stationary strategies of Player 1 and Player 2 respectively. Assume that converges to some in as goes to 0. Then converges uniformly in to as goes to 0, with the natural convention that for .
Proof.
By Definition 2, for any and ,
[TABLE]
with linear interpolation between these dates.
Remark first that this implies that for all and , which gives at the limit the desired result if .
Assume now that , and thus that tends to 0 as goes to 0. Fix and , and let be the integer part of so that . Since is decreasing in ,
[TABLE]
Similarly,
[TABLE]
Letting go to 0 in (15) and (16) yields the result. ∎
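The convergence used in this proof can be checked numerically. The sketch below is an illustration under explicit assumptions, not a restatement of the lemma: the per-stage absorption probability is taken to be exactly γλ for a hypothetical constant γ, stage weights are λ(1-λ)^(m-1), and the candidate limit is the integral of (1-s)^γ over [0, t], that is (1-(1-t)^(1+γ))/(1+γ). All names are hypothetical.

```python
import numpy as np

def cumulated_occupation(lam, p_abs, t):
    """Discounted time spent in the nonabsorbing state up to normalized time t."""
    n_stages = int(np.ceil(60.0 / lam))        # enough stages to cover [0, 1) numerically
    n = np.arange(n_stages + 1)
    times = 1.0 - (1.0 - lam) ** n
    alive = (1.0 - p_abs) ** n[:-1]            # P(no absorption before stage m)
    weights = lam * (1.0 - lam) ** n[:-1]
    values = np.concatenate(([0.0], np.cumsum(weights * alive)))
    return np.interp(t, times, values)         # linear interpolation between dates

def limit_occupation(gamma, t):
    """Integral of (1 - s)**gamma over [0, t]."""
    return (1.0 - (1.0 - t) ** (1.0 + gamma)) / (1.0 + gamma)

lam, gamma = 1e-4, 2.0
ts = np.linspace(0.0, 0.99, 12)
gap = float(np.max(np.abs(cumulated_occupation(lam, gamma * lam, ts)
                          - limit_occupation(gamma, ts))))
print(gap)
```

In this parametrization the gap shrinks proportionally to λ, and with γ = 0 (no absorption) the limit is simply t.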
For any and define
[TABLE]
An immediate consequence of the previous lemma is
Corollary 9.
Let and and denote and . Then converges uniformly in to as goes to 0.
Proposition 10.
Any absorbing game has a LOTM for some , and a LOTP .
Proof.
For every let and be -optimal strategies for each player in (recall that and can be chosen independently of ). Up to extraction converges to some in .
Fix and let such that
[TABLE]
on [0,1]. By Proposition 5 the strategies and are -optimal in for small enough. Corollary 9 and equation (17) imply that
[TABLE]
for all small enough and . This answers the first part of the Proposition.
Clearly
[TABLE]
Recall that the payoff function is assumed bounded by 1. Then equations (18) and (19) imply that for small enough and every ,
[TABLE]
Since and converge to and , we then have for small enough
[TABLE]
.
We now consider four separate cases. Basically either or and equation (20) implies that is linear, and hence equals since by near optimality of the strategies and in ; or and then both and are close to by Proposition 7, which once again implies .
Case 1: . Then converges to as goes to 0, and by Proposition 7 a) . Since in that case, equation (20) yields for all small enough, uniformly in .
Case 2: and , hence (up to choosing a larger ). Then Proposition 7 b) implies , and equation (20) yields for all small enough, uniformly in .
Case 3: and , hence (up to choosing a larger ). Then Proposition 7 b) implies . Moreover, implies that converges to as goes to 0, and by Proposition 7 c) .
Hence equation (20) yields for all small enough, uniformly in .
Case 4: and , hence (up to choosing a larger ). Then converges to as goes to 0, and by Proposition 7 c) . Thus equation (20) yields for all small enough, uniformly in .
As claimed, in every case we see that . ∎
Remark 11.
Recall that and represent expected cumulated occupation measure and payoff. By differentiating these quantities with respect to we get that the asymptotic probability of still being in the non absorbing state at time is , and that the current asymptotic payoff is at any time.
Remark 12.
Let us give a simple heuristic behind the form for . Assuming that this quantity is well defined and smooth, note that at time the remaining game has a length and weight hence by renormalization (see figure below)
[TABLE]
so that
[TABLE]
which leads, with to for some .
[Figure: q(t) plotted against t; labels: t, 1, q(t), 1, q′(0), q′(t).]
Let us illustrate now the four cases in the preceding proof by giving examples.
Example 13.
Consider the absorbing game
[TABLE]
with asymptotic value 1. Let . Then is -optimal in for Player 1, while any is optimal for Player 2. Since for all case 1 occurs for any choice of , hence the corresponding is and for all .
Notice that in the only optimal stationary strategy is for each player, leading to the same asymptotic trajectory .
Example 14.
Consider the absorbing game
[TABLE]
with asymptotic value 1. Then is -optimal in , while any is optimal. The associated is , hence either and case 2 holds with and , or and case 4 occurs with and .
Notice that in the only optimal stationary strategy is for each player. Since , Lemma 8 implies that the asymptotic trajectory associated to optimal strategies is . Moreover for any the strategy of Player 2 is -optimal in for small enough, and hence any of the form is an asymptotic behavior.
Example 15.
Consider the Big Match
[TABLE]
with asymptotic value 1/2. Let . Then and are optimal in , with . Hence case 3 holds, and the corresponding is . The optimal strategies in are and respectively, leading to the same .
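The value 1/2 can be checked directly in the discounted game. The sketch below assumes the standard Big Match matrix (Top/Left = 1 absorbing, Top/Right = 0 absorbing, Bottom/Left = 0, Bottom/Right = 1), which is an assumption about the payoff table of this example, and uses the stationary payoff formula that follows from the rules of absorbing games. The function name is hypothetical.

```python
def big_match_payoff(lam, x, y):
    """Discounted payoff when Player 1 plays Top w.p. x and Player 2 plays Left w.p. y.

    Assumed matrix: Top/Left = 1 (absorbing), Top/Right = 0 (absorbing),
    Bottom/Left = 0, Bottom/Right = 1 (both nonabsorbing).
    """
    g = x * y + (1 - x) * (1 - y)      # expected stage payoff
    p_abs = x                          # absorption occurs exactly when Top is played
    pg_abs = x * y                     # absorption probability times absorbing payoff
    return (lam * g + (1 - lam) * pg_abs) / (lam + (1 - lam) * p_abs)

for lam in (0.5, 0.1, 0.01):
    x_lam = lam / (1 + lam)            # candidate stationary strategy of Player 1
    # Player 1 gets 1/2 against every stationary reply ...
    against = [big_match_payoff(lam, x_lam, y) for y in (0.0, 0.3, 1.0)]
    # ... and Player 2 holds him to 1/2 by mixing equally.
    held = [big_match_payoff(lam, x, 0.5) for x in (0.0, 0.2, 1.0)]
    print(lam, against, held)
```

Under this assumed matrix the absorbing action is played with probability λ/(1+λ), so the per-stage absorption probability is of order λ, in line with an occupation measure that decays over the whole normalized interval rather than at its start.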
3.3. Finite case
We now prove that when the game is finite, the limit payoff trajectory is linear for every couple of near optimal stationary strategies, not only those given by Proposition 5. That is, is a strong limit behavior for the cumulated payoff.
Proposition 16.
Let be a finite absorbing game with asymptotic value , and families of -optimal stationary strategies in , with going to 0 as goes to 0. Then for every , converges to as goes to 0.
We will use in the proof of this proposition the following elementary lemma given without proof.
Lemma 17.
Let be real numbers with and positive. Then with equality if and only if
Proof of Proposition 16.
The result is clear for or ; assume by contradiction that it is false for some . Hence there is a sequence going to 0 and optimal strategies and such that converges to with . Up to extraction of subsequences, and converge to and respectively. Also up to extraction, all the following limits exist in : , , and . If (and hence for large enough), denote , which also exists up to extraction.
Recall formula (4):
[TABLE]
and since and are families of -optimal strategies in , converges to as tends to infinity.
Recall that by Lemma 8, at the limit
[TABLE]
We first claim that .
If , , and by near optimality of and , hence a contradiction.
If , , and by near optimality of and , hence a contradiction.
Hence , and is a nontrivial convex combination of and . Since is also a convex combination of and , the assumption that implies . Assume without loss of generality .
We classify the actions of Player 1 into 4 categories to :
- if ,
- if and ,
- if and ,
- if and .
Hence actions of category 1 are of order 1, actions of category 3 are of order , actions of category 4 are of order , and actions of category 2 are played with probability going to 0 but large with respect to . Define categories to of player 2 in a similar way. By definition,
[TABLE]
Recall that the left hand side converges to , hence up to extraction converge in for any and , denote by the limit. If and are of category and with then . If then diverges to which implies that .
Hence going to the limit in (22) we get that where where , , and . Recall that thus at least one of , or is positive as well.
Similarly, passing to the limit in the definition of yields where , , and . Note that if for some then as well.
Finally, going to the limit in equation (21) yields
[TABLE]
and similarly going to the limit in the definition of yields
[TABLE]
Recall that we assumed . By Lemma 17, there exists such that and .
Assume first that . Consider now the following strategy : for , and for all other except for an arbitrary for which . The only effect of this deviation is that now . Hence
[TABLE]
which is strictly less than by Lemma 17 since ; this contradicts the -optimality of .
Assume next that . Consider now the following strategy : for , and for all other except for an arbitrary for which . The only effect of this deviation is that now . Hence
[TABLE]
which is strictly less than by Lemma 17 since ; this again contradicts the -optimality of .
Finally assume that . Consider now the following strategy : for and for all other except for an arbitrary for which (which is nonnegative for large enough). The only effect of this deviation is that now and . Hence
[TABLE]
which is strictly more than by Lemma 17 since ; this contradicts the -optimality of . ∎
4. Stochastic finite games
4.1. Non algebraic limit trajectories
Consider the following zero-sum stochastic game with two non absorbing states and two actions for each player. In the first state (which is the starting state) the payoff and transitions are as follows:
[TABLE]
where denotes absorption and that there is a deterministic transition to state 2. Starting from the second state the game is a linear variation of the Big Match:
[TABLE]
Since for all and since there is no return once the play has entered state , the optimal play in state is the same as in the Big Match, in which . So in both states the optimal strategies in are for Player 1 and for Player 2. By a scaling of time, the preceding section tells us that in the limit game, the probability of being in state 2 at time , given that there was a transition from to at time , is . Since (also from the preceding section) the time of transition from to another state (which is with probability ) has a uniform law on , the probability of being in at time is
[TABLE]
Notice that this is not an algebraic function of , whereas it always was in the preceding section. Similarly the probability of absorption before time is and is also non algebraic.
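The non algebraic character can be illustrated with a small numerical sketch. The displayed formula is not reproduced here, so the code relies on assumptions read off the preceding paragraph: the transition time s out of the first state is uniform on [0, 1], the conditional probability of still being in state 2 at time t given a transition at time s is (1-t)/(1-s), and the transition goes to state 2 with a hypothetical probability `p_move`. Under these assumptions the integral has the closed form -p_move (1-t) log(1-t), which involves a logarithm.

```python
import numpy as np

def prob_in_state_2(t, p_move=1.0, grid=200001):
    """p_move * integral over s in [0, t] of (1 - t) / (1 - s) ds (trapezoid rule)."""
    s = np.linspace(0.0, t, grid)
    integrand = (1.0 - t) / (1.0 - s)
    ds = s[1] - s[0]
    integral = ds * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return p_move * integral

def closed_form(t, p_move=1.0):
    """Closed form of the same integral under the stated assumptions."""
    return -p_move * (1.0 - t) * np.log(1.0 - t)

for t in (0.1, 0.5, 0.9):
    print(t, prob_in_state_2(t), closed_form(t))
```

The two columns agree to high precision for t < 1; the value of `p_move` only rescales the curve and does not affect the presence of the logarithm.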
4.2. No optimal strategies of the form
In the following game with two non absorbing states the payoff is always 1 in state and -1 in state with the following deterministic transitions
[TABLE]
[TABLE]
It is easy to see that the asymptotic value is 0 and that optimal strategies in the -discounted game put a weight on and in both states, hence the absorbing probability is of the order of per stage in each state.
We show that strategies of the form , cannot guarantee more than -1 to player 1, as goes to 0.
- If , player 2 plays in and in inducing an absorbing payoff of -1.
From we eventually reach where has a positive probability.
- If , player plays in both games inducing an absorbing payoff of -1.
The payoff will be absorbing with high probability in finite time and the relative probability of vanishes with .
- If , player plays in both games inducing a non absorbing payoff of -1.
For small, most of the time the state is .
- If , player plays in game and in game . The event “absorbing payoff of 1” occurs at stage if and . Hence and . Now this event “ and ” has probability of order . Then the absorbing component of the discounted payoffs converges to 0 with . Moreover the non absorbing payoff is mainly .
5. An absorbing game with compact action sets and non linear LOTP
We consider the following absorbing game with compact action sets. There are three states, two absorbing and , and the non absorbing state , in which the payoff is 1 whatever the actions taken. The sets of actions are with the usual distance. The probabilities of absorption are given by:
[TABLE]
and
[TABLE]
It is easily checked that both functions and are (jointly) continuous.
Proposition 18.
For any discount factor , 0 (resp. 1) is optimal for Player 1 (resp. Player 2) in the -discounted game, and .
The corresponding payoff trajectory is: on .
Proof.
Action 0 of Player 1 ensures that there will never be absorption to state , and thus that the stage payoff from stage 2 on is nonnegative. Action 1 of Player 2 ensures that there will be absorption with probability 1 at the end of stage 1, and thus that the stage payoff from stage 2 on is nonpositive. Since the payoff in stage 1 is 1 irrespective of the players’ actions, the proposition is established.
Notice that the play under this couple of optimal strategies is simple: there is immediate absorption to , and in particular the limit payoff trajectory is linear and equals 0 for every time . ∎
We now prove that there are other -optimal strategies, with a different limit payoff trajectory. Denote where is the integer part ; hence and for all .
Proposition 19.
For any discount factor , is -optimal for Player 1 and -optimal for Player 2 in the -discounted game.
The corresponding payoff trajectory is: .
Proof.
If both players play , the payoff in the -discounted game is, according to formula (4),
[TABLE]
which is nonnegative since . On the other hand, since , one gets .
If Player 1 plays while Player 2 plays , there is no absorption to hence .
Thus is -optimal for Player 1.
If Player 2 plays while Player 1 plays , then, according once again to formula (4),
[TABLE]
Thus is -optimal for Player 2.
Notice that while the limit value is 0 and is a couple of near optimal strategies, along the induced play the nonabsorbing payoff is 1 and the absorbing payoff is -1. One can compute that the associated is 1, hence under these strategies . So that the accumulated limit payoff up to time is , which is non linear and positive for every . ∎
Basically the players use a jointly controlled procedure either to follow or to get at most (resp. at least) 0.
6. Concluding comments
A first series of interesting open questions is directly related to the results presented here, such as:
- extension of Proposition 12 to general (not stationary) strategies,
- or more generally analysis in the framework of arbitrary (not discounted) evaluations and general stochastic games.
It is also natural to consider other families of repeated games: a first class of interest is games with incomplete information. The natural equivalent of in this framework is the speed at which the information is transmitted during the game.
- [1] Aumann R.J. and M. Maschler: Repeated Games with Incomplete Information, M.I.T. Press, (1995).
- [2] Cardaliaguet P., R. Laraki and S. Sorin: A continuous time approach for the asymptotic value in two-person zero-sum repeated games. SIAM J. Control and Optimization, 50, 1573-1596, (2012).
- [3] Kohlberg E.: Repeated games with absorbing states. Annals of Statistics, 2, 724-738, (1974).
- [4] Laraki R.: Explicit formulas for repeated games with absorbing states. International Journal of Game Theory, 39, 53-69, (2010).
- [5] Mertens J.-F. and A. Neyman: Stochastic games. International Journal of Game Theory, 10, 53-66, (1981).
- [6] Mertens J.-F., A. Neyman and D. Rosenberg: Absorbing games with compact action spaces. Mathematics of Operations Research, 34, 257-262, (2009).
- [7] Mertens J.-F. and S. Zamir: The value of two-person zero-sum repeated games with lack of information on both sides. International Journal of Game Theory, 1, 39-64, (1971).
- [8] Oliu-Barton M. and B. Ziliotto: Constant payoff in zero-sum stochastic games, arXiv:1811.04518v1, (2018).
