Toward Solving 2-TBSG Efficiently
Zeyu Jia, Zaiwen Wen, Yinyu Ye

Zeyu Jia (a), Zaiwen Wen (b) and Yinyu Ye (c)
(a) School of Mathematical Science, Peking University, China; (b) Beijing International Center for Mathematical Research, Peking University, China; (c) Department of Management Science and Engineering, Stanford University, Stanford, CA, USA
Abstract
2-TBSG is a two-player game model which aims to find Nash equilibria and is widely utilized in reinforcement learning and AI. Inspired by the fact that the simplex method for solving deterministic discounted Markov decision processes (MDPs) is strongly polynomial independent of the discount factor, we try to answer the open problem of whether there is a similar algorithm for 2-TBSG. We develop a simplex strategy iteration, where one player updates its strategy with a simplex step while the other player finds an optimal counterstrategy in turn, and a modified simplex strategy iteration. Both of them belong to a class of geometrically converging algorithms. We establish the strongly polynomial property of these algorithms by considering a strategy combined from the current strategy and the equilibrium strategy. Moreover, we present a method to transform general 2-TBSGs into special 2-TBSGs where each state has exactly two actions.
keywords:
Markov Decision Process; 2-Player Turn Based Stochastic Game; Simplex Strategy Iteration; Strongly Polynomial Time
1 Introduction
The Markov decision process (MDP) is a widely used model in machine learning and operations research [1], which establishes the basic rules of reinforcement learning. While solving an MDP focuses on maximizing (minimizing) the total reward (cost) for only one player, we consider a broader class of problems, the 2-player turn based stochastic games (2-TBSG) [15], which involve two players with opposite objectives. One player aims to maximize the total reward, and the other player aims to minimize the total reward. MDP and 2-TBSG have many useful applications, see [8, 3, 12, 2, 5, 10, 16].
Similar to MDP, every 2-TBSG has its state set and action set, both of which are divided into two subsets for each player, respectively. Moreover, its transition probability matrix describes the transition distribution over the state set conditioned on the current action, and its reward function describes the immediate reward when taking the action.
We use a strategy (policy) to denote a mapping from the state set into the action set. In our setting, we focus on the discounted 2-TBSG, where rewards in later steps are scaled by powers of a discount factor. Given strategies (policies) for both players, the total reward is defined to be the sum of all discounted rewards. We solve a 2-TBSG by finding its Nash equilibrium strategy (equilibrium strategy for short), where the first player cannot change its own strategy to obtain a larger total reward, and the second player cannot change its own strategy to obtain a smaller total reward. MDP can be viewed as a special case of 2-TBSG, where all states belong to the first player. In such cases, the equilibrium strategy agrees with the optimal policy of the MDP.
MDPs admit linear programming (LP) formulations [3]. Hence algorithms for solving LP problems can be used to solve MDPs. One of the most commonly used algorithms for MDPs is the policy iteration algorithm [8], which can be viewed as a parallel counterpart of the simplex method applied to the corresponding LP. In [18], both the simplex method applied to the corresponding LP and the policy iteration algorithm are proved to find the optimal policy in $O\left(\frac{mn}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations, where $m$, $n$ and $\gamma$ are the number of actions, the number of states and the discount factor, respectively. Later in [7], the bound for the policy iteration algorithm is improved by a factor $n$ to $O\left(\frac{m}{1-\gamma}\log\frac{n}{1-\gamma}\right)$. In [14], this bound is improved to $O\left(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\right)$. When the MDP is deterministic (all transition probabilities are either $0$ or $1$), a strongly polynomial bound independent of the discount factor is proved in [11] for the simplex policy iteration method (each iteration changes only one action): $O(m^2 n^3 \log^2 n)$ for uniform discounted MDPs and $O(m^3 n^5 \log^2 n)$ for nonuniform discounted MDPs.
However, there is no simple LP formulation for 2-TBSGs. The strategy iteration algorithm [13], an analogue of policy iteration, is commonly used to find the equilibrium strategy of 2-TBSGs. It was first proved in [7] to be a strongly polynomial time algorithm, with a guarantee to find the equilibrium in $O\left(\frac{m}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations if the discount factor is fixed. When the discount factor is not fixed, an exponential lower bound is given for the policy iteration in MDP [4] and for the strategy iteration in 2-TBSG [6]. It is an open problem whether there is a strongly polynomial algorithm for 2-TBSG whose complexity is independent of the discount factor.
Motivated by the strongly polynomial simplex algorithm for solving MDPs, we present a simplex strategy iteration algorithm and a modified simplex strategy iteration algorithm for the 2-TBSG. In both algorithms each player updates in turn, where the second player always finds the best counterstrategy in its turn. In the simplex strategy iteration algorithm the first player updates its strategy using the simplex algorithm. In the modified simplex strategy iteration algorithm, the first player updates the action leading to the largest improvement after the second player finds the optimal counterstrategy. When the second player is trivial, the 2-TBSG becomes an MDP and the simplex strategy iteration algorithm can find its solution in strongly polynomial time independent of the discounted factor, which is a property not possessed by the strategy iteration algorithm in [7].
We also develop a proof technique to prove the strongly polynomial complexity for a class of geometrically converging algorithms. This class of algorithms includes the strategy iteration algorithm, the simplex strategy iteration algorithm, and the modified simplex strategy iteration algorithm. The complexity for the strategy iteration algorithm given in [7] can be recovered by our techniques. Our techniques use a combination of the current strategy and the equilibrium strategy. We establish a bound on the ratio between the difference in value from the current strategy to the equilibrium strategy, and the difference in value from the combined strategy to the equilibrium strategy. Using this bound and the geometrically converging property, we can prove that after a certain number of iterations, one action will disappear forever, which leads to strongly polynomial convergence when the discount factor is fixed. Although we have not fully answered the open problem, our algorithms and analysis point out a possible way of overcoming the difficulties.
Furthermore, a 2-TBSG where each state has exactly two actions can be transformed into a linear complementarity problem [9]. An MDP where each state has exactly two actions can be solved by a combinatorial interior point method [17]. In this paper we present a way to transform a general 2-TBSG into a 2-TBSG where each state has exactly two actions. The number of states in this constructed 2-TBSG is $\tilde{O}(m)$ (we use $\tilde{O}$ to hide log factors of $m$). This result enables the application of both results in [9, 17] to general cases.
The rest of this paper is organized as follows. In Section 2 we present some basic concepts and lemmas of the 2-TBSG. In Section 3 we describe the simplex strategy iteration algorithm and the modified simplex strategy iteration algorithm. The proof of complexity of the class of geometrically converging algorithms is given in Section 4. The transformation from general 2-TBSGs into special 2-TBSGs is introduced in Section 5.
2 Preliminaries
In this section, we present some basic concepts of 2-TBSG. Our focus here is on the discounted 2-TBSG, defined as follows.
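Definition 2.1.
A discounted 2-TBSG (2-TBSG for short) consists of a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2$ and $\mathcal{A} = \mathcal{A}_1 \cup \mathcal{A}_2$. $\mathcal{S}_1, \mathcal{S}_2$ and $\mathcal{A}_1, \mathcal{A}_2$ are the state set and the action set of each player, respectively. $P \in \mathbb{R}^{m \times n}$ is the transition probability matrix, where $P_{a,s}$ denotes the probability of the event that the next state is $s$ conditioned on the current action $a$. $r \in \mathbb{R}^m$ is the reward vector, where $r_a$ denotes the immediate reward received using action $a$. For convenience, we use $m = |\mathcal{A}|$ to denote the number of actions, and $n = |\mathcal{S}|$ to denote the number of states.
Given a state $s \in \mathcal{S}$ in the 2-TBSG setting, we use $\mathcal{A}_s$ to denote the set of available actions corresponding to state $s$. A deterministic strategy (strategy for short) $\pi = (\pi_1, \pi_2)$ is defined such that $\pi_1, \pi_2$ are mappings from $\mathcal{S}_1$ to $\mathcal{A}_1$ and from $\mathcal{S}_2$ to $\mathcal{A}_2$, respectively. Moreover, each state $s$ is mapped to an action $\pi(s)$ in $\mathcal{A}_s$.
For a given strategy $\pi$, we define the transition probability matrix $P_\pi \in \mathbb{R}^{n \times n}$ and reward function $r_\pi \in \mathbb{R}^n$ with respect to $\pi$. The $s$-th row of $P_\pi$ is chosen to be the row of action $\pi(s)$ in $P$, and the $s$-th element of $r_\pi$ is chosen to be the reward of action $\pi(s)$. It is easy to observe that the matrix $P_\pi$ is a stochastic matrix. We next define the value vector and the modified reward function.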
Definition 2.2.
The value vector $v^\pi \in \mathbb{R}^n$ of a given strategy $\pi$ is
$$v^\pi = (I - \gamma P_\pi)^{-1} r_\pi.$$
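As a minimal numerical sketch of Definition 2.2, the value vector can be obtained by solving the linear system $(I - \gamma P_\pi) v = r_\pi$ rather than forming the inverse. The two-state toy game, the array layout (one row of `P` per action) and the helper name `value_vector` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def value_vector(P, r, policy, gamma):
    """Value vector of a strategy: solves (I - gamma * P_pi) v = r_pi."""
    P_pi = P[policy]   # row s is the transition row of the action chosen at state s
    r_pi = r[policy]   # entry s is the reward of the action chosen at state s
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Hypothetical toy game: actions 0, 1 belong to state 0; actions 2, 3 to state 1.
P = np.array([[0.5, 0.5],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.2, 0.8]])
r = np.array([1.0, 2.0, 0.0, 0.5])
gamma = 0.9

policy = np.array([1, 3])   # state 0 plays action 1, state 1 plays action 3
v = value_vector(P, r, policy, gamma)
print(v)
```

The returned vector satisfies the fixed-point relation $v = r_\pi + \gamma P_\pi v$, which gives a quick sanity check on any implementation.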
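Definition 2.3.
The modified reward function $r^\pi \in \mathbb{R}^m$ of a given strategy $\pi$ is defined as
$$r^\pi = r - (J - \gamma P) v^\pi,$$
where $J \in \mathbb{R}^{m \times n}$ is defined as
$$J_{a,s} = \begin{cases} 1, & a \in \mathcal{A}_s, \\ 0, & \text{otherwise}. \end{cases}$$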
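The modified reward (reduced cost) can be sketched in a few lines, assuming the form $r^\pi = r - (J - \gamma P) v^\pi$ with $J$ the action-to-state indicator matrix. The toy data and the names `modified_reward` and `action_state` are hypothetical and only serve the illustration.

```python
import numpy as np

def modified_reward(P, r, action_state, policy, gamma):
    """Reduced costs r - (J - gamma * P) v for the value vector v of `policy`."""
    m, n = P.shape
    v = np.linalg.solve(np.eye(n) - gamma * P[policy], r[policy])
    J = np.zeros((m, n))
    J[np.arange(m), action_state] = 1.0   # J[a, s] = 1 iff action a belongs to state s
    return r - (J - gamma * P) @ v

# Hypothetical toy game: actions 0, 1 belong to state 0; actions 2, 3 to state 1.
P = np.array([[0.5, 0.5],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.2, 0.8]])
r = np.array([1.0, 2.0, 0.0, 0.5])
action_state = np.array([0, 0, 1, 1])
gamma = 0.9

policy = np.array([0, 2])
rbar = modified_reward(P, r, action_state, policy, gamma)
print(rbar)
```

Actions used by the strategy itself always have modified reward zero; a positive entry at a state of the maximizing player marks a switch that would improve the value.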
Furthermore, for a given 2-TBSG, the optimal counterstrategy against another player’s given strategy is defined in Definition 2.4. The equilibrium strategy is given in Definition 2.5.
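Definition 2.4.
For player 2’s strategy $\pi_2$, player 1’s strategy $\pi_1$ is the optimal counterstrategy against $\pi_2$ if and only if for any strategy $\pi_1'$ of player 1, we have
$$v^{\pi_1, \pi_2} \ge v^{\pi_1', \pi_2}.$$
Player 2’s optimal counterstrategy can be defined similarly: $\pi_2$ is the optimal counterstrategy against $\pi_1$ if and only if for any strategy $\pi_2'$ of player 2, $v^{\pi_1, \pi_2} \le v^{\pi_1, \pi_2'}$. Here for two value vectors $v, v'$, we say $v \ge v'$ ($v \le v'$) if and only if $v(s) \ge v'(s)$ ($v(s) \le v'(s)$) for all $s \in \mathcal{S}$.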
Definition 2.5.
A strategy $\pi = (\pi_1, \pi_2)$ is called an equilibrium strategy if and only if $\pi_1$ is the optimal counterstrategy against $\pi_2$, and $\pi_2$ is the optimal counterstrategy against $\pi_1$.
To describe the property of equilibrium strategies, we present Theorems 2.6 and 2.7 given in [7, 15]. Theorem 2.6 indicates the existence of an equilibrium strategy.
Theorem 2.6.
Every 2-TBSG has at least one equilibrium strategy. If $\pi$ and $\pi'$ are two equilibrium strategies, then $v^{\pi} = v^{\pi'}$. Furthermore, for any player 1’s strategy $\pi_1$ (or player 2’s strategy $\pi_2$), there always exists a player 2’s optimal counterstrategy against $\pi_1$ (player 1’s optimal counterstrategy against $\pi_2$), and for any two optimal counterstrategies $\pi_2, \pi_2'$ ($\pi_1, \pi_1'$), we have $v^{\pi_1, \pi_2} = v^{\pi_1, \pi_2'}$ ($v^{\pi_1, \pi_2} = v^{\pi_1', \pi_2}$).
The next theorem points out a useful depiction of the value function at the equilibrium.
Theorem 2.7.
Let $\pi^* = (\pi_1^*, \pi_2^*)$ be an equilibrium strategy for 2-TBSG. If $\pi_1$ is a strategy of player 1, and $\pi_2$ is player 2’s optimal counterstrategy against $\pi_1$, then we have $v^{\pi^*} \ge v^{\pi_1, \pi_2}$. The equality holds if and only if $(\pi_1, \pi_2)$ is an equilibrium strategy.
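We now define the flux vector of a given strategy $\pi$.
Definition 2.8.
The flux $x^\pi \in \mathbb{R}^n$ of a given strategy $\pi$ is defined as
$$x^\pi = (I - \gamma P_\pi^T)^{-1} e,$$
where $e \in \mathbb{R}^n$ denotes the all-ones vector.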
Our next lemma presents bounds and conditions of the flux vector, and the relationship among the value function, the flux vector and reduced costs. This lemma and the following several lemmas can be found in [7]. To make the paper self-contained, we briefly give their proofs.
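Lemma 2.9.
For any strategy $\pi$, we have
1. $\sum_{s \in \mathcal{S}} x^\pi(s) = \frac{n}{1-\gamma}$;
2. for any $s \in \mathcal{S}$, $1 \le x^\pi(s) \le \frac{n}{1-\gamma}$;
3. ;
4. , and moreover, .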
Proof.
Item (1) is proved by
[TABLE]
Item (2) is due to
[TABLE]
This indicates that , . Hence we have and from item (1). Finally the last two items are obtained from
[TABLE]
and
[TABLE]
∎
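The flux bounds can be checked numerically on a small instance. The sketch below assumes the flux is $x^\pi = (I - \gamma P_\pi^T)^{-1} e$ with $e$ the all-ones vector, and also checks the standard identity $e^T (v^{\pi'} - v^{\pi}) = (r^{\pi}_{\pi'})^T x^{\pi'}$ relating values, flux and reduced costs; the toy game and all helper names are hypothetical.

```python
import numpy as np

def value(P, r, pi, gamma):
    """Value vector: solves (I - gamma * P_pi) v = r_pi."""
    n = P.shape[1]
    return np.linalg.solve(np.eye(n) - gamma * P[pi], r[pi])

def flux(P, pi, gamma):
    """Flux vector: solves (I - gamma * P_pi^T) x = e."""
    n = P.shape[1]
    return np.linalg.solve(np.eye(n) - gamma * P[pi].T, np.ones(n))

# Hypothetical toy game: actions 0, 1 belong to state 0; actions 2, 3 to state 1.
P = np.array([[0.5, 0.5],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.2, 0.8]])
r = np.array([1.0, 2.0, 0.0, 0.5])
action_state = np.array([0, 0, 1, 1])
gamma = 0.9
n = P.shape[1]

pi = np.array([0, 2])
pi_new = np.array([1, 3])
v, v_new = value(P, r, pi, gamma), value(P, r, pi_new, gamma)
x_new = flux(P, pi_new, gamma)

# Reduced costs of pi; (J v)[a] is simply v at the state owning action a.
rc = r - v[action_state] + gamma * P @ v
lhs = (v_new - v).sum()        # e^T (v^{pi'} - v^{pi})
rhs = rc[pi_new] @ x_new       # reduced costs of pi on pi' actions, weighted by flux
print(x_new.sum(), n / (1 - gamma), lhs, rhs)
```

Because every flux entry is at least 1, a strictly positive reduced cost on a switched action translates directly into a strictly positive gain in the summed value.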
In the following, we present a lemma indicating the positiveness or negativeness of the reduced costs of optimal counterstrategies and equilibrium strategies.
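Lemma 2.10.
1. A strategy $\pi_1$ for player 1 is an optimal counterstrategy against player 2’s strategy $\pi_2$ if and only if $r^{\pi_1, \pi_2}_a \le 0$ for all $a \in \mathcal{A}_1$.
2. A strategy $\pi_2$ for player 2 is an optimal counterstrategy against player 1’s strategy $\pi_1$ if and only if $r^{\pi_1, \pi_2}_a \ge 0$ for all $a \in \mathcal{A}_2$.
3. A strategy $\pi = (\pi_1, \pi_2)$ is an equilibrium strategy if and only if it satisfies:
$$r^{\pi}_a \le 0 \ \ \forall a \in \mathcal{A}_1, \qquad r^{\pi}_a \ge 0 \ \ \forall a \in \mathcal{A}_2.$$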
Proof.
If satisfies , then for any player 1’s strategy , we have
[TABLE]
where the last inequality follows from for and .
Suppose that player 1’s strategy is the optimal counterstrategy against player 2’s strategy . For any , and , we let
[TABLE]
Then again from Lemma 2.9 (4), we have
[TABLE]
where the inequality comes from the definition of equilibrium strategies. Since , we have , which indicates that . With this estimation and for , we have proved that . Hence, item (1) is established, and the proof of item (2) is similar. Finally item (3) follows from items (1) and (2) directly. ∎
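The sign conditions give a cheap equilibrium test that needs only one linear solve. The sketch below assumes player 1 maximizes and player 2 minimizes, so a strategy passes when no player 1 action has positive reduced cost and no player 2 action has negative reduced cost. The toy game, the `owner` array and the function names are illustrative assumptions.

```python
import itertools
import numpy as np

def reduced_costs(P, r, action_state, pi, gamma):
    """Modified rewards r_a - v(s_a) + gamma * P_a v for the value vector v of pi."""
    n = P.shape[1]
    v = np.linalg.solve(np.eye(n) - gamma * P[pi], r[pi])
    return r - v[action_state] + gamma * P @ v

def is_equilibrium(P, r, action_state, owner, pi, gamma, tol=1e-9):
    """Sign test: player 1 actions need rc <= 0, player 2 actions need rc >= 0."""
    rc = reduced_costs(P, r, action_state, pi, gamma)
    p1 = owner[action_state] == 1
    return bool(np.all(rc[p1] <= tol) and np.all(rc[~p1] >= -tol))

# Hypothetical toy game: state 0 (player 1) owns actions 0, 1;
# state 1 (player 2) owns actions 2, 3.
P = np.array([[0.5, 0.5],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.2, 0.8]])
r = np.array([1.0, 2.0, 0.0, 0.5])
action_state = np.array([0, 0, 1, 1])
owner = np.array([1, 2])
gamma = 0.9

# Enumerate all four deterministic strategies and keep those passing the test.
equilibria = [
    (a0, a1)
    for a0, a1 in itertools.product([0, 1], [2, 3])
    if is_equilibrium(P, r, action_state, owner, np.array([a0, a1]), gamma)
]
print(equilibria)
```

Enumeration is exponential in general and is used here only to show that the sign test singles out one strategy on this instance.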
3 Geometrically Converging Algorithms
Inspired by the simplex method solving the LP corresponding to the MDP and the strategy iteration algorithm given in [7], we propose a simplex strategy iteration (Algorithm 1) and a modified simplex strategy iteration algorithm (Algorithm 2) for 2-TBSG.
The simplex strategy iteration algorithm can be viewed as a generalization of the strongly polynomial simplex algorithm for solving MDPs [11]. In our algorithm, both players update their strategies in turn. In each iteration, the first player updates its strategy using the simplex method, which means switching only the action with the largest reduced cost, while the second player updates its strategy to the optimal counterstrategy. When the second player has only one possible action and the transition matrix is deterministic, the 2-TBSG reduces to a deterministic MDP. Then the simplex strategy iteration algorithm finds an equilibrium (optimal) strategy in strongly polynomial time independent of $\gamma$, a property that has not been proven for the strategy iteration [7].
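The description above can be sketched as follows. This is a minimal sketch under stated assumptions, not a transcription of Algorithm 1: player 1 is taken to be the maximizer, player 2's optimal counterstrategy is computed by plain policy iteration on the MDP obtained by fixing player 1's actions, and the toy game and all function names are hypothetical.

```python
import numpy as np

def reduced_costs(P, r, act_state, pi, gamma):
    n = P.shape[1]
    v = np.linalg.solve(np.eye(n) - gamma * P[pi], r[pi])
    return r - v[act_state] + gamma * P @ v

def best_response_p2(P, r, act_state, owner, pi, gamma, tol=1e-10):
    """Player 2 (minimizer) runs policy iteration with player 1's actions fixed."""
    pi = pi.copy()
    while True:
        rc = reduced_costs(P, r, act_state, pi, gamma)
        improved = False
        for s in np.flatnonzero(owner == 2):
            acts = np.flatnonzero(act_state == s)
            best = acts[np.argmin(rc[acts])]
            if rc[best] < -tol:          # strictly better action for the minimizer
                pi[s] = best
                improved = True
        if not improved:
            return pi

def simplex_strategy_iteration(P, r, act_state, owner, pi0, gamma, tol=1e-10):
    pi = best_response_p2(P, r, act_state, owner, pi0, gamma)
    p1_actions = np.flatnonzero(owner[act_state] == 1)
    iters = 0
    while True:
        rc = reduced_costs(P, r, act_state, pi, gamma)
        a = p1_actions[np.argmax(rc[p1_actions])]   # largest reduced cost
        if rc[a] <= tol:                            # no improving switch is left
            return pi, iters
        pi[act_state[a]] = a                        # simplex step: switch one action
        pi = best_response_p2(P, r, act_state, owner, pi, gamma)
        iters += 1

# Hypothetical toy game: state 0 (player 1) owns actions 0, 1;
# state 1 (player 2) owns actions 2, 3.
P = np.array([[0.5, 0.5], [0.0, 1.0], [1.0, 0.0], [0.2, 0.8]])
r = np.array([1.0, 2.0, 0.0, 0.5])
act_state = np.array([0, 0, 1, 1])
owner = np.array([1, 2])

pi_star, iters = simplex_strategy_iteration(
    P, r, act_state, owner, np.array([0, 2]), gamma=0.9)
print(pi_star, iters)
```

Using policy iteration for the inner MDP keeps the sketch short; any exact MDP solver could be substituted for `best_response_p2` without changing the outer loop.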
The modified simplex strategy iteration algorithm can be viewed as a modification of the simplex strategy iteration algorithm. In this algorithm, both players also update their strategies in turn, and the second player always finds the optimal counterstrategy in its moves. However, in each of the first player’s moves, the only action updated is the one that leads to the largest improvement in the value function when the second player responds with the optimal counterstrategy.
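A possible reading of this update rule is sketched below. Several choices are assumptions made for the illustration rather than details confirmed by the text: candidates are restricted to player 1 actions with positive reduced cost, the improvement is measured by the sum of the value vector after player 2's best response, and the inner best response uses policy iteration. All names and the toy game are hypothetical.

```python
import numpy as np

def value(P, r, pi, gamma):
    n = P.shape[1]
    return np.linalg.solve(np.eye(n) - gamma * P[pi], r[pi])

def reduced_costs(P, r, act_state, pi, gamma):
    v = value(P, r, pi, gamma)
    return r - v[act_state] + gamma * P @ v

def best_response_p2(P, r, act_state, owner, pi, gamma, tol=1e-10):
    """Player 2 (minimizer) runs policy iteration with player 1's actions fixed."""
    pi = pi.copy()
    while True:
        rc = reduced_costs(P, r, act_state, pi, gamma)
        improved = False
        for s in np.flatnonzero(owner == 2):
            acts = np.flatnonzero(act_state == s)
            best = acts[np.argmin(rc[acts])]
            if rc[best] < -tol:
                pi[s] = best
                improved = True
        if not improved:
            return pi

def modified_simplex_strategy_iteration(P, r, act_state, owner, pi0, gamma, tol=1e-10):
    pi = best_response_p2(P, r, act_state, owner, pi0, gamma)
    p1_actions = np.flatnonzero(owner[act_state] == 1)
    while True:
        rc = reduced_costs(P, r, act_state, pi, gamma)
        candidates = [a for a in p1_actions if rc[a] > tol]
        if not candidates:               # no improving switch: equilibrium reached
            return pi
        best_pi, best_total = None, -np.inf
        for a in candidates:             # one MDP solve per candidate switch
            trial = pi.copy()
            trial[act_state[a]] = a
            trial = best_response_p2(P, r, act_state, owner, trial, gamma)
            total = value(P, r, trial, gamma).sum()
            if total > best_total:
                best_pi, best_total = trial, total
        pi = best_pi

# Hypothetical toy game: state 0 (player 1) owns actions 0, 1;
# state 1 (player 2) owns actions 2, 3.
P = np.array([[0.5, 0.5], [0.0, 1.0], [1.0, 0.0], [0.2, 0.8]])
r = np.array([1.0, 2.0, 0.0, 0.5])
act_state = np.array([0, 0, 1, 1])
owner = np.array([1, 2])

pi_star = modified_simplex_strategy_iteration(
    P, r, act_state, owner, np.array([0, 2]), gamma=0.9)
print(pi_star)
```

The per-iteration cost is higher than in the simplex variant, since every candidate switch triggers its own best-response computation.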
It is easy to see that every iteration of the simplex strategy iteration algorithm involves one simplex update and the solution of an MDP, and that every iteration of the modified simplex strategy iteration algorithm involves the solutions of multiple MDPs. Hence every iteration of both algorithms can be carried out in strongly polynomial time when the discount factor is fixed.
Next we present a class of geometrically converging algorithms used for proving the strongly polynomial complexity for several algorithms in the next section.
Definition 3.1.
We say a strategy-update algorithm (an algorithm which updates the strategies of both players in each iteration) is a geometrically converging algorithm with a parameter , if it updates a strategy to such that the following properties hold.
- is the optimal counterstrategy against ;
- ;
- If , then is an equilibrium strategy;
- The updates of this algorithm satisfy
[TABLE]
To begin with, we exhibit a lemma indicating the geometrically converging property of the value function in the simplex strategy iteration algorithm.
Lemma 3.2.
Suppose the sequence of strategies generated by the simplex strategy iteration algorithm is . Then the following inequality holds
[TABLE]
Proof.
According to Algorithm 1, we have
[TABLE]
where the second and third inequalities follow from Lemma 2.9 (2) and the choice of , the fourth inequality follows from Lemma 2.10, and the first inequality and last equation are due to Lemma 2.9 (4) and Lemma 2.10. ∎
Using this lemma, we show in the next proposition that the strategy iteration algorithm, Algorithm 1 and Algorithm 2 all belong to the class of geometrically converging algorithms.
Proposition 3.3.
1. The strategy iteration algorithm given in [7] is a geometrically converging algorithm with parameter ;
2. The simplex strategy iteration algorithm (Algorithm 1) is a geometrically converging algorithm with parameter ;
3. The modified simplex strategy iteration algorithm (Algorithm 2) is a geometrically converging algorithm with parameter .
Proof.
It is easy to verify that the three algorithms described above satisfy the first three conditions in the definition of geometrically converging algorithms. Next, we prove that all of these algorithms satisfy the last condition. For the strategy iteration algorithm, according to Lemma 4.8 and Lemma 5.4 given in [7], we have
[TABLE]
Hence if ( is a constant), then we obtain
[TABLE]
and the last condition of geometrically converging algorithms is verified.
For the simplex strategy iteration algorithm, if we choose ( is a constant), then according to inequality (1) we have
[TABLE]
and the last condition of geometrically converging algorithms is verified.
Finally we consider the modified simplex strategy iteration algorithm. For , let , where is an action of state . Let
[TABLE]
be player 2’s optimal counterstrategy against , and . Then from inequality (1), we have
[TABLE]
According to Algorithm 2, we have
[TABLE]
which leads to the following estimation:
[TABLE]
Therefore, similar to the previous case we can choose such that
[TABLE]
and the last condition of geometrically converging algorithms is verified. ∎
4 Strongly Polynomial Complexity of Geometrically Converging Algorithms
In this section, we establish the strongly polynomial property of geometrically converging algorithms when the parameter is viewed as a constant. Slightly different from the proof in [7] for the strategy at the -th iteration, we present a proof by considering the strategy , where is an equilibrium strategy. We show that can be both upper and lower bounded by some proportion of . By applying the property of geometrically converging algorithms, we obtain that after a certain number of iterations, one of player 1’s actions will disappear in forever.
Theorem 4.1.
Any geometrically converging algorithm with a parameter finds the equilibrium strategy in
[TABLE]
number of iterations.
Proof.
Suppose is the sequence generated by a geometrically converging algorithm. We define , where is one of the equilibrium strategies.
According to Lemma 2.10 and the fact that is the optimal counterstrategy against , and the definition of geometrically converging algorithm, we have
[TABLE]
which directly leads to
[TABLE]
According to Lemma 2.10, we have
[TABLE]
which implies
[TABLE]
We next prove the following inequality:
[TABLE]
A direct calculation gives
[TABLE]
where the last inequality is obtained from Lemma 2.10. Then noticing that
[TABLE]
we have
[TABLE]
Then the inequality (4) is proved.
Finally, we prove that for any , either there exists an action in that will never belong to when , or we have
[TABLE]
Actually for any , suppose , we obtain
[TABLE]
from (2) and the definition of geometrically converging algorithm. Hence according to (3) and (4), we get
[TABLE]
Therefore, choosing , and because for any , according to Lemma 2.10, we obtain
[TABLE]
from Lemma 2.9. If , we have
[TABLE]
where the first inequality is due to Lemma 2.10 and the second inequality is due to Lemma 2.9. Therefore, combining these two inequalities and the inequality (5) and noticing that , we get
[TABLE]
This leads to contradiction.
The previous derivation means that if does not hold for , then an action of must disappear after forever. Hence every after iterations an action will disappear forever. This process cannot happen for more than times (since there are actions and every strategy has actions), which indicates that for some ,
[TABLE]
It follows from the definition of geometrically converging algorithm that is the equilibrium strategy. This indicates that within
[TABLE]
number of iterations, we can find one of the equilibrium strategies. ∎
Our next theorem presents the complexity of the strategy iteration algorithm, the simplex strategy iteration algorithm and the modified simplex strategy iteration algorithm.
Theorem 4.2.
The following algorithms have strongly polynomial convergence when the discount factor is fixed.
- The strategy iteration algorithm given in [7] can find the equilibrium strategy within iterations;
- The simplex strategy iteration algorithm (Algorithm 1) can find the equilibrium strategy within iterations;
- The modified simplex strategy iteration algorithm (Algorithm 2) can find the equilibrium strategy within iterations.
Proof.
The proof of this theorem directly follows from Theorem 4.1 and Proposition 3.3. ∎
Remark 1.
It is easy to see that the termination condition of the simplex strategy iteration algorithm and of the modified simplex strategy iteration algorithm is equivalent to reaching an equilibrium strategy. Hence the above theorem also indicates that these two algorithms terminate within iterations.
5 Transform General 2-TBSGs into Special 2-TBSGs
We prove in this section that every 2-TBSG can be transformed into a new 2-TBSG where each state has exactly two actions. A formal description is given in the next theorem.
Theorem 5.1.
Given a 2-TBSG with $n$ states and $m$ actions whose state set is , we can construct a new 2-TBSG with state set satisfying the following properties.
- The number of states in the constructed 2-TBSG is bounded by a polynomial of $n$ and $m$:
[TABLE]
- and the value function at the equilibrium of the constructed 2-TBSG satisfies:
[TABLE]
where is the equilibrium value function of the original 2-TBSG, and
[TABLE]
Proof.
Our proof consists of two parts. In the first part, we construct a new 2-TBSG where each state has no more than two actions, and the value function at equilibrium of the original 2-TBSG can be easily obtained from the equilibrium value of the constructed 2-TBSG (it is proportional to the value at some states in the constructed 2-TBSG). In the second part, we modify the constructed 2-TBSG so that each state has exactly two actions, while keeping the equilibrium value unchanged, by constructing an obviously undesirable action for those states with only one action.
We first construct a binary tree rooted at with exactly leaves, and the depth of the tree is exactly . This tree is called the depth- binary tree of state :
- In the first layers, each node has only one child.
- In the last layers, it is a binary tree with exactly leaves.
- Every leaf has depth .
Each node other than the root and the leaves in the depth- binary tree of is assigned a new state whose owner is the same as that of state (player 1 or player 2). We use to denote the set of states in the first layers, and to denote the set of states in the -th layer. The parameters (transition probabilities, rewards, discount factor) are given as follows:
- For each state in , one or two actions are assigned to it depending on how many child states (its children in the binary tree) it has, each leading to a child state with probability 1 and reward 0.
- For the set , each of the child nodes is assigned an action of in the original 2-TBSG. This can be done since the total number of child nodes of is exactly . For each state in , its actions are given by its child nodes. The transition probability and reward of taking that action are assigned to be the same as in the original 2-TBSG.
- The discount factor in the constructed 2-TBSG is given by .
A special case of can be viewed in Figure 1 when .
It is easy to obtain that the number of states in the constructed 2-TBSG is no more than
[TABLE]
We next present a definition of final actions and the executing path of a state.
Definition 5.2.
For a given strategy in the constructed 2-TBSG cases and , we continue the following process:
- ;
- If is a constructed action (not an action in the original 2-TBSG), then we let be the state obtained by executing action . Since all constructed actions are deterministic, there is only one choice of . Then let .
- If is an action in the original 2-TBSG, then we stop this process, and call the final action of , and the path the executed path from to action .
Notice that the process described above must end within steps, and all states in the executed path of must lie in the depth- binary tree of . For any state and , there exists a unique executed path from to . In Figure 1 we present an example of final actions and executed paths. When the strategy follows bold arrows, the final action of will be , and the executed path from to is .
Based on the final actions, we define the corresponding strategy with respect to in the original 2-TBSG: for each state , is defined to be the final action of in . Next, we prove that for any state , the value of in strategy agrees with times the value of in strategy . Actually, along the trajectory of , we meet a final action every steps, and only final actions have nonzero rewards. Hence values of in and satisfy
[TABLE]
where denotes actions along strategy , and denotes actions along strategy .
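The proportionality between the two values can be illustrated on a single reward stream. The sketch rests on assumptions that are our reading of the construction rather than statements recovered from the text: the tree depth is $k$, the constructed game uses the per-step discount $\gamma^{1/k}$, and each final action is reached after $k-1$ zero-reward tree moves. Under these assumptions the constructed value equals $\gamma^{(k-1)/k}$ times the original value. The function names are hypothetical.

```python
import math

def original_value(rewards, gamma):
    """Discounted sum when one original action is played per step."""
    return sum(gamma ** t * rew for t, rew in enumerate(rewards))

def constructed_value(rewards, gamma, k):
    """Same reward stream when each original action follows k - 1 zero-reward tree moves."""
    g = gamma ** (1.0 / k)   # assumed per-step discount in the constructed game
    # The t-th final action is executed at step t * k + (k - 1).
    return sum(g ** (t * k + k - 1) * rew for t, rew in enumerate(rewards))

rewards = [2.0, 0.5, 2.0, 0.5, 1.0]   # hypothetical finite reward stream
gamma, k = 0.9, 3
ratio = constructed_value(rewards, gamma, k) / original_value(rewards, gamma)
print(ratio, gamma ** ((k - 1) / k))
```

The ratio does not depend on the reward stream, which is why an equilibrium comparison in the constructed game carries over to the original game.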
What is left in the proof is to show that if is an equilibrium strategy in the constructed 2-TBSG, then is an equilibrium strategy in the original 2-TBSG. For any player 1’s state and action , we use to denote the strategy in the original 2-TBSG:
[TABLE]
In the constructed 2-TBSG, there exists a unique executed path from to action , and for any state on this path , there is only one action in such that the next state when using also lies on . We define player 1’s strategy as follows:
[TABLE]
where can be chosen arbitrarily. Then it is easy to examine that is the corresponding strategy of . Since is an equilibrium strategy of the constructed 2-TBSG, we have
[TABLE]
where the inequality is due to the property of equilibrium strategy. Furthermore, according to Lemma 2.9 and Lemma 2.10, we have , where . This indicates that . Since can be chosen arbitrarily, we have , and similarly . Again according to Lemma 2.10, we obtain that is an equilibrium strategy of the original 2-TBSG.
Next, we handle states with only one action. For each such state, our technique is to construct another action which obviously cannot appear in an equilibrium strategy. For state with only one action , we assign it another action :
- The transition probability using action is identical to .
- If state belongs to player 1, then the reward of is assigned to be smaller than the reward of .
- If state belongs to player 2, then the reward of is assigned to be larger than the reward of .
If we construct actions in such ways, it is obvious that action is inferior to according to its owner (player 1 or player 2). Hence any strategy which possesses is not an equilibrium strategy, since switching action into leads to a better strategy for its owner. Combining these two parts together proves Theorem 5.1. ∎
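The padding step for single-action states can be sketched as below, assuming player 1 maximizes and player 2 minimizes. The reward offset `margin`, the array layout and the function name are hypothetical choices; any strictly positive offset serves the same purpose.

```python
import numpy as np

def pad_single_action_states(P, r, action_state, owner, margin=1.0):
    """Give every single-action state a second, strictly dominated action."""
    P_rows, rewards, states = list(P), list(r), list(action_state)
    counts = np.bincount(action_state, minlength=P.shape[1])
    for s in np.flatnonzero(counts == 1):
        a = int(np.flatnonzero(action_state == s)[0])
        # Lower reward hurts the maximizer (player 1), higher reward hurts the minimizer.
        sign = -1.0 if owner[s] == 1 else 1.0
        P_rows.append(P[a].copy())            # identical transition distribution
        rewards.append(r[a] + sign * margin)
        states.append(int(s))
    return np.array(P_rows), np.array(rewards), np.array(states)

# Hypothetical toy game: state 0 (player 1) has actions 0, 1;
# state 1 (player 2) has the single action 2.
P = np.array([[0.5, 0.5],
              [0.0, 1.0],
              [1.0, 0.0]])
r = np.array([1.0, 2.0, 0.0])
action_state = np.array([0, 0, 1])
owner = np.array([1, 2])

P2, r2, action_state2 = pad_single_action_states(P, r, action_state, owner)
print(P2.shape, r2, action_state2)
```

Since the new action has the same transition row and a strictly worse reward for its owner, switching back to the original action always improves the owner's value, so the equilibrium value is unchanged.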
Remark 2.
Since it is easy to obtain the equilibrium strategy from the equilibrium value and vice versa, we can solve the original 2-TBSG by solving the constructed 2-TBSG.
6 Conclusion
In this paper, we propose two different algorithms for 2-TBSG with strongly polynomial complexity: the simplex strategy iteration algorithm and the modified simplex strategy iteration algorithm. We propose a class of geometrically converging algorithms and develop a proof technique to prove the strongly polynomial complexity when the discount factor is fixed. Furthermore, we show how to transform a general 2-TBSG into a special 2-TBSG where each state has exactly two actions. In particular, our simplex strategy iteration algorithm coincides with the simplex method in the MDP case. These analyses and properties shed some light on the open problem of solving the deterministic 2-TBSG in strongly polynomial time independent of the discount factor.
References
- [1] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
- [2] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
- [3] Cyrus Derman. Finite State Markov Decision Processes, 1970.
- [4] John Fearnley. Exponential lower bounds for policy iteration. In International Colloquium on Automata, Languages, and Programming, pages 551–562. Springer, 2010.
- [5] Jerzy Filar and Koos Vrieze. Competitive Markov Decision Processes. Springer Science & Business Media, 2012.
- [6] Oliver Friedmann. An exponential lower bound for the parity game strategy improvement algorithm as we know it. In 2009 24th Annual IEEE Symposium on Logic In Computer Science, pages 145–156. IEEE, 2009.
- [7] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM, 60(1):1:1–1:16, February 2013. ISSN 0004-5411. doi:10.1145/2432622.2432623.
- [8] Ronald A. Howard. Dynamic Programming and Markov Processes, 1960.
