A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games
Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

TL;DR
This paper introduces a generalized minimax Q-learning algorithm for two-player zero-sum stochastic games, extending successive relaxation techniques to improve computation speed and proving its convergence without known model information.
Contribution
The paper develops a novel generalized minimax Q-learning algorithm for zero-sum games, extending successive relaxation methods and providing convergence proof under stochastic approximation.
Findings
Faster computation of min-max values under certain game structures
Convergence of the proposed algorithm is proven using stochastic approximation
Experimental results demonstrate the algorithm's effectiveness
[Table I: average error of each algorithm for games with 10, 20, and 50 states; cell values not recoverable.]
The authors are with the Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560012, India (e-mails: [email protected]; [email protected]; [email protected]).
Abstract
We consider the problem of two-player zero-sum games. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the min-max payoff, starting from a given state is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation that has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games. We show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques, under an assumption on the boundedness of iterates. Through experiments, we demonstrate the effectiveness of our proposed algorithm.
0018-9286 ©2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. A version of this paper is accepted for publication at the IEEE Transactions on Automatic Control. DOI: 10.1109/TAC.2022.3159453
I Introduction
In two-player zero-sum games, two agents compete against each other in a common environment. Based on the actions taken by the agents, they receive a payoff corresponding to the current state and the environment transitions to the next state. The objective of one agent (say agent 1) is to compute a sequence of actions starting from a given state to maximize the total discounted payoff. On the other hand, the objective of the second agent (agent 2) is to compute a sequence of actions that minimizes the total discounted payoff. This problem is formulated as a Markov game, and the value that is obtained as the min-max of the total expected discounted payoff starting from state $i$ is called the min-max value of state $i$. The policies that achieve this min-max value are the optimal policies of the agents.
When the model information of the environment is known, a Bellman operator for the two-player zero-sum game [1] is constructed and a fixed point iteration scheme analogous to the value iteration is used to compute the min-max value. However, in most two-player zero-sum game settings, the model information is assumed unknown to the players and the objective is to compute optimal policies utilizing the state and payoff samples obtained from the environment.
In our work, we construct a modified min-max Q-Bellman operator by using the technique of successive relaxation for the Markov games and prove that the contraction factor is at most the contraction factor of the standard min-max Bellman operator. This implies that, when the model information is known, the min-max value can be computed faster using our proposed scheme. We then proceed to develop a generalized minimax Q-learning algorithm based on the modified min-max Q-Bellman operator.
The minimax Q-learning algorithm has been presented in [2]. Two-player general sum games are those where the payoffs of the agents are unrelated in general. If the payoff of an agent is the negative of the payoff of another agent, the game reduces to a zero-sum game. A Nash Q-learning algorithm for solving general sum games is proposed in [3]. In [4], Friend-or-Foe (FF) Q-learning for general sum games is proposed and is shown to have stronger convergence properties compared to Nash Q-learning. A generalization of Nash Q-learning and FF Q-learning, namely correlated Q-learning, is discussed in [5]. In [6], desirable properties for an agent learning in multi-agent scenarios are studied and a new learning algorithm namely “WoLF policy hill climbing” is proposed. Surveys of algorithms for multi-agent learning and multi-agent Reinforcement learning are provided in [7, 8].
We now discuss some variants of minimax Q-learning in the literature. In [9], the minimax TD-learning algorithm that utilizes the concept of temporal difference learning is proposed. The minimax version of the Deep Deterministic policy gradient algorithm has been recently developed in [10]. However, no convergence proofs or theoretical guarantees are provided.
The concept of successive relaxation in the context of Markov Decision Processes (MDPs) was first applied in [11]. In our recent work [12], we proposed successive over-relaxation Q-learning for model-free MDPs (i.e., in the single-agent scenario). In this work, we extend the concept of successive relaxation to two-player zero-sum games and propose a provably convergent generalized minimax Q-learning algorithm. The contributions of the paper are as follows:
- We present a modified min-max Q-Bellman operator for two-player zero-sum Markov games and show that the operator is a max-norm contraction.
- We show that under some assumptions, the contraction factor of the modified min-max Q-Bellman operator is smaller than that of the standard min-max Q-Bellman operator.
- We propose a model-free generalized minimax Q-learning algorithm and prove its almost sure convergence using ODE based analysis of stochastic approximation, under an assumption on the boundedness of iterates.
- We discuss an interesting relation between standard minimax Q-learning and our proposed algorithm.
- Finally, through experimental evaluation, we show that our proposed algorithm has a better performance compared to the standard minimax Q-learning algorithm.
We note here that the Successive Over Relaxation (SOR) technique utilized to derive our algorithm and the stochastic approximation arguments employed in the convergence analysis are well-known in the literature. Our contribution consists of applying these techniques to derive and analyze a generalized minimax Q-learning algorithm that has faster convergence.
II Background and Preliminaries
In this paper, we consider the setting of two-player zero-sum Markov games [13]. The two players in the game are referred to as agent 1 and agent 2. A two-player zero-sum Markov game is characterized by the tuple $(S, U, V, p, r, \alpha)$, where $S$ is the set of states that both the agents observe, $U$ is the finite set of actions of agent 1, $V$ is the finite set of actions of agent 2, and $p$ denotes the transition probability rule, i.e., $p(j|i,u,v)$ denotes the probability of transition to state $j$ from state $i$ when actions $u \in U$ and $v \in V$ are chosen by agents 1 and 2, respectively. Let $r(i,u,v)$ denote the single-stage payoff obtained by agent 1 in state $i$ when actions $u$ and $v$ are chosen by agents 1 and 2, respectively. Note that, in the case of a zero-sum Markov game, the payoff of agent 2 is the negative of the payoff obtained by agent 1. Also, $\alpha \in [0,1)$ denotes the discount factor. The goals of the two agents in the Markov game are to individually learn the optimal policies $\pi_1 : S \to \Delta^{|U|}$ and $\pi_2 : S \to \Delta^{|V|}$, respectively, where $\Delta^{d}$ denotes the probability simplex in $\mathbb{R}^{d}$ and $\pi_1(i)$ (resp. $\pi_2(i)$) indicates the probability distribution over actions to be taken by agent 1 (resp. agent 2) in state $i$ that maximizes (resp. minimizes) the discounted objective given by:
[TABLE]
where is the state of the game at time , and is the expectation taken over the states obtained over time . Let denote the min-max value in state obtained by solving (1). It can be shown ( [14, Chapter 7]) that the min-max value function, , satisfies the following fixed point equation in , given by:
[TABLE]
where is a matrix of size , whose entry is given by and the function , for a given matrix , is defined as follows:
[TABLE]
where and , respectively. The system of equations in (2) can be rewritten as:
[TABLE]
where the operator , for a given , is defined as:
[TABLE]
The operator and the set of equations (2) are analogous to the Bellman operator and the Bellman optimality condition, respectively, for Markov Decision Processes (MDPs) [14].
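As a concrete illustration of the matrix-game value that appears in the fixed point equation, the following minimal Python sketch computes the value of a 2×2 zero-sum matrix game in closed form, with the row player (agent 1) maximizing. The helper name `val_2x2` and the restriction to two actions per agent are assumptions made purely for illustration; for general action sets the value is obtained from a linear program.

```python
def val_2x2(A):
    """Value of a 2x2 zero-sum matrix game; the row player maximizes.

    Hypothetical helper for illustration only (not taken from the paper).
    """
    (a, b), (c, d) = A
    maximin = max(min(a, b), min(c, d))   # best pure guarantee for the row player
    minimax = min(max(a, c), max(b, d))   # best pure guarantee for the column player
    if maximin == minimax:                # a pure-strategy saddle point exists
        return float(maximin)
    # No saddle point: both players mix, and the value has a closed form.
    return (a * d - b * c) / (a + d - b - c)


matching_pennies = [[1, -1], [-1, 1]]     # no pure saddle point
with_saddle = [[3, 1], [2, 0]]            # saddle point at the top-right entry
print(val_2x2(matching_pennies))
print(val_2x2(with_saddle))
```

In the Markov game, the matrix passed to such a routine would be the stage matrix of a state, whose entries combine the immediate payoff with the discounted expected value of the next state.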
III The Proposed Algorithm
We describe a single iteration of the synchronous version [15] of our proposed algorithm in Algorithm 1 below. At each iteration $n$, the Q-values of all state-action tuples $(i,u,v)$ are updated as shown in step 4 of Algorithm 1.
Remark 1.
Note that step 3 of Algorithm 1 requires computing the value of a matrix game, $\text{val}[\cdot]$, which amounts to solving a linear program. Also observe that, compared to the standard minimax Q-learning, the generalized minimax Q-learning algorithm only requires one additional such computation, namely the value of the matrix game at the current state.
Remark 2.
Let be the solution obtained by Algorithm 1 upon termination after iterations. Then the approximate min-max value, for a given state is obtained as follows:
[TABLE]
and the corresponding approximate policies of the agents are obtained as:
[TABLE]
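The Q-value update of the generalized algorithm can be sketched for a single sampled transition as follows. This is a minimal illustration, assuming two actions per agent so that the matrix-game value has a closed form, and assuming the target takes the relaxed form $w\,(r + \alpha\,\text{val}[Q(j)]) + (1-w)\,\text{val}[Q(i)]$; the names `val_2x2` and `generalized_update`, the dictionary layout of `Q`, and all numbers are hypothetical.

```python
def val_2x2(A):
    """Value of a 2x2 zero-sum matrix game (row player maximizes); illustrative helper."""
    (a, b), (c, d) = A
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                # pure-strategy saddle point
        return float(maximin)
    return (a * d - b * c) / (a + d - b - c)


def generalized_update(Q, i, u, v, r, j, alpha, w, gamma):
    """One sample update of Q[i][u][v] after observing payoff r and next state j."""
    # Relaxed target: w times the usual minimax target plus (1 - w) times
    # the value of the matrix game at the current state.
    target = w * (r + alpha * val_2x2(Q[j])) + (1.0 - w) * val_2x2(Q[i])
    Q[i][u][v] += gamma * (target - Q[i][u][v])
    return Q[i][u][v]


# Q[state] is a 2x2 matrix indexed by (agent 1 action, agent 2 action).
Q = {0: [[1.0, -1.0], [-1.0, 1.0]], 1: [[2.0, 0.0], [1.0, 3.0]]}
new_q = generalized_update(Q, i=0, u=0, v=1, r=0.5, j=1,
                           alpha=0.6, w=1.2, gamma=0.1)
print(new_q)
```

Setting `w = 1.0` in this sketch removes the second term of the target, so the update coincides with the standard minimax Q-learning update.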
IV Convergence Analysis
Let denote the probability simplex in . For matrix and , recall that the value of the matrix is defined as . Note that the norm considered in this section is the max-norm, i.e., norm of the vector is . We first derive a few properties of the operator that would be used in the subsequent analysis.
Lemma 1.
Suppose , then
[TABLE]
Proof.
[TABLE]
In the above, , and , respectively. Similarly,
[TABLE]
[TABLE]
Therefore Note the repeated application of the facts , in the arguments. This completes the proof. ∎
Corollary 1.
Consider , then
[TABLE]
Proof.
Using Lemma 1 with , we get:
[TABLE]
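A quick numerical sanity check of these properties can be sketched as follows, assuming Lemma 1 states the standard non-expansiveness property $|\text{val}[A] - \text{val}[B]| \leq \|A - B\|$ in the max-norm and Corollary 1 the bound $|\text{val}[B]| \leq \|B\|$. The helper names, the restriction to 2×2 integer matrices, and the random test set are illustrative assumptions.

```python
import random


def val_2x2(A):
    """Value of a 2x2 zero-sum matrix game (row player maximizes); illustrative helper."""
    (a, b), (c, d) = A
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                # pure-strategy saddle point
        return float(maximin)
    return (a * d - b * c) / (a + d - b - c)


def max_norm_diff(A, B):
    """Max-norm distance between two 2x2 matrices."""
    return max(abs(A[i][j] - B[i][j]) for i in range(2) for j in range(2))


def count_violations(trials=2000, seed=0, tol=1e-9):
    """Count random matrix pairs that violate |val[A] - val[B]| <= ||A - B||."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        A = [[rng.randint(-5, 5) for _ in range(2)] for _ in range(2)]
        B = [[rng.randint(-5, 5) for _ in range(2)] for _ in range(2)]
        if abs(val_2x2(A) - val_2x2(B)) > max_norm_diff(A, B) + tol:
            violations += 1
    return violations


print(count_violations())
```

A check of this kind does not replace the proof; it only confirms that the inequality is not contradicted on the sampled matrices.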
Lemma 2.
Let , where Then for constants , and
[TABLE]
Proof.
By definition of the val operator, for , ,
[TABLE]
Recall that for a given stochastic game the min-max value function satisfies [16] the system of equations,
[TABLE]
where is a matrix with entry
[TABLE]
and the system of equations can be reformulated as the fixed point equation, , with being a contraction under the max-norm with contraction factor .
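We define a quantity $w^*$ as follows:
$$w^* = \min_{i,u,v} \frac{1}{1-\alpha\, p(i|i,u,v)}. \tag{6}$$
As the probabilities $p(i|i,u,v) \geq 0$, it is clear that $w^* \geq 1$. For $0 < w \leq w^*$, we now define a modified operator $T_w$ as follows [11]:
$$T_w V = w\, TV + (1-w)\, V,$$
where $w$ represents a prescribed relaxation factor. Note that $T_w$ is in general not a convex combination of $T$ and the identity operator since we allow $w > 1$ as $w^* \geq 1$ (see above). Let $V^*$ denote the min-max value of the Markov game. Therefore, $TV^* = V^*$. Now,
$$T_w V^* = w\, TV^* + (1-w)\, V^* = w V^* + (1-w) V^* = V^*.$$
Therefore, the min-max value function $V^*$ is also a fixed point of $T_w$.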
Next, we derive a modified min-max Q-Bellman operator for the two-player zero-sum game. Let be defined as follows:
[TABLE]
Now let
[TABLE]
Let with . Then,
[TABLE]
Hence the equation (8) can be rewritten as follows:
[TABLE]
Let be defined as follows. For ,
[TABLE]
is the modified Q-Bellman operator for the two-player zero-sum Markov game.
Lemma 3.
For with as in (6), the map is a max-norm contraction and is the unique fixed point of .
Proof.
From equation (IV), is a fixed point of . Therefore, it is enough to show that is a contraction operator (which will also ensure its uniqueness). For , we have
[TABLE]
[TABLE]
Since the RHS is not a function of , we have
[TABLE]
Note the use of the assumption $0 < w \leq w^*$ (with $w^*$ as in (6)) in equation (10), which ensures that the term $\big(1-w+w\alpha p(i|i,u,v)\big)\geq 0$, to arrive at equation (11). Also, equation (12) is obtained by an application of Lemma 1 in equation (11). From the assumptions on $w$ and the discount factor $\alpha$, it is clear that $0 \leq (1-w+\alpha w) < 1$. Therefore $H_w$ is a max-norm contraction with contraction factor $(1-w+\alpha w)$, and its fixed point is unique. ∎
Lemma 4.
$T_w$ is a contraction with contraction factor $(1-w+\alpha w)$.
Proof.
The proof is analogous to the proof of the Lemma 3. ∎
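Lemma 5.
For $1 \leq w \leq w^*$, the contraction factor for the map $H_w$ satisfies
$$1 - w + \alpha w \leq \alpha.$$
Proof.
For $1 \leq w \leq w^*$, define $f(w) = 1 - w + \alpha w$. Let $w_1 < w_2$. Then $f(w_1) - f(w_2) = (1-\alpha)(w_2 - w_1) > 0$. Hence $f$ is decreasing. In particular, for $w > 1$, $f(w) < f(1) = \alpha$. This shows that, if $w^* > 1$ and $w$ is chosen such that $1 < w \leq w^*$, the contraction factor is strictly smaller than $\alpha$. ∎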
Remark 3.
Depending on the choice of $w$, the following observations can be made about our proposed generalized minimax Q-learning algorithm (refer to Algorithm 1).
- Case I ($w = 1$): The generalized minimax Q-learning reduces to standard minimax Q-learning.
- Case II ($w < 1$): The contraction factor of $H_w$ in this case satisfies $1 - w + \alpha w > \alpha$, giving rise to a minimax Q-learning algorithm with slower convergence.
- Case III ($w > 1$): For this choice of $w$, it is required that $p(i|i,u,v) > 0$ for all $(i,u,v)$ (refer to equation (6)). Under this condition, as shown in Lemma 5, the contraction factor of $H_w$ satisfies $1 - w + \alpha w < \alpha$, giving rise to a faster minimax Q-learning algorithm.
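The three cases can be illustrated numerically with a short sketch. It assumes the largest admissible relaxation factor is the minimum of $1/(1-\alpha\,p(i|i,u,v))$ over all tuples and that the contraction factor of the relaxed operator is $1 - w + \alpha w$; the function names and the self-transition probabilities are hypothetical.

```python
def w_star(self_loop_probs, alpha):
    """Largest admissible relaxation factor: min of 1 / (1 - alpha * p(i|i,u,v))."""
    return min(1.0 / (1.0 - alpha * p) for p in self_loop_probs)


def contraction_factor(w, alpha):
    """Contraction factor of the relaxed operator for relaxation factor w."""
    return 1.0 - w + alpha * w


alpha = 0.6
self_loops = [0.5, 0.7, 0.9]          # hypothetical p(i|i,u,v) values, all positive
w_opt = w_star(self_loops, alpha)

for w in (0.8, 1.0, w_opt):           # Case II, Case I, Case III
    print(f"w = {w:.4f}  contraction factor = {contraction_factor(w, alpha):.4f}")
```

The smallest self-transition probability determines `w_opt`, so a single tuple with zero self-transition probability forces `w_opt = 1` and removes the speed-up.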
Lemma 6.
Let and be the fixed point of . Then for all ,
[TABLE]
Moreover, .
Proof.
By the hypothesis on , . Since is the fixed point of , we have
[TABLE]
Therefore
[TABLE]
This completes the proof. ∎
This lemma is an interesting and important result in our paper. It shows that, even if the standard minimax Q-value iterates and generalized minimax Q-value iterates are not the same for all tuples, the min-max values at each state given by both the algorithms are equal. Therefore, this lemma states that generalized minimax Q-value iteration computes the min-max value function, which is the goal of the two-player zero-sum Markov game.
We now show the convergence of generalized minimax Q-learning (refer to Algorithm 1). For this purpose, we first state the following result (Proposition 4.5 on page 157 of [14]) and apply it to show the convergence of our proposed algorithm. We consider the step-sizes to be deterministic, as in our algorithm, unlike [14] where these are allowed to be random.
Theorem 1.
Let be the sequence generated by the iteration
[TABLE]
- The step-sizes are non-negative and satisfy
[TABLE]
- The noise terms satisfy
  - For every and , we have where
[TABLE]
  - Given any norm on , there exist constants and such that
[TABLE]
- The mapping is a max-norm contraction.
Then, converges to the unique fixed point of , with probability 1.
Theorem 2.
Given a finite state-action two-player zero-sum Markov game with bounded payoffs i.e. , the generalized minimax Q-learning algorithm (see Algorithm 1) given by the update rule:
[TABLE]
converges with probability 1 to as long as
[TABLE]
for all .
Proof.
The update rule of the algorithm is given by
[TABLE]
Let $\mathcal{F}_{n}=\sigma\big(\{Q_{0},Y_{j},\forall j<n\}\big)$, $n\geq 0$, be the associated filtration. Now observe that Also, given , assume that the random variables are independent. Then the above equation can be rewritten as:
[TABLE]
where
[TABLE]
and
[TABLE]
Now note, from Lemma 3, that the mapping is a max-norm contraction. Also, by the definition of , we have that is measurable . Further,
[TABLE]
Finally, as is independent of , we have
[TABLE]
where and $D=3\big(\alpha^{2}w^{2}+(1-w)^{2}\big)$. Here the first inequality follows from the fact:
[TABLE]
The second inequality follows from the following facts:
[TABLE]
Therefore by Theorem 1, with probability 1, the generalized minimax Q-learning iterates converge. By virtue of Lemma 6, our proposed minimax Q-learning algorithm computes a policy whose value is the min-max value of the Markov game. ∎
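The statement that the standard and generalized fixed points yield the same min-max values can be checked on a small example. The sketch below is model-based (it iterates the operators directly rather than learning from samples) and assumes the relaxed Q-operator has the form $w\big(r(i,u,v) + \alpha \sum_j p(j|i,u,v)\,\text{val}[Q(j)]\big) + (1-w)\,\text{val}[Q(i)]$; the two-state game, the helper names, and the iteration count are hypothetical.

```python
def val_2x2(A):
    """Value of a 2x2 zero-sum matrix game (row player maximizes); illustrative helper."""
    (a, b), (c, d) = A
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                # pure-strategy saddle point
        return float(maximin)
    return (a * d - b * c) / (a + d - b - c)


def apply_H(Q, r, p, alpha, w):
    """One application of the relaxed Q-operator; w = 1 gives the standard one."""
    n = len(Q)
    vals = [val_2x2(Q[i]) for i in range(n)]
    return [[[w * (r[i][u][v]
                   + alpha * sum(p[i][u][v][j] * vals[j] for j in range(n)))
              + (1.0 - w) * vals[i]
              for v in range(2)] for u in range(2)] for i in range(n)]


def fixed_point(r, p, alpha, w, iters=500):
    """Approximate the fixed point by repeated application of the operator."""
    Q = [[[0.0, 0.0], [0.0, 0.0]] for _ in r]
    for _ in range(iters):
        Q = apply_H(Q, r, p, alpha, w)
    return Q


# Hypothetical two-state game: payoffs r[i][u][v], transitions p[i][u][v][j].
r = [[[1.0, -1.0], [-1.0, 1.0]], [[2.0, 0.0], [1.0, 3.0]]]
p = [[[[0.6, 0.4], [0.7, 0.3]], [[0.5, 0.5], [0.8, 0.2]]],
     [[[0.3, 0.7], [0.4, 0.6]], [[0.5, 0.5], [0.2, 0.8]]]]
alpha = 0.6
w_star = min(1.0 / (1.0 - alpha * p[i][u][v][i])
             for i in range(2) for u in range(2) for v in range(2))

Q_std = fixed_point(r, p, alpha, w=1.0)
Q_gen = fixed_point(r, p, alpha, w=w_star)
gap = max(abs(val_2x2(Q_std[i]) - val_2x2(Q_gen[i])) for i in range(2))
print(w_star, gap)
```

The Q-tables themselves generally differ entry by entry; only the per-state matrix-game values are expected to agree.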
IV-A Extension to the asynchronous setting
In the setting considered above, the updates are synchronous, i.e., Q-values of all state-action pairs are updated at every iteration. However, in the case of online settings, only a single sample is obtained through the interaction with the environment. In the following, we describe the convergence of our algorithm in the asynchronous settings. The following assumption on the structure of probability transition matrix and the control policies [15, Page 130] is necessary in the asynchronous setting:
Assumption 1.
The Markov chain induced by all the control policies is ergodic. Moreover, under each policy, every action can be picked with a positive probability in any state.
The latter requirement in Assumption 1 is satisfied, for instance, by policies such as $\epsilon$-greedy, see [17]. We first state a result from [18, Theorem 3] and apply it to show the convergence of our proposed algorithm. Let be an infinite subset of and let be the sequence generated by the iteration
[TABLE]
Here, is a vector of possibly outdated components of . In particular, we let
[TABLE]
where each is an integer satisfying representing the delay in information about component available while updating component at time . If then this reduces to the synchronous setting.
Let $\mathcal{F}_{n}=\sigma\big\{r_{0}(i),\cdots,r_{n}(i),\gamma_{0}(i),\cdots,\gamma_{n}(i),\tau^{i}_{j}(0),\cdots,\tau^{i}_{j}(n),N_{0}(i),\cdots,N_{n-1}(i),\ 1\leq i,j\leq m\big\}$. It is important to note from the construction of $\mathcal{F}_{n}$ that the step-size sequences are in general allowed to be random. Thus, the component to be updated at time can be decided online based on the history until time .
Assumption 2.
For any and , with probability 1.
Assumption 3.
For every and , is measurable and .
Assumption 4.
Assumption 5.
The step-sizes are non-negative and satisfy
[TABLE]
Assumption 6.
There exists a vector , a positive vector , a scalar , such that
[TABLE]
Theorem 3.
Under Assumptions 2-6, converges to the unique fixed point of , with probability 1.
Theorem 4.
Consider a finite state-action two-player zero-sum Markov game with bounded payoffs i.e. . Let the sample at iteration be . Then, under Assumption 1, the asynchronous generalized minimax Q-learning algorithm given by the update rule:
[TABLE]
converges with probability 1 to for all .
Proof.
Assumption 2 is trivially satisfied as there is no delay in information during the training. Hence . Assumptions 3 and 4 are shown in (16) and (17), respectively. In order for Assumption 5 to be true, all state and action pairs have to be visited infinitely often, which is ensured through Assumption 1. Finally, from Lemma 3,
[TABLE]
However, as is the unique fixed point, we have,
[TABLE]
thereby proving Assumption 6. Therefore, by Theorem 3, with probability 1, the asynchronous generalized minimax Q-learning iterates converge to . ∎
V Relation between Generalized Minimax Q-learning and standard Minimax Q-learning
In this section, we describe the relation between our proposed Generalized Minimax Q-learning and standard Minimax Q-learning algorithms. For the given two-player zero-sum Markov game , we construct a new game as follows:
- •
- •
and for a given , let be defined as
[TABLE]
where . We note that is a probability mass function on .
Now consider the standard minimax -Bellman operator for this game given by, and , where is dimensional matrix with entry as and is given by the equation (3). Note that
[TABLE]
Hence operator of the game is same as the operator defined for the game . Let us consider an iteration of the minimax -learning algorithm on given by
[TABLE]
where is the step-size sequence, , $\bar{N}_{n}(i,u,v)=\Big(wr(i,u,v)+(1-w+w\alpha)\,\text{val}[\bar{Q}_{n}(\bar{Y}_{n}(i,u,v))]\Big)-\bar{H}\bar{Q}_{n}(i,u,v)$, and compare it with an iteration of Generalized minimax Q-learning. Since $\bar{H} = H_w$, both algorithms converge to the fixed point of $H_w$, and differ only in the per-iterate noise $\bar{N}_n$ and $N_n$.
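The construction of the new game can be illustrated with a small sketch. It assumes the transformed game uses payoff $w\,r(i,u,v)$, discount factor $1 - w + w\alpha$, and transition kernel $q(j|i,u,v) = \big(w\alpha\,p(j|i,u,v) + (1-w)\,\mathbb{1}\{j = i\}\big)/(1 - w + w\alpha)$, which is the kernel for which the standard operator of the new game coincides with the relaxed operator of the original game. The function name and the numbers are hypothetical.

```python
def transformed_kernel(p_row, i, alpha, w):
    """Build q(.|i,u,v) from p(.|i,u,v) for one fixed tuple (i, u, v)."""
    alpha_bar = 1.0 - w + w * alpha       # discount factor of the new game
    return [(w * alpha * pj + (1.0 - w) * (1.0 if j == i else 0.0)) / alpha_bar
            for j, pj in enumerate(p_row)]


alpha, w = 0.6, 1.25
p_row = [0.5, 0.3, 0.2]                   # hypothetical p(.|i=0,u,v)
q_row = transformed_kernel(p_row, i=0, alpha=alpha, w=w)
print(q_row, sum(q_row))
```

The entry for $j = i$ stays non-negative exactly when $1 - w + w\alpha\,p(i|i,u,v) \geq 0$, which is the same condition that bounds the admissible relaxation factor.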
Lemma 7.
Suppose are the iterates of Generalized minimax Q-learning. Then given any there exists a natural number that is possibly sample path dependent, such that , for almost surely.
Proof.
Consider the iterates of the minimax Q-learning algorithm with respect to the stochastic game with initial point . Now assume that (induction hypothesis). Then
[TABLE]
Therefore by induction . As the sequences and converge to , given there exists a natural number such that . Moreover . To conclude we have almost surely and here is possibly sample path dependent. This completes the proof. ∎
Remark 4.
We invoke the standard Q-learning algorithm on with the initial point chosen such that to prove the Lemma 7. It is also possible to obtain the same desired conclusion by directly utilizing the convergence of the iterates of standard Q-learning algorithm on for any arbitrary initial point .
VI Model-free Generalised Minimax Q-learning
Note that an input to Algorithm 1 is the relaxation parameter $w$, which must satisfy $0 < w \leq w^*$, where $w^*$ is defined in (6). As $w^*$ depends on the transition probability function $p$, it is not possible to choose a valid $w$ in the experiments, where we do not have access to the probability transition function. In this section, we describe a synchronous version of the model-free generalised minimax Q-learning procedure that mitigates the dependency on the model information.
We maintain a count value (initialised to zero ) that represents the number of times the sample has been encountered until iteration . We define
[TABLE]
with . It is easy to see that
[TABLE]
as , almost surely (from the Strong Law of Large Numbers).
Now, we propose our model-free “generalised minimax Q-learning” by modifying the Step 3 of Algorithm 1 as:
[TABLE]
where the sequence is updated as:
[TABLE]
with .
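The idea of replacing the unknown self-transition probabilities by sample-based estimates can be sketched as follows. This is a simplified plug-in estimate written for illustration and is not necessarily the recursion used above: it estimates each $p(i|i,u,v)$ by the empirical frequency of self-transitions and takes the minimum of $1/(1-\alpha\,\hat{p})$ over the tuples. The function name, the two-state simulator, and the sample sizes are hypothetical.

```python
import random


def estimate_w(samples_by_tuple, alpha):
    """Plug-in relaxation factor from empirical self-transition frequencies.

    samples_by_tuple maps (i, u, v) to a list of observed next states.
    """
    w = float("inf")
    for (i, _u, _v), next_states in samples_by_tuple.items():
        p_hat = sum(1 for j in next_states if j == i) / len(next_states)
        w = min(w, 1.0 / (1.0 - alpha * p_hat))
    return w


random.seed(0)
alpha = 0.6
true_self_loop = {(0, 0, 0): 0.5, (1, 0, 0): 0.8}   # hypothetical two-state game
samples = {
    key: [key[0] if random.random() < prob else 1 - key[0] for _ in range(5000)]
    for key, prob in true_self_loop.items()
}
w_n = estimate_w(samples, alpha)
print(w_n)
```

Because every empirical frequency lies in $[0, 1]$, such an estimate always falls in the interval $[1, 1/(1-\alpha)]$, regardless of the samples drawn.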
VI-A Convergence Analysis:
We write the two update equations as follows:
[TABLE]
The function is defined as:
[TABLE]
The sequence defined as:
[TABLE]
is a martingale difference noise sequence with respect to the increasing fields
, satisfying
[TABLE]
where $K=\max\Big\{3\Big(\max_{i,u,v}|r(i,u,v)|\Big)^{2},\frac{6\alpha^{2}}{(1-\alpha)^{2}}\Big\}$. The function is defined as:
[TABLE]
Finally, where is updated as shown in the equation (18). Note that, from (19), we get
[TABLE]
Notice from (22)-(23) that the $Q$-recursion in (22) depends on the $w$-update in (23), while the latter is an independent update that does not depend on the $Q$-iterates. Let be the (unique) fixed point of . Note that, from (21), $w_{n}\in\Big[1,\frac{1}{1-\alpha}\Big]$, $\forall n\geq 0$. Therefore, the $w_n$ updates are bounded. We now make an assumption on the boundedness of the $Q$-iterates.
Assumption 7.
.
In practice, the iterates will satisfy Assumption 7 if they are projected to a prescribed compact set whenever they exit it, see for instance, [19, Chapter 5] for a general setting of projected stochastic approximation. From Lemma 7, the solution . Therefore, we can choose the set such that .
Lemma 8.
Functions and are Lipschitz.
Proof.
Consider and . Let . Then,
[TABLE]
Hence, , where . Finally, . Therefore the functions and are Lipschitz. ∎
We now consider the iterates (22)-(23) in a combined form as follows:
[TABLE]
, , . Let be the fixed point of the modified min-max Q-Bellman operator (see (8)) when is used.
Theorem 5.
, where , almost surely.
Proof.
The iterates in (26) track the ODE [15, Section 2.2] . Note that the $w_n$ iterates drive the $Q_n$ iterates but the reverse is not true, i.e., it is a one-way coupling of the dynamics. First, consider the ODE . Let . The function exists and is equal to . Moreover, the origin is the unique globally asymptotically stable equilibrium for the ODE
[TABLE]
with serving as an associated Lyapunov function. Further, is the unique globally asymptotically stable equilibrium for the ODE . Therefore, by [15, Theorem 7, Chapter 3 and Theorem 2 - Corollary 4], we have almost surely. The iterates now track the ODE given by . By virtue of Lemma 3, is a contraction. Hence, by Stochastic fixed point analysis [15, Section 10.3], we have , almost surely. ∎
Remark 5.
One way to ensure that Assumption 7 holds is to project the iterates onto a prespecified convex and compact set. Under projection, the update equation (22), i.e.,
[TABLE]
is replaced with
[TABLE]
where is the projection of onto a compact and convex set such as . Convexity of would ensure that is a unique fixed point for any .
The iterates in (28) track the ODE [19, Chapter 5]
[TABLE]
where the operator for a continuous function is defined as:
[TABLE]
From [15, Theorem 2, Chapter 2], iterates converge to a compact, connected, internally chain transitive, invariant set of the ODE (29). It is easy to see that is an invariant and internally chain transitive (ICT) set of the ODE (29). However, the projection operation will introduce spurious fixed points on the boundary of the set that will also be invariant and ICT sets of the ODE (29). In [15, Chapter 5.4], some practical techniques are discussed to avoid convergence to undesired equilibrium points (boundary points in this case).
VII Experiments and Results
We refer to Algorithm 1 with $w = w^*$ as “Generalised optimal minimax Q-learning” and to the model-free algorithm derived in the previous section as “Generalised minimax Q-learning” in the experiments. We generate a two-player zero-sum Markov game and run all the algorithms for independent episodes in each of the three cases - (a). states and actions for each of the agents, (b). states and actions for each of the agents and (c). states and actions for each of the agents. The discount factor is set to . The probability transition matrix generated satisfies $p(i|i,u,v) > 0$ for all $(i,u,v)$, as this condition is required for faster performance of the generalized optimal minimax Q-learning and generalized minimax Q-learning. All the algorithms are run for iterations in each episode with the same step-size sequences.
The comparison criterion considered is the average error that is calculated as follows. At the end of each episode of the algorithm, the norm difference between estimate of the min-max value function and the actual min-max value function is computed. This process is repeated for all the episodes and the average is computed. Thus,
[TABLE]
where is the min-max value function of the game and is the minimax Q-value function estimate obtained at the end of episode.
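The comparison criterion can be sketched in a few lines. The norm used in the experiments is not specified in the text above, so this sketch assumes the max-norm, consistent with the convergence analysis; the function name and the numbers are hypothetical, and each per-episode estimate is taken to be the list of per-state values obtained from the final Q-table.

```python
def average_error(estimates, v_star):
    """Mean over episodes of the max-norm distance to the true value function."""
    errors = [max(abs(e - t) for e, t in zip(est, v_star)) for est in estimates]
    return sum(errors) / len(errors)


v_star = [1.0, 1.5]                       # hypothetical true min-max values
estimates = [[1.0, 2.0], [1.5, 1.0]]      # hypothetical estimates from two episodes
print(average_error(estimates, v_star))
```

Swapping in a different norm only changes the inner `max(...)` expression.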
In Table I, we report the average error of the three algorithms. We can see that generalized optimal minimax Q-learning has the least average error, followed by the generalized minimax Q-learning algorithm. This is expected as the generalized optimal minimax Q-learning algorithm makes use of the optimal relaxation parameter in its updates, which is not practically feasible. Therefore, we conclude that our proposed generalized minimax Q-learning algorithms perform empirically better (in terms of number of samples) than the standard minimax Q-learning algorithm.
VIII Conclusions
In this work, we use the technique of successive relaxation to propose a modified min-max Bellman operator for two-player zero-sum games. We prove that the contraction factor of this modified min-max Bellman operator is less than the discount factor (the contraction factor of the standard min-max Bellman operator) for the choice of $1 < w \leq w^*$. The construction of the modified Q-Bellman operator enabled us to develop a generalized minimax Q-learning algorithm. We show the almost sure convergence of our proposed algorithm. We then derive a relation between our proposed algorithm and the standard minimax Q-learning algorithm. We also propose a model-free (from samples) version of our algorithm and prove its convergence under the boundedness of iterates assumption. In the future, we would like to incorporate function approximation architectures and apply our proposed algorithm to practical applications. Moreover, as future work, we would like to explore the theoretical sample complexity of our algorithm and compare it with that of minimax Q-learning.
IX Acknowledgements
Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. Shalabh Bhatnagar was supported by the J.C.Bose Fellowship, a project from DST under the ICPS Program and the RBCCPS, IISc.
References
[1] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 2013, vol. 2.
[2] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994. Elsevier, 1994, pp. 157–163.
[3] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
[4] M. L. Littman, “Friend-or-foe Q-learning in general-sum games,” in ICML, vol. 1, 2001, pp. 322–328.
[5] A. Greenwald, K. Hall, and R. Serrano, “Correlated Q-learning,” in ICML, vol. 3, 2003, pp. 242–249.
[6] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 1021–1026.
[7] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[8] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” arXiv preprint arXiv:1911.10635, 2019.
