A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games

Raghuram Bharadwaj Diddigi; Chandramouli Kamanchi; Shalabh Bhatnagar

arXiv:1906.06659·cs.LG·March 21, 2022

TL;DR

This paper introduces a generalized minimax Q-learning algorithm for two-player zero-sum stochastic games, extending successive relaxation techniques to improve computation speed and proving its convergence without known model information.

Contribution

The paper develops a novel generalized minimax Q-learning algorithm for zero-sum games, extending successive relaxation methods and providing convergence proof under stochastic approximation.

Findings

01 Faster computation of min-max values under certain game structures

02 Convergence of the proposed algorithm is proven using stochastic approximation

03 Experimental results demonstrate the algorithm's effectiveness

Abstract

We consider the problem of two-player zero-sum games. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the min-max payoff, starting from a given state is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation that has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games. We show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the…

Tables

Table I: Comparison of error among the three algorithms, averaged across 50 episodes.

| Algorithm | 10 states | 20 states | 50 states |
| --- | --- | --- | --- |
| Standard minimax Q-learning | $0.68 \pm 0.07$ | $1.67 \pm 0.13$ | $3.99 \pm 0.11$ |
| Generalized minimax Q-learning | $0.49 \pm 0.08$ | $1.43 \pm 0.18$ | $3.75 \pm 0.12$ |
| Generalized optimal minimax Q-learning | $0.35 \pm 0.08$ | $1.26 \pm 0.19$ | $3.59 \pm 0.14$ |

Equations

$$\min_{\pi_{2}}\max_{\pi_{1}}\sum_{t=0}^{\infty}\mathbb{E}\Big[\sum_{u=1}^{|U|}\sum_{v=1}^{|V|}\alpha^{t}x^{u}_{t}y^{v}_{t}r(s_{t},u,v)\mid s_{0}=i\Big],$$

$$J(i)=\text{val}[Q(i)],\ \forall i\in S,$$

$$\text{val}[A]=\min_{y}\max_{x}x^{T}Ay,$$

$$J=TJ,$$

$$(TJ)(i)=\text{val}[Q(i)],\ \forall i\in S.$$

$$\tilde{J}(i)=\text{val}[Q_{N}(i)]$$

$$(\tilde{\pi}_{1}(i),\tilde{\pi}_{2}(i))\in\arg\text{val}[Q_{N}(i)].$$

$$|\text{val}[B]-\text{val}[C]|\leq\max_{i,j}|b_{ij}-c_{ij}|=\|B-C\|$$

$$|\text{val}[B]|\leq\|B\|.$$

$$\text{val}[\beta A+kE]=\beta\,\text{val}[A]+k.$$

$$\text{val}[\beta A+kE]=\beta\min_{y}\max_{x}x^{T}Ay+k\ \ (\text{since }x^{T}Ey=1)\ =\beta\,\text{val}[A]+k.$$

$$J(i)=\text{val}[Q(i)],\ \forall i\in S,$$

$$Q(i,u,v)=r(i,u,v)+\alpha\sum_{j\in S}p(j|i,u,v)J(j),$$

$$w^{*}\triangleq\min_{i,u,v}\Bigg\{\frac{1}{1-\alpha p(i|i,u,v)}\Bigg\}.$$

$$(T_{w}J)(i)=w\,(TJ)(i)+(1-w)J(i),$$

$$(T_{w}J^{*})(i)=wJ^{*}(i)+(1-w)J^{*}(i)=J^{*}(i).$$

$$Q^{\dagger}(i,u,v):=\ \ldots\ +(1-w)J^{*}(i).$$

$$Q^{*}(i,u,v)=\bigg(r(i,u,v)+\alpha\sum^{M}_{j=1}p(j|i,u,v)J^{*}(j)\bigg)$$

Taxonomy

Methods: Q-Learning

Full text

A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games

Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar. The authors are with the Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560012, India.

Abstract

We consider the problem of two-player zero-sum games. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the min-max payoff, starting from a given state is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation that has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games. We show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques, under an assumption on the boundedness of iterates. Through experiments, we demonstrate the effectiveness of our proposed algorithm.

0018-9286 ©2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. A version of this paper is accepted for publication at the IEEE Transactions on Automatic Control. DOI: 10.1109/TAC.2022.3159453

I Introduction

In two-player zero-sum games, there are two agents that are competing against each other in a common environment. Based on the actions taken by the agents, they receive a payoff corresponding to the current state and the environment transitions to the next state. The objective of an agent (say agent 1) is to compute a sequence of actions starting from a given state to maximize the total discounted payoff. On the other hand, the objective of the second agent (agent 2) is to compute a sequence of actions that minimizes the total discounted payoff. This problem is formulated as a Markov game and the value that is obtained as the min-max of the total expected discounted payoff starting from state $i$ is called the min-max value of state $i$ . The policies that achieve this min-max value are the optimal policies of the agents.

When the model information of the environment is known, a Bellman operator for the two-player zero-sum game [1] is constructed and a fixed point iteration scheme analogous to the value iteration is used to compute the min-max value. However, in most two-player zero-sum game settings, the model information is assumed unknown to the players and the objective is to compute optimal policies utilizing the state and payoff samples obtained from the environment.

In our work, we construct a modified min-max Q-Bellman operator by using the technique of successive relaxation for the Markov games and prove that the contraction factor is at most the contraction factor of the standard min-max Bellman operator. This implies that, when the model information is known, the min-max value can be computed faster using our proposed scheme. We then proceed to develop a generalized minimax Q-learning algorithm based on the modified min-max Q-Bellman operator.

The minimax Q-learning algorithm has been presented in [2]. Two-player general sum games are those where the payoffs of the agents are unrelated in general. If the payoff of an agent is the negative of the payoff of another agent, the game reduces to a zero-sum game. A Nash Q-learning algorithm for solving general sum games is proposed in [3]. In [4], Friend-or-Foe (FF) Q-learning for general sum games is proposed and is shown to have stronger convergence properties compared to Nash Q-learning. A generalization of Nash Q-learning and FF Q-learning, namely correlated Q-learning, is discussed in [5]. In [6], desirable properties for an agent learning in multi-agent scenarios are studied and a new learning algorithm namely “WoLF policy hill climbing” is proposed. Surveys of algorithms for multi-agent learning and multi-agent Reinforcement learning are provided in [7, 8].

We now discuss some variants of minimax Q-learning in the literature. In [9], the minimax TD-learning algorithm that utilizes the concept of temporal difference learning is proposed. The minimax version of the Deep Deterministic policy gradient algorithm has been recently developed in [10]. However, no convergence proofs or theoretical guarantees are provided.

The concept of successive relaxation in the context of Markov Decision Processes (MDPs) was first applied in [11]. In our recent work [12], we proposed successive over-relaxation Q-learning for model-free MDPs (i.e., in the single-agent scenario). In this work, we extend the concept of successive relaxation to two-player zero-sum games and propose a provably convergent generalized minimax Q-learning algorithm. The contributions of the paper are as follows:

•

We present a modified min-max Q-Bellman operator for two-player zero-sum Markov games and show that the operator is a max-norm contraction.

•

We show that, under some assumptions, the contraction factor of the modified min-max Q-Bellman operator is smaller than that of the standard min-max Q-Bellman operator.

•

We propose a model-free generalized minimax Q-learning algorithm and prove its almost sure convergence using ODE-based analysis of stochastic approximation, under an assumption on the boundedness of iterates.

•

We discuss an interesting relation between standard minimax Q-learning and our proposed algorithm.

•

Finally, through experimental evaluation, we show that our proposed algorithm has a better performance compared to the standard minimax Q-learning algorithm.

We note here that the Successive Over Relaxation (SOR) technique utilized to derive our algorithm and the stochastic approximation arguments employed in the convergence analysis are well-known in the literature. Our contribution consists of applying these techniques to derive and analyze a generalized minimax Q-learning algorithm that has faster convergence.

II Background and Preliminaries

In this paper, we consider the setting of two-player zero-sum Markov games [13]. The two players in the game are referred to as agent 1 and agent 2. A two-player zero-sum Markov game is characterized by the tuple $(S,U,V,p,r,\alpha)$, where $S$ is the set of states that both agents observe, $U$ is the finite set of actions of agent 1, $V$ is the finite set of actions of agent 2, and $p$ denotes the transition probability rule, i.e., $p(j|i,u,v)$ denotes the probability of transition to state $j\in S$ from state $i\in S$ when actions $u\in U$ and $v\in V$ are chosen by agents 1 and 2, respectively. Let $r(i,u,v)$ denote the single-stage payoff obtained by agent 1 in state $i$ when actions $u$ and $v$ are chosen by agents 1 and 2, respectively. Note that, in the case of a zero-sum Markov game, the payoff of agent 2 is the negative of the payoff obtained by agent 1. Also, $0\leq\alpha<1$ denotes the discount factor. The goals of the two agents in the Markov game are to individually learn the optimal policies $\pi_{1}:S\to\Delta^{|U|}$ and $\pi_{2}:S\to\Delta^{|V|}$, respectively, where $\Delta^{d}$ denotes the probability simplex in $\mathbb{R}^{d}$ and $\pi_{1}(i)$ (resp. $\pi_{2}(i)$) indicates the probability distribution over actions to be taken by agent 1 (resp. agent 2) in state $i$ that maximizes (resp. minimizes) the discounted objective given by:

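$$\min_{\pi_{2}}\max_{\pi_{1}}\sum_{t=0}^{\infty}\mathbb{E}\Big[\sum_{u=1}^{|U|}\sum_{v=1}^{|V|}\alpha^{t}x^{u}_{t}y^{v}_{t}r(s_{t},u,v)\mid s_{0}=i\Big],\tag{1}$$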

where $s_{t}$ is the state of the game at time $t$, $\pi_{1}(s_{t})=(x^{u}_{t})^{|U|}_{u=1}\in\Delta^{|U|}$, $\pi_{2}(s_{t})=(y^{v}_{t})^{|V|}_{v=1}\in\Delta^{|V|}$, $\forall t\geq 0$, and $\mathbb{E}[\cdot]$ is the expectation taken over the states obtained over time $t=1,\ldots,\infty$. Let $J^{*}(i)$ denote the min-max value in state $i$ obtained by solving (1). It can be shown ([14, Chapter 7]) that the min-max value function, $J^{*}$, satisfies the following fixed point equation in $J\in\mathbb{R}^{|S|}$, given by:

$$J(i)=\text{val}[Q(i)],\ \forall i\in S,\tag{2}$$

where $Q(i)$ is a matrix of size $|U|\times|V|$ , whose $(u,v)^{th}$ entry is given by $Q(i,u,v)=r(i,u,v)+\alpha\displaystyle\sum_{j\in S}p(j|i,u,v)J(j)$ and the function $\text{val}[A]$ , for a given $m\times n$ matrix $A$ , is defined as follows:

$$\text{val}[A]=\min_{y}\max_{x}x^{T}Ay,\tag{3}$$

where $x\in\Delta^{m}$ and $y\in\Delta^{n}$ , respectively. The system of equations in (2) can be rewritten as:

$$J=TJ,$$

where the operator $T$ , for a given $J\in\mathbb{R}^{|S|}$ , is defined as:

$$(TJ)(i)=\text{val}[Q(i)],\ \forall i\in S.$$

The operator $T$ and the set of equations (2) are analogous to the Bellman operator and the Bellman optimality condition, respectively, for Markov Decision Processes (MDPs) [14].
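To make the $\text{val}[\cdot]$ operator concrete, the following minimal Python sketch evaluates it exactly in the special case where agent 1 (the maximizer, indexing rows) has only two actions. It relies on the minimax theorem, which allows $\min_{y}\max_{x}$ to be computed as $\max_{x}\min_{y}$. The function name `matrix_game_value_two_rows` and the restriction to two rows are illustrative assumptions and are not part of the paper; a general $m\times n$ game would typically be solved with a linear program instead.

```python
def matrix_game_value_two_rows(A):
    """Value of a zero-sum matrix game with 2 rows (maximizer) and n columns (minimizer).

    Computes max_p min_j [p * A[0][j] + (1 - p) * A[1][j]], which equals val[A]
    by the minimax theorem. Hypothetical helper, for illustration only.
    """
    top, bottom = A
    n = len(top)

    def guaranteed(p):
        # Payoff the row player secures when row 0 is played with probability p.
        return min(p * top[j] + (1 - p) * bottom[j] for j in range(n))

    # The lower envelope of n lines is concave and piecewise linear, so its
    # maximum over [0, 1] is attained at an endpoint or where two lines cross.
    candidates = [0.0, 1.0]
    for j in range(n):
        for k in range(j + 1, n):
            slope_gap = (top[j] - bottom[j]) - (top[k] - bottom[k])
            if slope_gap != 0:
                p = (bottom[k] - bottom[j]) / slope_gap
                if 0.0 <= p <= 1.0:
                    candidates.append(p)
    return max(guaranteed(p) for p in candidates)


matching_pennies = [[1, -1], [-1, 1]]
print(matrix_game_value_two_rows(matching_pennies))
```

For matching pennies the optimal strategies are uniform and the value is zero; a matrix with a pure saddle point, such as `[[2, 3], [1, 4]]`, returns the saddle entry.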

III The Proposed Algorithm

We describe a single iteration of the synchronous version [15] of our proposed algorithm in Algorithm 1 below. At each iteration $n$, the Q-values $Q(i,u,v)$ of all state-action tuples are updated as shown in step 4 of Algorithm 1.

Remark 1.

Note that step 3 of Algorithm 1 requires the computation of $\text{val}[Q_{n}(\cdot)]$, which amounts to solving a linear program. Also observe that the generalized minimax Q-learning algorithm only requires an additional computation of $\text{val}[Q_{n}(i_{n})]$ compared to standard minimax Q-learning.

Remark 2.

Let $Q_{N}$ be the solution obtained by Algorithm 1 upon termination after $N$ iterations. Then the approximate min-max value, $\tilde{J}(i)$ for a given state $i$ is obtained as follows:

$$\tilde{J}(i)=\text{val}[Q_{N}(i)]$$

and the corresponding approximate policies of the agents are obtained as:

$$(\tilde{\pi}_{1}(i),\tilde{\pi}_{2}(i))\in\arg\text{val}[Q_{N}(i)].$$

IV Convergence Analysis

Let $\Delta^{d}:=\{x\in\mathbb{R}^{d}:x_{i}\geq 0,\sum^{d}_{i=1}x_{i}=1\}$ denote the probability simplex in $\mathbb{R}^{d}$ . For matrix $A\in\mathbb{R}^{m\times n},x\in\Delta^{m}$ and $y\in\Delta^{n}$ , recall that the value of the matrix $A$ is defined as $\text{val}[A]:=\displaystyle\min_{y\in\Delta^{n}}\displaystyle\max_{x\in\Delta^{m}}x^{T}Ay$ . Note that the norm considered in this section is the max-norm, i.e., norm of the vector $x\in\mathbb{R}^{d}$ is $\|x\|:=\displaystyle\max_{1\leq i\leq d}|x(i)|$ . We first derive a few properties of the $\text{val}[.]$ operator that would be used in the subsequent analysis.

Lemma 1.

Suppose $B=[b_{ij}],C=[c_{ij}]\in\mathbb{R}^{m\times n}$ , then

$$|\text{val}[B]-\text{val}[C]|\leq\max_{i,j}|b_{ij}-c_{ij}|=\|B-C\|$$

Proof.

[TABLE]

In the above, $x\in\Delta^{m}$ , and $y\in\Delta^{n}$ , respectively. Similarly,

[TABLE]

Therefore $|\text{val}[B]-\text{val}[C]|\leq\max_{i,j}|b_{ij}-c_{ij}|=\|B-C\|.$ Note the repeated application of the facts $\displaystyle\sum_{i,j}x_{i}y_{j}=1$ , $\sup{(a_{n}+b_{n})}\leq\sup{a_{n}}+\sup{b_{n}}$ in the arguments. This completes the proof. ∎
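Lemma 1 can be checked numerically on small games. The sketch below draws random $2\times 3$ payoff matrices and tests $|\text{val}[B]-\text{val}[C]|\leq\|B-C\|$ for each pair. The helper `game_value` (exact only for matrices with two rows), the matrix sizes, the seed, and the tolerance are all illustrative assumptions, not part of the paper.

```python
import random


def game_value(A):
    """val[A] for a 2 x n payoff matrix (rows: maximizer, columns: minimizer)."""
    top, bottom = A
    n = len(top)

    def guaranteed(p):
        return min(p * top[j] + (1 - p) * bottom[j] for j in range(n))

    candidates = [0.0, 1.0]
    for j in range(n):
        for k in range(j + 1, n):
            slope_gap = (top[j] - bottom[j]) - (top[k] - bottom[k])
            if slope_gap != 0:
                p = (bottom[k] - bottom[j]) / slope_gap
                if 0.0 <= p <= 1.0:
                    candidates.append(p)
    return max(guaranteed(p) for p in candidates)


def max_norm_gap(B, C):
    """||B - C||: the largest absolute entry-wise difference."""
    return max(abs(b - c) for row_b, row_c in zip(B, C) for b, c in zip(row_b, row_c))


def lemma1_holds(B, C, tol=1e-9):
    """Check |val[B] - val[C]| <= ||B - C|| up to a floating-point tolerance."""
    return abs(game_value(B) - game_value(C)) <= max_norm_gap(B, C) + tol


rng = random.Random(0)  # fixed seed so the check is reproducible


def random_matrix():
    return [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]


all_hold = all(lemma1_holds(random_matrix(), random_matrix()) for _ in range(200))
print(all_hold)
```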

Corollary 1.

Consider $B=[b_{ij}]\in\mathbb{R}^{m\times n}$ , then

$$|\text{val}[B]|\leq\|B\|.$$

Proof.

Using Lemma 1 with $C=0$ , we get:

[TABLE]

Lemma 2.

Let $E=[e_{ij}]_{m\times n}$, where $e_{ij}=1\ \forall i,j$. Then for constants $\beta,k\in\mathbb{R}$ and $A\in\mathbb{R}^{m\times n}$,

$$\text{val}[\beta A+kE]=\beta\,\text{val}[A]+k.$$

Proof.

By definition of the val operator, for $x\in\Delta^{m}$ , $y\in\Delta^{n}$ ,

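$$\text{val}[\beta A+kE]=\beta\min_{y}\max_{x}x^{T}Ay+k\ \ (\text{since }x^{T}Ey=1)\ =\beta\,\text{val}[A]+k.\qquad\blacksquare$$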
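As a quick sanity check of Lemma 2, consider the following worked instance with a non-negative scaling $\beta$, which is the regime in which the lemma is applied later (with $\beta=w>0$). The matrix $A$ and the constants $\beta=\tfrac12$, $k=2$ are made-up illustrative values, not taken from the paper.

```latex
\begin{aligned}
A &= \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}, \qquad
x^{*} = \left(\tfrac12, \tfrac12\right), \quad
y^{*} = \left(\tfrac14, \tfrac34\right)
\;\Longrightarrow\;
(x^{*})^{T} A = \left(\tfrac32, \tfrac32\right), \quad
A y^{*} = \left(\tfrac32, \tfrac32\right)^{T}, \quad
\text{val}[A] = \tfrac32 . \\[4pt]
\tfrac12 A + 2E &= \begin{pmatrix} 3.5 & 2.5 \\ 2 & 3 \end{pmatrix}
\;\Longrightarrow\;
(x^{*})^{T}\left(\tfrac12 A + 2E\right) = \left(\tfrac{11}{4}, \tfrac{11}{4}\right), \quad
\left(\tfrac12 A + 2E\right) y^{*} = \left(\tfrac{11}{4}, \tfrac{11}{4}\right)^{T}, \\[4pt]
\text{val}\left[\tfrac12 A + 2E\right] &= \tfrac{11}{4} = \tfrac12 \cdot \tfrac32 + 2 .
\end{aligned}
```

The same pair $(x^{*},y^{*})$ equalizes both games, which reflects the fact that positive scaling and a constant shift leave the optimal strategies unchanged.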

Recall that for a given stochastic game $(S,U,V,p,r,\alpha)$ the min-max value function $J^{*}$ satisfies [16] the system of equations,

$$J(i)=\text{val}[Q(i)],\ \forall i\in S,$$

where $Q(i)$ is a $|U|\times|V|$ matrix with $(u,v)^{\text{th}}$ entry

$$Q(i,u,v)=r(i,u,v)+\alpha\sum_{j\in S}p(j|i,u,v)J(j),$$

and the system of equations can be reformulated as the fixed point equation, $TJ=J$ , with $T$ being a contraction under the max-norm with contraction factor $\alpha$ .

We define a quantity $w^{*}$ as follows:

$$w^{*}\triangleq\min_{i,u,v}\Bigg\{\frac{1}{1-\alpha p(i|i,u,v)}\Bigg\}.\tag{6}$$

As the probabilities $p(i|i,u,v)\geq 0,\ \forall(i,u,v)$, it is clear that $w^{*}\geq 1$. For $0<w\leq w^{*}$, we now define a modified operator $T_{w}:\mathbb{R}^{|S|}\rightarrow\mathbb{R}^{|S|}$ as follows [11]:

$$(T_{w}J)(i)=w\,(TJ)(i)+(1-w)J(i),$$

where $w$ represents a prescribed relaxation factor. Note that $T_{w}$ is in general not a convex combination of $T$ and the identity operator $I$ since we allow $w\geq 1$ as $w^{*}\geq 1$ (see above). Let $J^{*}$ denote the min-max value of the Markov game. Therefore, $TJ^{*}=J^{*}$ . Now,

$$(T_{w}J^{*})(i)=wJ^{*}(i)+(1-w)J^{*}(i)=J^{*}(i).$$

Therefore, the min-max value function $J^{*}$ is also a fixed point of $T_{w}.$
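The effect of the relaxation can be illustrated on a toy game. The sketch below computes $w^{*}$ from the self-transition probabilities and runs the iteration $J\leftarrow w\,(TJ)+(1-w)J$ for $w=1$ and $w=w^{*}$. The two-state, two-action game, its payoffs and transition probabilities, and the helper names are made-up assumptions for illustration; `game_value` is exact only for matrices with two rows.

```python
def game_value(A):
    """val[A] for a 2 x n payoff matrix (rows: maximizer, columns: minimizer)."""
    top, bottom = A
    n = len(top)

    def guaranteed(p):
        return min(p * top[j] + (1 - p) * bottom[j] for j in range(n))

    candidates = [0.0, 1.0]
    for j in range(n):
        for k in range(j + 1, n):
            slope_gap = (top[j] - bottom[j]) - (top[k] - bottom[k])
            if slope_gap != 0:
                p = (bottom[k] - bottom[j]) / slope_gap
                if 0.0 <= p <= 1.0:
                    candidates.append(p)
    return max(guaranteed(p) for p in candidates)


def relaxation_bound(p, alpha):
    """w* = min over (i, u, v) of 1 / (1 - alpha * p(i | i, u, v))."""
    return min(
        1.0 / (1.0 - alpha * p[i][u][v][i])
        for i in range(len(p))
        for u in range(len(p[i]))
        for v in range(len(p[i][u]))
    )


def relaxed_value_iteration(p, r, alpha, w, sweeps):
    """Apply J <- w * (T J) + (1 - w) * J for a fixed number of sweeps, from J = 0."""
    n_states = len(p)
    J = [0.0] * n_states
    for _ in range(sweeps):
        TJ = []
        for i in range(n_states):
            Q_i = [
                [r[i][u][v] + alpha * sum(p[i][u][v][j] * J[j] for j in range(n_states))
                 for v in range(len(r[i][u]))]
                for u in range(len(r[i]))
            ]
            TJ.append(game_value(Q_i))
        J = [w * TJ[i] + (1.0 - w) * J[i] for i in range(n_states)]
    return J


# Toy game (made-up numbers): p[i][u][v][j] = p(j | i, u, v), r[i][u][v] = payoff.
p = [
    [[[0.6, 0.4], [0.5, 0.5]], [[0.7, 0.3], [0.5, 0.5]]],
    [[[0.5, 0.5], [0.2, 0.8]], [[0.4, 0.6], [0.5, 0.5]]],
]
r = [
    [[1.0, 3.0], [2.0, 0.5]],
    [[0.5, 2.0], [1.5, 1.0]],
]
alpha = 0.9

w_star = relaxation_bound(p, alpha)
J_ref = relaxed_value_iteration(p, r, alpha, 1.0, 500)  # near-converged reference


def error_after(w, sweeps):
    J = relaxed_value_iteration(p, r, alpha, w, sweeps)
    return max(abs(a - b) for a, b in zip(J, J_ref))


err_standard = error_after(1.0, 20)
err_relaxed = error_after(w_star, 20)
print(w_star, err_standard, err_relaxed)
```

In this toy game the smallest self-transition probability is $0.5$, so $w^{*}=1/0.55$, and both choices of $w$ approach the same fixed point, with the relaxed iteration closer to it after the same number of sweeps.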

Next, we derive a modified min-max Q-Bellman operator for the two-player zero-sum game. Let $Q^{\dagger}(i,u,v)$ be defined as follows:

[TABLE]

Now let

$$Q^{*}(i,u,v)=\bigg(r(i,u,v)+\alpha\sum^{M}_{j=1}p(j|i,u,v)J^{*}(j)\bigg)$$

Let $E=[e_{ij}]_{m\times n}$ with $e_{ij}=1,\ \forall i,j$. Then,

[TABLE]

Hence the equation (8) can be rewritten as follows:

[TABLE]

Let $H_{w}:\mathbb{R}^{|S\times U\times V|}\rightarrow\mathbb{R}^{|S\times U\times V|}$ be defined as follows. For $Q\in\mathbb{R}^{|S\times U\times V|}$ ,

[TABLE]

$H_{w}$ is the modified Q-Bellman operator for the two-player zero-sum Markov game.

Lemma 3.

For $0<w\leq w^{*}$ with $w^{*}$ as in (6), the map $H_{w}:\mathbb{R}^{|S\times U\times V|}\rightarrow\mathbb{R}^{|S\times U\times V|}$ is a max-norm contraction and $Q^{\dagger}$ is the unique fixed point of $H_{w}$ .

Proof.

From equation (IV), $Q^{\dagger}$ is a fixed point of $H_{w}$ . Therefore, it is enough to show that $H_{w}$ is a contraction operator (which will also ensure its uniqueness). For $P,Q\in\mathbb{R}^{|S\times U\times V|}$ , we have

[TABLE]

Since the RHS is not a function of $(i,u,v)$ , we have

[TABLE]

Note the use of the assumption $0<w\leq w^{*}$ (with $w^{*}$ as in (6)) in equation (10) that ensures that the term $\big{(}1-w+w\alpha p(i|i,u,v)\big{)}\geq 0$ , to arrive at equation (11). Also equation (12) is obtained by an application of Lemma 1 in equation (11). From the assumptions on $w$ and discount factor $\alpha$ , it is clear that $0\leq(w\alpha+1-w)<1$ . Therefore $H_{w}$ is a max-norm contraction with contraction factor $(w\alpha+1-w)$ and $Q^{\dagger}$ is its unique fixed point. ∎

Lemma 4.

$T_{w}$ is a contraction with contraction factor $(1-w+w\alpha)$.

Proof.

The proof is analogous to the proof of the Lemma 3. ∎

Lemma 5.

For $1\leq w\leq w^{*}$ , the contraction factor for the map $H_{w},$

$$1-w+\alpha w\leq\alpha.$$

Proof.

For $1\leq w\leq w^{*}$ , define $f(w)=1-w+\alpha w.$ Let $w_{1}<w_{2}.$ Then $f(w_{1})=1-w_{1}(1-\alpha)>1-w_{2}(1-\alpha)=f(w_{2}).$ Hence $f$ is decreasing. In particular, for $w\in[1,w^{*}]$ , $1-w+\alpha w=f(w)\leq f(1)=\alpha.$ This shows that, if $w^{*}>1$ and $w$ is chosen such that $1<w\leq w^{*}$ , the contraction factor is strictly smaller than $\alpha$ . ∎
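To give a sense of scale, here is a worked numerical instance of Lemma 5. The values $\alpha=0.9$ and $\min_{i,u,v}p(i|i,u,v)=0.5$ are made-up illustrative numbers, and the iteration counts refer only to the contraction-factor upper bound on the error, not to measured behaviour.

```latex
\begin{aligned}
w^{*} &= \frac{1}{1-\alpha\,\min_{i,u,v}p(i|i,u,v)} = \frac{1}{1-0.9\times 0.5} = \frac{1}{0.55}\approx 1.818, \\[4pt]
1-w^{*}+\alpha w^{*} &= 1-w^{*}(1-\alpha) = 1-\frac{0.1}{0.55}\approx 0.818 \;<\; 0.9=\alpha, \\[4pt]
\left\lceil \frac{\ln 10^{-3}}{\ln 0.9}\right\rceil &= 66
\qquad\text{versus}\qquad
\left\lceil \frac{\ln 10^{-3}}{\ln 0.818}\right\rceil = 35 .
\end{aligned}
```

Under these assumed numbers, the bound on the error shrinks by a factor of $10^{-3}$ in roughly half as many applications of the operator when $w=w^{*}$ is used instead of $w=1$.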

Remark 3.

Depending on the choice of $w$, the following observations can be made about our proposed generalized minimax Q-learning algorithm (refer to Algorithm 1).

•

Case I ($w=1$): The generalized minimax Q-learning reduces to standard minimax Q-learning.

•

Case II ($w<1$): The contraction factor of $H_{w}$ in this case satisfies $1-w+\alpha w>\alpha$, giving rise to a minimax Q-learning algorithm with slower convergence.

•

Case III ($w>1$): For this choice of $w$, it is required that $p(i|i,u,v)>0,\ \forall(i,u,v)$ (refer to equation (6)). Under this condition, as shown in Lemma 5, the contraction factor of $H_{w}$ satisfies $(1-w+\alpha w)<\alpha$, giving rise to a faster minimax Q-learning algorithm.

Lemma 6.

Let $Q^{*}(i,u,v)=r(i,u,v)+\alpha\displaystyle\sum_{j\in S}p(j|i,u,v)J^{*}(j)$ and $Q^{\dagger}$ be the fixed point of $H_{w}$ . Then for all $(i,u,v)\in S\times U\times V$ ,

[TABLE]

Moreover, $\text{val}[Q^{\dagger}(i)]=\text{val}[Q^{*}(i)]$ $\forall i\in S$ .

Proof.

By the hypothesis on $Q^{*}$, $\text{val}[Q^{*}(i)]=TJ^{*}(i)=J^{*}(i)=T_{w}J^{*}(i)=\text{val}[Q^{\dagger}(i)]\ \forall i\in S$. Since $Q^{\dagger}$ is the fixed point of $H_{w}$, we have

[TABLE]

Therefore

[TABLE]

This completes the proof. ∎

This Lemma is an interesting and important result in our paper. It shows that, even if the standard minimax Q-value iterates and generalized minimax Q-value iterates are not the same for all $(i,u,v)$ tuples, the min-max values at each state given by both the algorithms are equal. Therefore, this lemma states that generalized minimax Q-value iteration computes the min-max value function, which is the goal of the two-player zero-sum Markov game. We now show the convergence of generalized minimax Q-learning (refer Algorithm 1). For this purpose, we first state the following result (Proposition 4.5 on page 157 of [14]) and apply it to show the convergence of our proposed algorithm. We consider $\gamma_{n}(i)$ to be deterministic as with our algorithm, unlike [14] where these are allowed to be random.

Theorem 1.

Let $\{r_{n}\}$ be the sequence generated by the iteration

[TABLE]

•

The step-sizes $\gamma_{n}(i)$ are non-negative and satisfy

[TABLE]

•

The noise terms $N_{n}(i)$ satisfy

–

For every $i$ and $n$ , we have $\mathbb{E}[N_{n}(i)|\mathcal{F}_{n}]=0,$ where

[TABLE]

–

Given any norm $\|.\|$ on $\mathbb{R}^{d}$ , there exist constants $C$ and $D$ such that

[TABLE]

•

The mapping $F:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a max-norm contraction.

Then, $r_{n}$ converges to $r^{*},$ the unique fixed point of $F$ , with probability 1.

Theorem 2.

Given a finite state-action two-player zero-sum Markov game $(S,U,V,p,r,\alpha)$ with bounded payoffs, i.e., $|r(i,u,v)|\leq R<\infty,\ \forall\ (i,u,v)\in S\times U\times V$, the generalized minimax Q-learning algorithm (see Algorithm 1) given by the update rule:

[TABLE]

converges with probability 1 to $Q^{\dagger}(i,u,v)$ as long as

[TABLE]

for all $(i,u,v)\in S\times U\times V$ .

Proof.

The update rule of the algorithm is given by

[TABLE]

Let $\mathcal{F}_{n}=\sigma\big{(}\{Q_{0},Y_{j},\forall j<n\}\big{)},n\geq 0$ be the associated filtration. Now observe that $Y_{n}(i,u,v)\sim p(.|i,u,v).$ Also, given $(i,u,v)$ , assume that the random variables $Y_{n}(i,u,v),n\geq 0$ are independent. Then the above equation can be rewritten as:

[TABLE]

where

[TABLE]

and

[TABLE]

Now note, from Lemma 3, that the mapping $H_{w}$ is a max-norm contraction. Also, by the definition of $N_{n}$ , we have that $N_{n}$ is $\mathcal{F}_{n+1}-$ measurable $\forall n$ . Further,

[TABLE]

Finally, as $Y_{n}$ is independent of $\mathcal{F}_{n}$ , we have

[TABLE]

where $C=3w^{2}R^{2}$ and $D=3\big{(}\alpha^{2}w^{2}+(1-w)^{2}\big{)}.$ Here the first inequality follows from the fact:

[TABLE]

The second inequality follows from the following facts:

[TABLE]

Therefore by Theorem 1, with probability 1, the generalized minimax Q-learning iterates $Q_{n}$ converge. By virtue of Lemma 6, our proposed minimax Q-learning algorithm computes a policy whose value is the min-max value of the Markov game. ∎
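A single sampled update can be sketched in Python as follows. The display form of the update rule did not survive in this copy of the text, so the target used here, $w\big(r+\alpha\,\text{val}[Q_{n}(Y_{n})]\big)+(1-w)\,\text{val}[Q_{n}(i)]$, is an assumed reconstruction that mirrors the relaxed operator $T_{w}$ and Remark 1; it should not be read as the authors' exact pseudocode. The function names and the numbers in the example are likewise hypothetical, and `game_value` is exact only for matrices with two rows.

```python
def game_value(A):
    """val[A] for a 2 x n payoff matrix (rows: maximizer, columns: minimizer)."""
    top, bottom = A
    n = len(top)

    def guaranteed(p):
        return min(p * top[j] + (1 - p) * bottom[j] for j in range(n))

    candidates = [0.0, 1.0]
    for j in range(n):
        for k in range(j + 1, n):
            slope_gap = (top[j] - bottom[j]) - (top[k] - bottom[k])
            if slope_gap != 0:
                p = (bottom[k] - bottom[j]) / slope_gap
                if 0.0 <= p <= 1.0:
                    candidates.append(p)
    return max(guaranteed(p) for p in candidates)


def generalized_minimax_q_step(Q, i, u, v, reward, next_state, step_size, alpha, w):
    """Return the updated entry Q(i, u, v) after one sampled transition.

    Assumed update (a reconstruction, see the text above):
        target = w * (reward + alpha * val[Q(next_state)]) + (1 - w) * val[Q(i)]
        Q(i, u, v) <- (1 - step_size) * Q(i, u, v) + step_size * target
    """
    target = w * (reward + alpha * game_value(Q[next_state])) + (1.0 - w) * game_value(Q[i])
    return (1.0 - step_size) * Q[i][u][v] + step_size * target


# Q[state] is the 2 x 2 matrix of Q-values for that state (made-up numbers).
Q = [
    [[1.0, -1.0], [-1.0, 1.0]],  # val = 0
    [[3.0, 1.0], [0.0, 2.0]],    # val = 1.5
]
relaxed = generalized_minimax_q_step(
    Q, 0, 0, 0, reward=1.0, next_state=1, step_size=0.5, alpha=0.9, w=1.2
)
standard = generalized_minimax_q_step(
    Q, 0, 0, 0, reward=1.0, next_state=1, step_size=0.5, alpha=0.9, w=1.0
)
print(relaxed, standard)
```

Setting `w=1.0` removes the extra $\text{val}[Q_{n}(i)]$ term and recovers the standard minimax Q-learning update, in line with Case I of Remark 3.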

IV-A Extension to the asynchronous setting

In the setting considered above, the updates are synchronous, i.e., Q-values of all state-action pairs are updated at every iteration. However, in the case of online settings, only a single sample is obtained through the interaction with the environment. In the following, we describe the convergence of our algorithm in the asynchronous settings. The following assumption on the structure of probability transition matrix $p$ and the control policies [15, Page 130] is necessary in the asynchronous setting:

Assumption 1.

The Markov chain induced by all the control policies is ergodic. Moreover, under each policy, every action can be picked with a positive probability in any state.

The latter requirement in Assumption 1 is satisfied, for instance, by policies such as $\epsilon$-greedy; see [17]. We first state a result from [18, Theorem 3] and apply it to show the convergence of our proposed algorithm. Let $T^{i}$ be an infinite subset of $\mathbb{N}$ and let $\{r_{n}\}\subset\mathbb{R}^{m}$ be the sequence generated by the iteration

[TABLE]

Here, $r_{n}(i)$ is a vector of possibly outdated components of $r$ . In particular, we let

[TABLE]

where each $\tau^{i}_{j}(n)$ is an integer satisfying $0\leq\tau^{i}_{j}(n)\leq n$ representing the delay in information about component $j$ available while updating component $i$ at time $n$. If $\tau^{i}_{j}(n)=n,\ \forall i,j$, then this reduces to the synchronous setting.

Let $\mathcal{F}_{n}=\sigma\big\{r_{0}(i),\cdots,r_{n}(i),\gamma_{0}(i),\cdots,\gamma_{n}(i),\tau^{i}_{j}(0),\cdots,\tau^{i}_{j}(n),N_{0}(i),\cdots,N_{n-1}(i),\ 1\leq i,j\leq m\big\}$. It is important to note from the construction of $\{\mathcal{F}_{n}\}$ that the step-size sequences $\gamma_{n}(i)$ are in general allowed to be random. Thus, the component to be updated at time $n$ can be decided online based on the history until time $n$.

Assumption 2.

For any $i$ and $j$ , $\lim_{n\rightarrow\infty}\tau^{i}_{j}(n)=\infty,$ with probability 1.

Assumption 3.

For every $i$ and $n$ , $N_{n}(i)$ is $\mathcal{F}_{n+1}-$ measurable and $\mathbb{E}[N_{n}(i)|\mathcal{F}_{n}]=0$ .

Assumption 4.

$\mathbb{E}[N_{n}^{2}(i)|\mathcal{F}_{n}]\leq C+D\max_{j}\max_{\tau\leq n}|r_{\tau}(j)|^{2},\ \forall i,n.$

Assumption 5.

The step-sizes $\gamma_{n}(i)$ are non-negative and satisfy

[TABLE]

Assumption 6.

There exists a vector $r^{*}$ , a positive vector $v$ , a scalar $\beta\in[0,1)$ , such that

[TABLE]

Theorem 3.

Under Assumptions 2-6, $r_{n}$ converges to $r^{*},$ the unique fixed point of $F$ , with probability 1.

Theorem 4.

Consider a finite state-action two-player zero-sum Markov game $(S,U,V,p,r,\alpha)$ with bounded payoffs, i.e., $|r(i,u,v)|\leq R<\infty,\ \forall\ (i,u,v)\in S\times U\times V$. Let the sample at iteration $n$ be $(i_{n},u_{n},v_{n},Y_{n}(i_{n},u_{n},v_{n}))$. Then, under Assumption 1, the asynchronous generalized minimax Q-learning algorithm given by the update rule:

[TABLE]

converges with probability 1 to $Q^{\dagger}(i,u,v)$ for all $(i,u,v)\in S\times U\times V$ .

Proof.

Since there is no delay in information during training, $\tau_{j}^{i}(n)=n,\ \forall i,j$, and Assumption 2 is trivially satisfied. Assumptions 3 and 4 are shown in (16) and (17), respectively. In order for Assumption 5 to be true, all state and action pairs have to be visited infinitely often, which is ensured through Assumption 1. Finally, from Lemma 3,

[TABLE]

However, as $Q^{\dagger}$ is the unique fixed point, we have,

[TABLE]

thereby proving Assumption 6. Therefore, by Theorem 3, with probability 1, the asynchronous generalized minimax Q-learning iterates $Q_{n}$ converge to $Q^{\dagger}$ . ∎

V Relation between Generalized Minimax Q-learning and standard Minimax Q-learning

In this section, we describe the relation between our proposed Generalized Minimax Q-learning and standard Minimax Q-learning algorithms. For the given two-player zero-sum Markov game $(S,U,V,p,r,\alpha)$ , we construct a new game $(\bar{S},\bar{U},\bar{V},q,\bar{r},\bar{\alpha})$ as follows:

•

$\bar{S}=S,\bar{U}=U,\bar{V}=V$

•

$\bar{r}=wr,\bar{\alpha}=(1-w+\alpha w)$ and for a given $(i,u,v)$ , let $q(.|i,u,v):S\rightarrow[0,1]$ be defined as

[TABLE]

where $0<w\leq w^{*}$ . We note that $q(.|i,u,v)$ is a probability mass function on $\bar{S}$ .
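The construction of $q$ can be checked numerically. The display defining $q$ did not survive in this copy of the text, so the sketch below assumes the form $q(j|i,u,v)=\big(w\alpha\,p(j|i,u,v)+(1-w)\mathbb{1}\{j=i\}\big)/(1-w+\alpha w)$, which is the choice that makes a standard minimax Q-Bellman operator with payoff $wr$ and discount $1-w+\alpha w$ reproduce a relaxed target of the form $w(r+\alpha\sum_{j}p(j|i,u,v)\,\cdot)+(1-w)(\cdot)$. This is a reconstruction under that assumption, and the function name and numbers are hypothetical.

```python
def relaxed_transition(p_row, i, alpha, w):
    """q(. | i, u, v) built from p(. | i, u, v) for a fixed (i, u, v).

    Assumed form: q(j) = (w * alpha * p(j) + (1 - w) * [j == i]) / (1 - w + alpha * w).
    """
    scale = 1.0 - w + alpha * w
    return [
        (w * alpha * p_j + ((1.0 - w) if j == i else 0.0)) / scale
        for j, p_j in enumerate(p_row)
    ]


p_row = [0.6, 0.4]   # made-up p(. | i = 0, u, v)
alpha, w = 0.9, 1.5  # w <= 1 / (1 - 0.9 * 0.6), so q stays non-negative
q_row = relaxed_transition(p_row, 0, alpha, w)

# Both sides of the identity that makes the two Q-Bellman operators coincide,
# evaluated for an arbitrary vector V standing in for val[Q(j)].
V = [2.0, -1.0]
lhs = (1.0 - w + alpha * w) * sum(q * x for q, x in zip(q_row, V))
rhs = w * alpha * sum(p * x for p, x in zip(p_row, V)) + (1.0 - w) * V[0]
print(q_row, lhs, rhs)
```

The entries of `q_row` are non-negative and sum to one precisely because $w\leq w^{*}$ keeps $1-w+w\alpha\,p(i|i,u,v)\geq 0$; with `w = 1.0` the function returns `p_row` unchanged up to rounding.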

Now consider the standard minimax $Q$-Bellman operator $\bar{H}$ for this game given by $\bar{H}:\mathbb{R}^{\bar{S}\times\bar{U}\times\bar{V}}\rightarrow\mathbb{R}^{\bar{S}\times\bar{U}\times\bar{V}}$ and $\bar{H}Q(i,u,v)=wr(i,u,v)+(1-w+\alpha w)\displaystyle\sum_{j\in\bar{S}}q(j|i,u,v)\text{val}[Q(j)]$, where $Q(j)$ is a $|\bar{U}|\times|\bar{V}|$ dimensional matrix with $(u,v)^{\text{th}}$ entry $Q(j,u,v)$, and $\text{val}[Q(j)]$ is given by equation (3). Note that

[TABLE]

Hence the $\bar{H}$ operator of the game $(\bar{S},\bar{U},\bar{V},q,\bar{r},\bar{\alpha})$ is the same as the $H_{w}$ operator defined for the game $(S,U,V,p,r,\alpha)$. Let us consider an iteration of the minimax $Q$-learning algorithm on $(\bar{S},\bar{U},\bar{V},q,\bar{r},\bar{\alpha})$ given by

[TABLE]

where $\gamma_{n},\ n\geq 0,$ is the step-size sequence, $\bar{Y}_{n}(i,u,v)\sim q(\cdot|i,u,v)$, $\bar{N}_{n}(i,u,v)=\Big(wr(i,u,v)+(1-w+w\alpha)\text{val}[\bar{Q}_{n}(\bar{Y}_{n}(i,u,v))]\Big)-\bar{H}\bar{Q}_{n}(i,u,v)$, and compare it with an iteration of generalized minimax Q-learning. Since $\bar{H}=H_{w}$, both algorithms converge to $Q^{\dagger}$, the fixed point of $H_{w}$, and differ only in the per-iterate noise terms $\bar{N}_{n}$ and $N_{n}$.

Lemma 7.

Suppose $\{Q_{n}\}$ are the iterates of Generalized minimax Q-learning. Then given any $\epsilon>0$ there exists a natural number $N$ that is possibly sample path dependent, such that $\|Q_{n}\|\leq\frac{R}{1-\alpha}+\epsilon$ , for $n>N$ almost surely.

Proof.

Consider the iterates $\bar{Q}_{n}$ of the minimax Q-learning algorithm with respect to the stochastic game $(\bar{S},\bar{U},\bar{V},q,\bar{r},\bar{\alpha})$ with initial point $\|\bar{Q}_{0}\|\leq\frac{R}{1-\alpha}$ . Now assume that $\|\bar{Q}_{n}\|\leq\frac{R}{1-\alpha}$ (induction hypothesis). Then

[TABLE]

Therefore by induction $\|\bar{Q}_{n}\|\leq\frac{R}{1-\alpha},\forall n\geq 0$ . As the sequences $\{\bar{Q}_{n}\}$ and $\{Q_{n}\}$ converge to $Q^{\dagger}$ , given $\epsilon>0$ there exists a natural number $N$ such that $\|Q_{n}-\bar{Q}_{n}\|\leq\epsilon\implies\|Q_{n}\|\leq\frac{R}{1-\alpha}+\epsilon,\forall n>N$ . Moreover $\|Q^{\dagger}\|\leq\frac{R}{1-\alpha}$ . To conclude we have $\|Q_{n}\|\leq\frac{R}{1-\alpha}+\epsilon$ almost surely and $N$ here is possibly sample path dependent. This completes the proof. ∎

Remark 4.

We invoke the standard minimax Q-learning algorithm on $(\bar{S},\bar{U},\bar{V},q,\bar{r},\bar{\alpha})$ with the initial point $\bar{Q}_{0}$ chosen such that $\|\bar{Q}_{0}\|\leq\frac{R}{1-\alpha}$ to prove Lemma 7. It is also possible to obtain the same conclusion by directly utilizing the convergence of the iterates of the standard minimax Q-learning algorithm on $(\bar{S},\bar{U},\bar{V},q,\bar{r},\bar{\alpha})$ for an arbitrary initial point $\bar{Q}_{0}$.

VI Model-free Generalised Minimax Q-learning

Note that an input to Algorithm 1 is the relaxation parameter $w\leq w^{*}$, where $w^{*}$ is defined in (6). As $w^{*}$ depends on the transition probability function $p$, it is not possible to choose a valid $w>1$ in experiments where we do not have access to the transition probability function. In this section, we describe a synchronous version of the model-free generalised minimax Q-learning procedure that mitigates the dependency on the model information.

We maintain a count value $C_{n}[i][j][u][v],\ \forall i,j\in S,u\in U,v\in V$ (initialised to zero $\forall i,j,u,v$) that represents the number of times the sample $(i,u,v,j)$ has been encountered until iteration $n$. We define

[TABLE]

with $p^{\prime}_{0}(j|i,u,v)=0,\ \forall i,j,u,v$. It is easy to see that

$$p^{\prime}_{n}(j|i,u,v)\to p(j|i,u,v),\quad\forall i,j,u,v, \tag{19}$$

as $n\to\infty$, almost surely (from the Strong Law of Large Numbers).
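The count-based estimate can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the synchronous setting in which every triple $(i,u,v)$ receives exactly one sampled next state per iteration, so that the estimate is the count divided by the number of iterations $n$. The toy transition function `p`, the helper `sample_next_state`, and the problem sizes are hypothetical.

```python
import random

def sample_next_state(probs, rng):
    """Draw an index j with probability probs[j] (hypothetical helper)."""
    x, acc = rng.random(), 0.0
    for j, pj in enumerate(probs):
        acc += pj
        if x < acc:
            return j
    return len(probs) - 1  # guard against floating-point round-off

def estimate_transitions(p, n_iters, seed=0):
    """Return empirical estimates p_hat[(i, u, v)][j] = count / n_iters.

    p maps each (i, u, v) to a list of next-state probabilities.
    Assumes one sample per (i, u, v) per iteration (synchronous updates).
    """
    rng = random.Random(seed)
    counts = {key: [0] * len(probs) for key, probs in p.items()}
    for _ in range(n_iters):
        for key, probs in p.items():
            counts[key][sample_next_state(probs, rng)] += 1
    return {key: [c / n_iters for c in cs] for key, cs in counts.items()}

# Toy model: 2 states, 1 action per player (illustrative numbers only).
p = {(0, 0, 0): [0.7, 0.3], (1, 0, 0): [0.4, 0.6]}
p_hat = estimate_transitions(p, n_iters=20000)
max_gap = max(abs(p_hat[k][j] - p[k][j]) for k in p for j in range(2))
```

With enough iterations, `max_gap` shrinks toward zero, which is the behaviour the Strong Law of Large Numbers guarantees for each entry of the estimate.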

Now, we propose our model-free “generalised minimax Q-learning” algorithm by modifying Step 3 of Algorithm 1 as:

[TABLE]

where the sequence $\{w_{n},\ n\geq 1\}$ is updated as:

[TABLE]

with $w_{0}\in[1,\frac{1}{1-\alpha}]$ .
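The relaxation-parameter recursion can be illustrated with a short sketch. Since the display equations are not reproduced here, the update form is an assumption: $w_{n+1}=w_{n}+a(n)\big(\frac{1}{1-\alpha\min_{i,u,v}p^{\prime}_{n}(i|i,u,v)}-w_{n}\big)$ with the illustrative step size $a(n)=1/(n+1)$, chosen to be consistent with the mean field $w^{*}-w$ and the error term $\epsilon_{n}$ used in the convergence analysis. The self-loop probabilities and the shrinking estimation error below are hypothetical.

```python
def relaxation_target(self_loop_estimates, alpha):
    """Plug-in value 1 / (1 - alpha * min_{i,u,v} p'_n(i | i, u, v))."""
    return 1.0 / (1.0 - alpha * min(self_loop_estimates))

def update_w(w, self_loop_estimates, alpha, step):
    """One assumed averaging step toward the plug-in target."""
    return w + step * (relaxation_target(self_loop_estimates, alpha) - w)

alpha = 0.6
true_self_loops = [0.3, 0.5, 0.2]      # hypothetical p(i | i, u, v) values
w_star = relaxation_target(true_self_loops, alpha)

w = 1.0                                 # w_0 in [1, 1 / (1 - alpha)]
history = [w]
for n in range(1, 5001):
    # Hypothetical estimates whose error shrinks like 1/n, mimicking p'_n -> p.
    estimates = [max(0.0, q - 0.2 / n) for q in true_self_loops]
    w = update_w(w, estimates, alpha, step=1.0 / (n + 1))
    history.append(w)
```

Because every target lies in $[1,\frac{1}{1-\alpha}]$ and each step is a convex combination, the sketched iterates never leave that interval, mirroring the boundedness of $\{w_{n}\}$ used later.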

VI-A Convergence Analysis:

We write the two update equations as follows:

[TABLE]

The function $h$ is defined as:

[TABLE]

The sequence $\{M_{n}\}$ defined as:

[TABLE]

is a martingale difference noise sequence with respect to the increasing $\sigma$-fields $\mathcal{F}_{n}:=\sigma(Q_{0},w_{0},M_{0},\ldots,w_{n},M_{n})$, $n\geq 0$, satisfying

[TABLE]

where $K=\max\Big\{3\Big(\displaystyle\max_{i,u,v}|r(i,u,v)|\Big)^{2},\frac{6\alpha^{2}}{(1-\alpha)^{2}}\Big\}$. The function $g$ is defined as:

[TABLE]

Finally, $\epsilon_{n}=\frac{1}{1-\alpha\displaystyle\min_{i,u,v}p^{\prime}_{n}(i|i,u,v)}-\frac{1}{1-\alpha\displaystyle\min_{i,u,v}p(i|i,u,v)}$, where $p^{\prime}_{n}$ is updated as shown in equation (18). Note that, from (19), we get

$$\epsilon_{n}\to 0\ \text{ as } n\to\infty,\ \text{almost surely}.$$

Notice from (22)-(23) that the $Q_{n}$-recursion in (22) depends on the $w_{n}$-update in (23), while the latter is an independent update that does not depend on $Q_{n}$. Let $Q_{w^{*}}^{\dagger}$ be the (unique) fixed point of $H_{w^{*}}$. Note that, from (21), $w_{n}\in\Big[1,\displaystyle\frac{1}{1-\alpha}\Big],\ \forall n\geq 0$. Therefore, the iterates $\{w_{n},\ n\geq 1\}$ are bounded. We now make an assumption on the boundedness of the $\{Q_{n}\}$ iterates.

Assumption 7.

$\|Q_{n}\|\leq B<\infty,\ \forall n\geq 0$.

In practice, the iterates $\{Q_{n}\}$ will satisfy Assumption 7 if they are projected onto a prescribed compact set $\Omega$ whenever they exit it; see, for instance, [19, Chapter 5] for a general setting of projected stochastic approximation. From Lemma 7, the solution satisfies $\|Q^{\dagger}_{w^{*}}\|\leq\frac{R}{1-\alpha}$. Therefore, we can choose the set $\Omega$ such that $\|\Omega\|:=\displaystyle\max_{x\in\Omega}\|x\|>\frac{R}{1-\alpha}$.

Lemma 8.

Functions $h(Q,w)$ and $g(w)$ are Lipschitz.

Proof.

Consider $p,q\in[-B,B]^{|S|\times|U|\times|V|}$ and $w_{1},w_{2}\in[1,\frac{1}{1-\alpha}]$ . Let $R=\displaystyle\max_{i,u,v}|r(i,u,v)|$ . Then,

[TABLE]

Hence, $\|h(p,w_{1})-h(q,w_{2})\|\leq L(\|p-q\|+|w_{1}-w_{2}|)$ , where $L=\displaystyle\max\left\{\frac{1+\alpha}{1-\alpha},R+2B\right\}$ . Finally, $|g(w_{1})-g(w_{2})|\leq|w_{1}-w_{2}|$ . Therefore the functions $h(Q,w)$ and $g(w)$ are Lipschitz. ∎

We now consider the iterates (22)-(23) in a combined form as follows:

[TABLE]

where $x_{n}=\left(Q_{n},w_{n}\right)^{T}$, $f(x_{n})=\left(H_{w_{n}}(Q_{n})-Q_{n},w^{*}-w_{n}\right)^{T}$, $M^{\prime}_{n+1}=\left(M_{n+1},0\right)^{T}$, and $\epsilon^{\prime}_{n}=\left(0,\epsilon_{n}\right)^{T}$. Let $Q^{\dagger}_{w^{*}}$ be the fixed point of the modified min-max Q-Bellman operator (see (8)) when $w=w^{*}$ is used.

Theorem 5.

$x_{n}\to x^{*}$, where $x^{*}=\left(Q^{\dagger}_{w^{*}},w^{*}\right)^{T}$, almost surely.

Proof.

The iterates $\{x_{n}\}$ in (26) track the ODE [15, Section 2.2] $\dot{x}=f(x)=\left(H_{w}(Q)-Q,w^{*}-w\right)^{T}$. Note that the $w_{n}$ iterates drive the $Q_{n}$ iterates but the reverse is not true, i.e., the dynamics are coupled one way. First, consider the ODE $\dot{w}=w^{*}-w$. Let $g_{\infty}(w)=\displaystyle\lim_{r\to\infty}\frac{g(rw)}{r}$. The function $g_{\infty}(w)$ exists and is equal to $-w$. Moreover, the origin is the unique globally asymptotically stable equilibrium for the ODE

$$\dot{w}=g_{\infty}(w)=-w,$$

with $V(w)=\frac{w^{2}}{2}$ serving as an associated Lyapunov function. Further, $w^{*}$ is the unique globally asymptotically stable equilibrium for the ODE $\dot{w}=w^{*}-w$. Therefore, by [15, Theorem 7, Chapter 3 and Theorem 2 - Corollary 4], we have $w_{n}\to w^{*}$ almost surely. The $Q_{n}$ iterates now track the ODE given by $\dot{Q}=H_{w^{*}}(Q)-Q$. By virtue of Lemma 3, $H_{w^{*}}$ is a contraction. Hence, by the stochastic fixed point analysis of [15, Section 10.3], we have $Q_{n}\to Q^{\dagger}_{w^{*}}$, almost surely. ∎
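The stability of the $w$-dynamics can be checked numerically. The following is a minimal sketch, assuming forward-Euler integration with an illustrative step size `dt` and an arbitrary value of `w_star`; it integrates $\dot{w}=w^{*}-w$ and evaluates the Lyapunov candidate shifted to the equilibrium, $V=(w-w^{*})^{2}/2$. The function names are hypothetical.

```python
def simulate_w_ode(w0, w_star, dt=0.01, steps=2000):
    """Forward-Euler trajectory of dw/dt = w_star - w."""
    traj = [w0]
    for _ in range(steps):
        traj.append(traj[-1] + dt * (w_star - traj[-1]))
    return traj

def lyapunov(w, w_star):
    """Shifted Lyapunov candidate V = (w - w_star)^2 / 2."""
    return 0.5 * (w - w_star) ** 2

w_star = 1.5  # illustrative value in [1, 1 / (1 - alpha)] for alpha = 0.6
trajectories = {w0: simulate_w_ode(w0, w_star) for w0 in (1.0, 2.5)}
lyapunov_values = {
    w0: [lyapunov(w, w_star) for w in traj]
    for w0, traj in trajectories.items()
}
```

From either end of the interval the trajectory approaches `w_star` and the Lyapunov values decrease along it, which is the qualitative behaviour the proof relies on.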

Remark 5.

One way to ensure that Assumption 7 holds is to project the $\{Q_{n}\}$ iterates onto a prespecified convex and compact set $C$. Under projection, the update equation (22), i.e.,

[TABLE]

is replaced with

[TABLE]

where $\Gamma_{C}(P)$ is the projection of $P$ onto a compact and convex set such as $C=[-B,B]^{|S|\times|U|\times|V|}$. Convexity of $C$ ensures that the projection $\Gamma_{C}(P)$ is unique for any $P$.
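For the box $C=[-B,B]^{|S|\times|U|\times|V|}$, the projection reduces to clipping each coordinate. A minimal sketch, assuming the Q-values are stored as a flat list and that `increments` stands in for the bracketed update term of the recursion (which is not reproduced here); all names are hypothetical.

```python
def project_onto_box(q_values, bound):
    """Euclidean projection onto C = [-bound, bound]^d, done coordinate-wise."""
    return [min(bound, max(-bound, q)) for q in q_values]

def projected_step(q_values, increments, step, bound):
    """Apply Q + step * increment, then project the result back onto C."""
    moved = [q + step * d for q, d in zip(q_values, increments)]
    return project_onto_box(moved, bound)

B = 10.0
q = [9.5, -9.9, 0.0]
q_next = projected_step(q, increments=[8.0, -4.0, 1.0], step=0.5, bound=B)
```

Coordinates that would leave the box are pulled back to its boundary, while interior coordinates are left unchanged.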

The iterates $\{Q_{n}\}$ in (28) track the ODE [19, Chapter 5]

[TABLE]

where the operator $\hat{\Gamma}(h)$, for a continuous function $h$, is defined as:

[TABLE]

From [15, Theorem 2, Chapter 2], $\{Q_{n}\}$ iterates converge to a compact, connected, internally chain transitive, invariant set of the ODE (29). It is easy to see that $\{Q^{\dagger}_{w^{*}}\}$ is an invariant and internally chain transitive (ICT) set of the ODE (29). However, the projection operation will introduce spurious fixed points on the boundary of the set $C$ that will also be invariant and ICT sets of the ODE (29). In [15, Chapter 5.4], some practical techniques are discussed to avoid convergence to undesired equilibrium points (boundary points in this case).

VII Experiments and Results

We refer to Algorithm 1 with $w=w^{*}$ as the “generalized optimal minimax Q-learning” algorithm and to the model-free algorithm derived in the previous section as the “generalized minimax Q-learning” algorithm in the experiments. We generate a two-player zero-sum Markov game and run all the algorithms for $50$ independent episodes in each of three cases: (a) $10$ states and $5$ actions for each agent, (b) $20$ states and $5$ actions for each agent, and (c) $50$ states and $5$ actions for each agent. The discount factor is set to $0.6$. The generated transition probability matrix satisfies $p(i|i,u,v)>0,\ \forall i,u,v$, as this condition is required for the faster performance of the generalized optimal minimax Q-learning and generalized minimax Q-learning algorithms. All the algorithms are run for $1000$ iterations in each episode with the same step-size sequences.
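A game of this kind could be generated along the following lines. This is a sketch under stated assumptions rather than the experimental code: the uniform sampling scheme, the `floor` offset that keeps every entry (and hence every self-loop probability) strictly positive, and the function names are hypothetical, and $w^{*}$ is assumed to take the form $\frac{1}{1-\alpha\min_{i,u,v}p(i|i,u,v)}$ suggested by the definition of $\epsilon_{n}$ in Section VI-A.

```python
import random

def random_transition_model(n_states, n_actions, seed=0, floor=1e-3):
    """Random p[(i, u, v)] = list over next states, with p(i | i, u, v) > 0."""
    rng = random.Random(seed)
    p = {}
    for i in range(n_states):
        for u in range(n_actions):
            for v in range(n_actions):
                weights = [rng.random() + floor for _ in range(n_states)]
                total = sum(weights)
                p[(i, u, v)] = [x / total for x in weights]
    return p

def optimal_relaxation(p, alpha):
    """Assumed form w* = 1 / (1 - alpha * min_{i,u,v} p(i | i, u, v))."""
    min_self_loop = min(probs[i] for (i, _, _), probs in p.items())
    return 1.0 / (1.0 - alpha * min_self_loop)

alpha = 0.6
p = random_transition_model(n_states=10, n_actions=5)
w_star = optimal_relaxation(p, alpha)
```

Because every self-loop probability is strictly positive, the resulting `w_star` is strictly greater than 1, which is the regime in which the relaxed operator contracts faster than the standard one.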

The comparison criterion is the average error, calculated as follows. At the end of each episode of the algorithm, the norm of the difference between the estimate of the min-max value function and the actual min-max value function is computed. This is repeated for all $50$ episodes and the average is computed. Thus,

[TABLE]

where $J^{*}$ is the min-max value function of the game and $Q_{k}(\cdot)$ is the minimax Q-value function estimate obtained at the end of the $k^{\text{th}}$ episode.
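The averaging step of this criterion can be sketched as below. Two assumptions are made: the norm is taken to be the max-norm, and each episode's value-function estimate has already been extracted from $Q_{k}$ (that extraction requires solving a matrix game at every state and is omitted). The function names and the numbers are hypothetical.

```python
def max_norm_diff(a, b):
    """Max-norm of the difference between two value vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def average_error(j_star, episode_values):
    """Mean over episodes of the distance between J* and the episode estimate."""
    total = sum(max_norm_diff(j_star, v) for v in episode_values)
    return total / len(episode_values)

# Illustrative numbers only: 3 states, 2 episodes.
j_star = [1.0, 2.0, 3.0]
episode_values = [[1.1, 2.0, 2.9], [0.8, 2.1, 3.0]]
err = average_error(j_star, episode_values)
```

Here the per-episode errors are approximately 0.1 and 0.2, so `err` is approximately 0.15.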

In Table I, we report the average error of the three algorithms. We can see that generalized optimal minimax Q-learning has the least average error, followed by the generalized minimax Q-learning algorithm. This is expected, as the generalized optimal minimax Q-learning algorithm makes use of the optimal relaxation parameter $w^{*}$ in its updates, which is not practically feasible. Therefore, we conclude that our proposed generalized minimax Q-learning algorithms perform empirically better (in terms of the number of samples) than the standard minimax Q-learning algorithm.

VIII Conclusions

In this work, we use the technique of successive relaxation to propose a modified min-max Bellman operator for two-player zero-sum games. We prove that the contraction factor of this modified min-max Bellman operator is less than the discount factor (the contraction factor of the standard min-max Bellman operator) for the choice of $w>1$. The construction of the modified Q-Bellman operator enables us to develop a generalized minimax Q-learning algorithm. We show the almost sure convergence of our proposed algorithm. We then derive a relation between our proposed algorithm and the standard minimax Q-learning algorithm. We also propose a model-free (sample-based) version of our algorithm and prove its convergence under the assumption that the iterates are bounded. In the future, we would like to incorporate function approximation architectures and apply our proposed algorithm to practical applications. Moreover, as future work, we would like to explore the theoretical sample complexity of our algorithm and compare it with that of minimax Q-learning.

IX Acknowledgements

Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. Shalabh Bhatnagar was supported by the J.C.Bose Fellowship, a project from DST under the ICPS Program and the RBCCPS, IISc.

References

[1] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 2013, vol. 2.
[2] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994. Elsevier, 1994, pp. 157–163.
[3] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
[4] M. L. Littman, “Friend-or-foe Q-learning in general-sum games,” in ICML, vol. 1, 2001, pp. 322–328.
[5] A. Greenwald, K. Hall, and R. Serrano, “Correlated Q-learning,” in ICML, vol. 3, 2003, pp. 242–249.
[6] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 1021–1026.
[7] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[8] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” arXiv preprint arXiv:1911.10635, 2019.
