Learning Nash Equilibria in Monotone Games

Tatiana Tatarenko; Maryam Kamgarpour

arXiv:1904.01882·cs.MA·April 4, 2019


TL;DR

This paper introduces a distributed algorithm for learning Nash equilibria in monotone games, requiring only local cost evaluations and converging under mild monotonicity assumptions, broadening applicability.

Contribution

It presents a novel distributed method that guarantees convergence in monotone games without needing strong monotonicity, unlike previous algorithms.

Findings

01. Algorithm converges under mere monotonicity.

02. Applicable to games with linear coupling constraints.

03. Broadens the class of games where Nash equilibria can be learned.

Abstract

We consider multi-agent decision making where each agent's cost function depends on all agents' strategies. We propose a distributed algorithm to learn a Nash equilibrium, whereby each agent uses only the observed values of her cost function at each jointly played action, without any information on the functional form of her cost or on other agents' costs or strategy sets. In contrast to past work, where convergent algorithms required strong monotonicity, we prove convergence of the algorithm under a mere monotonicity assumption. This significantly widens the algorithm's applicability, for example to games with linear coupling constraints.

Equations

$$\boldsymbol{M}(\boldsymbol{a})=(\boldsymbol{M}_{1}(\boldsymbol{a}),\ldots,\boldsymbol{M}_{N}(\boldsymbol{a}))^{\top},\quad\mbox{where }\boldsymbol{M}_{i}(\boldsymbol{a})=(M_{i,1}(\boldsymbol{a}),\ldots,M_{i,d}(\boldsymbol{a}))^{\top}\mbox{ and }M_{i,k}(\boldsymbol{a})=\frac{\partial J_{i}(\boldsymbol{a})}{\partial a^{i}_{k}}.$$

$$J_{i}(\boldsymbol{a}^{i*},\boldsymbol{a}^{-i*})\leq J_{i}(\boldsymbol{a}^{i},\boldsymbol{a}^{-i*}).$$

$$p_{i}(x^{i}_{1},\ldots,x^{i}_{d};\boldsymbol{\mu}^{i}(t+1),\sigma(t+1))=\frac{1}{(\sqrt{2\pi}\,\sigma(t+1))^{d}}\exp\left\{-\sum_{k=1}^{d}\frac{(x^{i}_{k}-\mu^{i}_{k}(t+1))^{2}}{2\sigma^{2}(t+1)}\right\}.$$

$$\boldsymbol{\mu}^{i}(t+1)=\mbox{Proj}_{A_{i}}\left[\boldsymbol{\mu}^{i}(t)-\gamma(t)\sigma^{2}(t)\left(\hat{J}_{i}(t)\frac{\mathbf{x}^{i}(t)-\boldsymbol{\mu}^{i}(t)}{\sigma^{2}(t)}+\epsilon(t)\boldsymbol{\mu}^{i}(t)\right)\right].$$

$$\tilde{J}_{i}(\boldsymbol{\mu}^{1},\ldots,\boldsymbol{\mu}^{N},\sigma)=\int_{\mathbb{R}^{Nd}}J_{i}(\boldsymbol{x})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}.$$

$$\frac{\partial\tilde{J}_{i}(\boldsymbol{\mu}(t),\sigma(t))}{\partial\mu^{i}_{k}}=\mathrm{E}_{\mathbf{x}(t)}\left\{\hat{J}_{i}(t)\frac{x^{i}_{k}(t)-\mu^{i}_{k}(t)}{\sigma^{2}(t)}\right\},\quad x^{i}_{k}(t)\sim\mathcal{N}(\mu^{i}_{k}(t),\sigma(t)),\ i\in[N],\ k\in[d].$$

$$\frac{1}{\sigma^{2}}\int_{\mathbb{R}^{Nd}}J_{i}(\boldsymbol{x})(x^{i}_{k}-\mu^{i}_{k})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}.$$

$$\begin{aligned}\int_{\mathbb{R}^{Nd}}J_{i}(\boldsymbol{x})(x^{i}_{k}-\mu^{i}_{k})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}&=\int_{\mathbb{R}^{Nd}}\left[J_{i}(\boldsymbol{\mu}(i,k))+\frac{\partial J_{i}(\boldsymbol{\eta}(\boldsymbol{x},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(x^{i}_{k}-\mu^{i}_{k})\right](x^{i}_{k}-\mu^{i}_{k})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}\\&=\int_{\mathbb{R}^{Nd}}\frac{\partial J_{i}(\boldsymbol{\eta}(\boldsymbol{x},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(x^{i}_{k}-\mu^{i}_{k})^{2}\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}\\&=\int_{\mathbb{R}^{Nd}}\frac{\partial J_{i}(\boldsymbol{\eta}_{1}(\boldsymbol{y},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(y^{i}_{k})^{2}\,p(\boldsymbol{0},\boldsymbol{y},\sigma)\,d\boldsymbol{y}.\end{aligned}$$

$$\left|\frac{\partial J_{i}(\boldsymbol{\eta}_{1}(\boldsymbol{y},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(y^{i}_{k})^{2}\,p(\boldsymbol{0},\boldsymbol{y},\sigma)\right|\leq h(\boldsymbol{y})=l^{i}_{k}(y^{i}_{k})^{2}\,p(\boldsymbol{0},\boldsymbol{y},\sigma).$$

$$\boldsymbol{\mu}^{i}(t+1)=\mbox{Proj}_{A_{i}}\big[\boldsymbol{\mu}^{i}(t)-\gamma(t)\sigma^{2}(t)\big(\boldsymbol{M}_{i}(\boldsymbol{\mu}(t))+\boldsymbol{Q}_{i}(\boldsymbol{\mu}(t),\sigma(t))+\boldsymbol{R}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))+\epsilon(t)\boldsymbol{\mu}^{i}(t)\big)\big].$$

$$\boldsymbol{Q}_{i}(\boldsymbol{\mu}(t),\sigma(t))=\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t),\sigma(t))-\boldsymbol{M}_{i}(\boldsymbol{\mu}(t)).$$

$$\boldsymbol{R}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))=\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))-\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t),\sigma(t)).$$

$$\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))=\hat{J}_{i}(t)\frac{\mathbf{x}^{i}(t)-\boldsymbol{\mu}^{i}(t)}{\sigma^{2}(t)}.$$

$$\tilde{M}_{i,k}(\boldsymbol{\mu}(t),\sigma(t))=\frac{\partial\tilde{J}_{i}(\boldsymbol{\mu}(t),\sigma(t))}{\partial\mu^{i}_{k}},\mbox{ for }k\in[d].$$

$$\boldsymbol{Q}(\boldsymbol{\mu}(t),\sigma(t))=(\boldsymbol{Q}_{1}(\boldsymbol{\mu}(t),\sigma(t)),\ldots,\boldsymbol{Q}_{N}(\boldsymbol{\mu}(t),\sigma(t))).$$

$$\boldsymbol{R}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))=(\boldsymbol{R}_{1}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t)),\ldots,\boldsymbol{R}_{N}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))).$$

$$\boldsymbol{R}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))=\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))-\mathrm{E}_{\mathbf{x}(t)}\{\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))\},\quad i\in[N].$$

$$\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t))=\int_{\mathbb{R}^{Nd}}\boldsymbol{M}_{i}(\boldsymbol{x})\,p(\boldsymbol{\mu}(t),\boldsymbol{x},\sigma(t))\,d\boldsymbol{x}.$$

$$\boldsymbol{y}(t)\in SOL(\boldsymbol{A},\boldsymbol{M}(\boldsymbol{y})+\epsilon(t)\boldsymbol{y}).$$

$$\|\boldsymbol{y}(t)-\boldsymbol{y}(t-1)\|\leq M_{\boldsymbol{y}}\frac{|\epsilon(t-1)-\epsilon(t)|}{\epsilon(t)},\quad\forall t\geq 1.$$

$$\|\boldsymbol{\mu}-\boldsymbol{y}(t)\|\leq\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|+M_{\boldsymbol{y}}\frac{|\epsilon(t-1)-\epsilon(t)|}{\epsilon(t)}.$$

$$2ab\leq\theta a^{2}+\frac{b^{2}}{\theta}.$$

$$\|\boldsymbol{\mu}-\boldsymbol{y}(t)\|^{2}\leq(1+\theta)\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|^{2}+\left(1+\frac{1}{\theta}\right)M_{\boldsymbol{y}}^{2}\frac{|\epsilon(t-1)-\epsilon(t)|^{2}}{\epsilon^{2}(t)}.$$


Full text

Learning Nash Equilibria in Monotone Games

Tatiana Tatarenko and Maryam Kamgarpour, IEEE Member

T. Tatarenko is with the Control Methods and Robotics Lab, Technical University Darmstadt, Darmstadt, Germany 64283. M. Kamgarpour is with the Automatic Control Laboratory, ETH Zürich, Switzerland. M. Kamgarpour gratefully acknowledges ERC Starting Grant CONENE.


Index Terms:

learning in games, distributed algorithms

I Introduction

Game theory is a powerful framework for analyzing and optimizing multi-agent decision making problems. In several such problems, each agent (also referred to as a player) does not have full information on her objective function, due to the unknown interactions and other players’ strategies affecting her objective. Consider, for example, a transportation network in which an agent’s objective is to minimize her travel time, or an electricity network in which an agent’s objective is to minimize her own electricity price. In these instances, the travel times and prices, respectively, depend non-trivially on the strategies of other agents. Motivated by this limited information setup, we consider computing Nash equilibria given only so-called payoff-based information. That is, each player can only observe the values of her objective function at a jointly played action, knows neither the functional form of her or others’ objectives nor the strategy sets and actions of other players, and cannot communicate with other players. In this setting, we address the question of how agents should update their actions to converge to a Nash equilibrium strategy.

A large body of literature on learning Nash equilibria with payoff-based information has focused on the finite action setting or on potential games; see, for example, [11, 12, 7] and references therein. For games with continuous (uncountable) action spaces, a payoff-based approach was developed based on the extremum seeking idea in optimization [3, 13]; assuming strongly convex objectives, almost sure convergence to the Nash equilibrium was proven. A payoff-based approach inspired by the logit dynamics in finite action games [2] was extended to the continuous action setting for the case of potential games [14]. The work in [16] considered learning Nash equilibria in continuous action games on networks. Crucially, that work additionally assumed that each player exchanges information with her neighbors to facilitate online estimation of the gradient of her objective function.

Recently, we proposed a payoff-based approach to learn Nash equilibria in a class of convex games [15]. Our approach hinged upon connecting Nash equilibria of a game to the solution set of a related variational inequality problem. Convergence of our algorithm was established for the cases in which the game mapping is strongly monotone or the game admits a potential function. Apart from the possibly limited scope of potential games, strong monotonicity can be too much to ask for. In particular, if the objective function of an agent is linear in her action, or in the presence of coupling constraints on the action sets, the game mapping will not be strongly monotone.

Our goal here is to extend the existing payoff-based learning approaches to a broader class of games characterized by monotone game mappings. While algorithms for solving monotone variational inequalities exist (see, for example, Chapter 12 in [9]), these algorithms either operate on two timescales (the Tikhonov regularization approach) or require an extra gradient step (extra-gradient methods). As such, they require more coordination between players than is possible in a purely payoff-based information structure.

Our contributions are as follows. First, we propose a distributed payoff-based algorithm to learn Nash equilibria in a monotone game. It extends our past work [15], which is applicable to strongly monotone games, and is inspired by the single timescale algorithm for solving stochastic variational inequalities [6]. Second, despite the lack of gradient information in the payoff-based setting, in contrast to the setup in [6], we show that our proposed procedure can be interpreted as stochastic gradient descent with additional bias and regularization terms. Third, we prove convergence of the proposed algorithm to Nash equilibria by suitably bounding the bias and noise variance terms using established results on boundedness and convergence of discrete-time Markov processes.

Notations. The set $\{1,\ldots,N\}$ is denoted by $[N]$. Boldface is used to distinguish between vectors in a multi-dimensional space and scalars. Given $N$ vectors $\boldsymbol{x}^{i}\in\mathbb{R}^{d}$, $i\in[N]$, we write $(\boldsymbol{x}^{i})_{i=1}^{N}:=({\boldsymbol{x}^{1}}^{\top},\ldots,{\boldsymbol{x}^{N}}^{\top})^{\top}\in\mathbb{R}^{Nd}$ and $\boldsymbol{x}^{-i}:=({\boldsymbol{x}^{1}},\ldots,{\boldsymbol{x}^{i-1}},{\boldsymbol{x}^{i+1}},\ldots,{\boldsymbol{x}^{N}})\in\mathbb{R}^{(N-1)d}$. $\mathbb{R}^{d}_{+}$ and $\mathbb{Z}_{+}$ denote, respectively, vectors from $\mathbb{R}^{d}$ with non-negative coordinates and non-negative whole numbers. The standard inner product on $\mathbb{R}^{d}$ is denoted by $(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$, with associated norm $\|\boldsymbol{x}\|:=\sqrt{(\boldsymbol{x},\boldsymbol{x})}$. Given a matrix $A\in\mathbb{R}^{d\times d}$, we write $A\succeq(\succ)0$ if and only if $\boldsymbol{x}^{\top}A\boldsymbol{x}\geq(>)0$ for all $\boldsymbol{x}\neq 0$. We use the big-$O$ notation, that is, the function $f(x):\mathbb{R}\to\mathbb{R}$ is $O(g(x))$ as $x\to a$, written $f(x)=O(g(x))$ as $x\to a$, if $\lim_{x\to a}\frac{|f(x)|}{|g(x)|}\leq K$ for some positive constant $K$. We say that a function $f(\boldsymbol{x})$ grows not faster than a function $g(\boldsymbol{x})$ as $\boldsymbol{x}\to\infty$ if there exists a positive constant $Q$ such that $f(\boldsymbol{x})\leq g(\boldsymbol{x})$ for all $\boldsymbol{x}$ with $\|\boldsymbol{x}\|\geq Q$.

Definition 1

A mapping $\boldsymbol{M}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is monotone over $X\subseteq\mathbb{R}^{d}$ , if $(\boldsymbol{M}(\boldsymbol{x})-\boldsymbol{M}(\boldsymbol{y}),\boldsymbol{x}-\boldsymbol{y})\geq 0$ for every $\boldsymbol{x},\boldsymbol{y}\in X$ .
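As a minimal illustration of Definition 1 (not taken from the paper), consider a linear mapping $\boldsymbol{M}(\boldsymbol{x})=B\boldsymbol{x}$, for which $(\boldsymbol{M}(\boldsymbol{x})-\boldsymbol{M}(\boldsymbol{y}),\boldsymbol{x}-\boldsymbol{y})=(\boldsymbol{x}-\boldsymbol{y})^{\top}B(\boldsymbol{x}-\boldsymbol{y})$. The sketch below probes monotonicity by sampling random pairs of points. The matrices `B_SKEW` and `B_NEG` and all function names are assumptions chosen for illustration; the skew-symmetric choice gives a mapping that is monotone but not strongly monotone.

```python
import random

def apply_linear(B, x):
    """Return the matrix-vector product B x for a list-of-lists matrix B."""
    return [sum(b_ij * x_j for b_ij, x_j in zip(row, x)) for row in B]

def monotonicity_gap(B, x, y):
    """Return (M(x) - M(y), x - y) for the linear mapping M(x) = B x."""
    Mx, My = apply_linear(B, x), apply_linear(B, y)
    return sum((mx - my) * (xi - yi) for mx, my, xi, yi in zip(Mx, My, x, y))

def looks_monotone(B, trials=1000, seed=0, tol=1e-12):
    """Sample random pairs and check that the gap is never negative (up to tol).

    This is a numerical probe, not a proof of monotonicity.
    """
    rng = random.Random(seed)
    d = len(B)
    for _ in range(trials):
        x = [rng.uniform(-5, 5) for _ in range(d)]
        y = [rng.uniform(-5, 5) for _ in range(d)]
        if monotonicity_gap(B, x, y) < -tol:
            return False
    return True

# Hypothetical example matrices (illustration only).
B_SKEW = [[0.0, 1.0], [-1.0, 0.0]]  # gap is identically zero: monotone, not strongly monotone
B_NEG = [[-1.0, 0.0], [0.0, 1.0]]   # gap is negative along the first axis: not monotone
```

Because the gap of `B_SKEW` is zero for every pair of points, no positive constant can bound it from below by a multiple of $\|\boldsymbol{x}-\boldsymbol{y}\|^{2}$, which is the situation that the merely monotone setting is meant to cover.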

II Problem Formulation

Consider a game $\Gamma(N,\{A_{i}\},\{J_{i}\})$ with $N$ players, the sets of players’ actions $A_{i}\subseteq\mathbb{R}^{d}$ , $i\in[N]$ , and the cost (objective) functions $J_{i}:\boldsymbol{A}\to\mathbb{R}$ , where $\boldsymbol{A}=A_{1}\times\ldots\times A_{N}$ denotes the set of joint actions. We restrict the class of games as follows.

Assumption 1

The game under consideration is convex. Namely, for all $i\in[N]$ the set $A_{i}$ is convex and closed, the cost function $J_{i}(\boldsymbol{a}^{i},\boldsymbol{a}^{-i})$ is defined on $\mathbb{R}^{Nd}$ , continuously differentiable in $\boldsymbol{a}$ and convex in $\boldsymbol{a}^{i}$ for fixed $\boldsymbol{a}^{-i}$ .

Assumption 2

The mapping $\boldsymbol{M}:\mathbb{R}^{Nd}\to\mathbb{R}^{Nd}$ , referred to as the game mapping, defined by

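$$\boldsymbol{M}(\boldsymbol{a})=(\boldsymbol{M}_{1}(\boldsymbol{a}),\ldots,\boldsymbol{M}_{N}(\boldsymbol{a}))^{\top},\quad\mbox{where }\boldsymbol{M}_{i}(\boldsymbol{a})=(M_{i,1}(\boldsymbol{a}),\ldots,M_{i,d}(\boldsymbol{a}))^{\top}\mbox{ and }M_{i,k}(\boldsymbol{a})=\frac{\partial J_{i}(\boldsymbol{a})}{\partial a^{i}_{k}},\quad i\in[N],\ k\in[d],$$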

is monotone on $\boldsymbol{A}$ (see Definition 1).

We consider a Nash equilibrium in game $\Gamma(N,\{A_{i}\},\{J_{i}\})$ as a stable solution outcome because it represents a joint action from which no player has any incentive to unilaterally deviate.

Definition 2

A point $\boldsymbol{a}^{*}\in\boldsymbol{A}$ is called a Nash equilibrium if for any $i\in[N]$ and $\boldsymbol{a}^{i}\in A_{i}$

$$J_{i}(\boldsymbol{a}^{i*},\boldsymbol{a}^{-i*})\leq J_{i}(\boldsymbol{a}^{i},\boldsymbol{a}^{-i*}).$$

Our goal is to learn such a stable action in a game through designing a payoff-based algorithm. We first connect existence of Nash equilibria for $\Gamma(N,\{A_{i}\},\{J_{i}\})$ with solution set of a corresponding variational inequality problem.

Definition 3

Consider a mapping $\boldsymbol{T}(\cdot):\mathbb{R}^{d}\to\mathbb{R}^{d}$ and a set $Y\subseteq\mathbb{R}^{d}$. The solution set $SOL(Y,\boldsymbol{T})$ of the variational inequality problem $VI(Y,\boldsymbol{T})$ is the set of vectors $\mathbf{y}^{*}\in Y$ such that $(\boldsymbol{T}(\mathbf{y}^{*}),\mathbf{y}-\mathbf{y}^{*})\geq 0$ for all $\mathbf{y}\in Y$.

Theorem 1

(Proposition 1.4.2 in [9]) Given a game $\Gamma(N,\{A_{i}\},\{J_{i}\})$ with game mapping $\boldsymbol{M}$ , suppose that the action sets $\{A_{i}\}$ are closed and convex, the cost functions $\{J_{i}\}$ are continuously differentiable in $\boldsymbol{a}$ and convex in $\boldsymbol{a}^{i}$ for every fixed $\boldsymbol{a}^{-i}$ on the interior of $\boldsymbol{A}$ . Then, some vector $\boldsymbol{a}^{*}\in\boldsymbol{A}$ is a Nash equilibrium in $\Gamma$ , if and only if $\boldsymbol{a}^{*}\in SOL(\boldsymbol{A},\boldsymbol{M})$ .

It follows that, under Assumptions 1 and 2, for a game with mapping $\boldsymbol{M}$ any solution of $VI(\boldsymbol{A},\boldsymbol{M})$ is also a Nash equilibrium of the game, and vice versa. While $\Gamma(N,\{A_{i}\},\{J_{i}\})$ under Assumptions 1 and 2 might admit a Nash equilibrium, these two assumptions alone do not guarantee existence of a Nash equilibrium. To guarantee existence, one needs a more restrictive assumption, for example, strong monotonicity of the game mapping or compactness of the action sets [9]. Here, we do not restrict our attention to such cases. However, to have a meaningful discussion, we do assume existence of at least one Nash equilibrium in the game.

Assumption 3

The set $SOL(\boldsymbol{A},\boldsymbol{M})$ is not empty.

Corollary 1

Let $\Gamma(N,\{A_{i}\},\{J_{i}\})$ be a game with game mapping $\boldsymbol{M}$ for which Assumptions 1, 2, and 3 hold. Then, there exists at least one Nash equilibrium in $\Gamma$ . Moreover, any Nash equilibrium in $\Gamma$ belongs to the set $SOL(\boldsymbol{A},\boldsymbol{M})$ .
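The equivalence between Nash equilibria and solutions of $VI(\boldsymbol{A},\boldsymbol{M})$ can be probed numerically. A standard fact for closed convex $\boldsymbol{A}$ is that $\boldsymbol{a}^{*}\in SOL(\boldsymbol{A},\boldsymbol{M})$ if and only if $\boldsymbol{a}^{*}=\mbox{Proj}_{\boldsymbol{A}}[\boldsymbol{a}^{*}-\boldsymbol{M}(\boldsymbol{a}^{*})]$. The sketch below assumes a hypothetical two-player game with $J_{1}(\boldsymbol{a})=a^{1}a^{2}$ and $J_{2}(\boldsymbol{a})=-a^{1}a^{2}$ on $A_{i}=[-1,1]$, whose game mapping is $\boldsymbol{M}(\boldsymbol{a})=(a^{2},-a^{1})$. The game, the box bounds, and the function names are illustrative assumptions, not part of the paper.

```python
def project_box(a, lo=-1.0, hi=1.0):
    """Project each coordinate of a onto the interval [lo, hi]."""
    return [min(max(ai, lo), hi) for ai in a]

def game_mapping(a):
    """Game mapping M(a) = (dJ_1/da^1, dJ_2/da^2) for J_1 = a1*a2, J_2 = -a1*a2."""
    a1, a2 = a
    return [a2, -a1]

def natural_residual(a):
    """Norm of a - Proj_A[a - M(a)]; it vanishes exactly on SOL(A, M)."""
    shifted = [ai - mi for ai, mi in zip(a, game_mapping(a))]
    proj = project_box(shifted)
    return sum((ai - pi) ** 2 for ai, pi in zip(a, proj)) ** 0.5

def is_nash(a, grid=201, tol=1e-12):
    """Check Definition 2 directly by scanning unilateral deviations on a grid."""
    a1, a2 = a
    cost1 = lambda u, v: u * v
    cost2 = lambda u, v: -u * v
    points = [-1.0 + 2.0 * k / (grid - 1) for k in range(grid)]
    ok1 = all(cost1(a1, a2) <= cost1(u, a2) + tol for u in points)
    ok2 = all(cost2(a1, a2) <= cost2(a1, v) + tol for v in points)
    return ok1 and ok2
```

In this toy game, the point $(0,0)$ passes both checks, while $(0.5,0.5)$ fails both: player 1 can lower her cost by deviating to $a^{1}=-1$, and the natural-map residual is nonzero.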

The following additional assumptions are needed for convergence of the proposed payoff-based algorithm to a Nash equilibrium (see proofs of Lemma 3 and Theorem 2).

Assumption 4

Each element $\boldsymbol{M}_{i}$ of the game mapping $\boldsymbol{M}:\mathbb{R}^{Nd}\to\mathbb{R}^{Nd}$, defined in Assumption 2, is Lipschitz continuous on $\mathbb{R}^{Nd}$ with a Lipschitz constant $L_{i}$.

Assumption 5

Each cost function $J_{i}(\boldsymbol{a})$ , $i\in[N]$ , grows not faster than a linear function of $\boldsymbol{a}$ as $\|\boldsymbol{a}\|\to\infty$ .

III Payoff-Based Algorithm

Given payoff-based information, each agent has access to its current action, referred to as its state and denoted by $\mathbf{x}^{i}(t)=(x^{i}_{1},\ldots,x^{i}_{d})^{\top}\in\mathbb{R}^{d}$, and to the cost value $\hat{J}_{i}(t)=J_{i}(\mathbf{x}(t))=J_{i}(\mathbf{x}^{1}(t),\ldots,\mathbf{x}^{N}(t))$ at the joint state $\mathbf{x}(t)=(\mathbf{x}^{1}(t),\ldots,\mathbf{x}^{N}(t))$ at iteration $t$. Using this information, in the proposed algorithm each agent $i$ “mixes” its next state $\mathbf{x}^{i}(t+1)$. Namely, it chooses $\mathbf{x}^{i}(t+1)$ randomly according to the multidimensional normal distribution $\mathcal{N}(\boldsymbol{\mu}^{i}(t+1),\sigma(t+1))$ with mean $\boldsymbol{\mu}^{i}(t+1)=(\mu^{i}_{1}(t+1),\ldots,\mu^{i}_{d}(t+1))^{\top}$ and density:

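$$p_{i}(x^{i}_{1},\ldots,x^{i}_{d};\boldsymbol{\mu}^{i}(t+1),\sigma(t+1))=\frac{1}{(\sqrt{2\pi}\,\sigma(t+1))^{d}}\exp\left\{-\sum_{k=1}^{d}\frac{(x^{i}_{k}-\mu^{i}_{k}(t+1))^{2}}{2\sigma^{2}(t+1)}\right\}.$$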

The initial value of the means $\boldsymbol{\mu}^{i}(0)$, $i\in[N]$, can be set to any finite value. The successive means are updated as follows:

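$$\boldsymbol{\mu}^{i}(t+1)=\mbox{Proj}_{A_{i}}\left[\boldsymbol{\mu}^{i}(t)-\gamma(t)\sigma^{2}(t)\left(\hat{J}_{i}(t)\frac{\mathbf{x}^{i}(t)-\boldsymbol{\mu}^{i}(t)}{\sigma^{2}(t)}+\epsilon(t)\boldsymbol{\mu}^{i}(t)\right)\right].\tag{2}$$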

In the above, $\mbox{Proj}_{C}[\cdot]$ denotes the projection operator onto the set $C$, $\gamma(t)$ is a step-size parameter, and $\epsilon(t)>0$ is a regularization parameter. The difference between the proposed approach and that of [15] is the additional regularization term $\epsilon(t)\boldsymbol{\mu}^{i}(t)$ in (2). In the absence of this term, the algorithm would not be convergent under a mere monotonicity assumption on the game mapping (see the counterexample provided in [4]).
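The sampling and mean-update steps described above can be sketched for a single agent as follows. This is a minimal sketch under stated assumptions: the action set $A_{i}$ is taken to be a box so that the projection is a coordinate-wise clip, and the function names `sample_state`, `project_box`, and `update_mean` are hypothetical helpers, not names from the paper.

```python
import random

def sample_state(mu, sigma, rng):
    """Draw x^i from N(mu^i, sigma): independent Gaussian coordinates with std sigma."""
    return [rng.gauss(m, sigma) for m in mu]

def project_box(v, lo, hi):
    """Projection onto the box [lo, hi]^d, standing in for Proj_{A_i} (assumption)."""
    return [min(max(vk, lo), hi) for vk in v]

def update_mean(mu, x, J_hat, gamma, sigma, eps, lo, hi):
    """One step of the mean update (2) for a single agent.

    mu, x : current mean and sampled state of agent i (lists of length d)
    J_hat : observed cost value J_i(x(t)) at the joint state
    gamma, sigma, eps : step size, standard deviation, regularization at time t
    """
    beta = gamma * sigma ** 2
    step = [
        beta * (J_hat * (xk - mk) / sigma ** 2 + eps * mk)
        for xk, mk in zip(x, mu)
    ]
    return project_box([mk - sk for mk, sk in zip(mu, step)], lo, hi)
```

Each agent only needs its own mean, its own sampled state, and the scalar cost it observed, which matches the payoff-based information structure. In a full run, every agent would call `sample_state`, the environment would return the observed costs, and every agent would then call `update_mean` with its own cost value.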

Let us provide insight into the algorithm by deriving an analogy to a regularized stochastic gradient algorithm. Given $\sigma>0$ , for any $i\in[N]$ define $\tilde{J}_{i}:\mathbb{R}^{Nd}\rightarrow\mathbb{R}$ as

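$$\tilde{J}_{i}(\boldsymbol{\mu}^{1},\ldots,\boldsymbol{\mu}^{N},\sigma)=\int_{\mathbb{R}^{Nd}}J_{i}(\boldsymbol{x})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x},\tag{4}$$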

where $p(\boldsymbol{\mu},\boldsymbol{x},\sigma)=\prod_{i=1}^{N}p_{i}(x^{i}_{1},\ldots,x^{i}_{d};\boldsymbol{\mu}^{i},\sigma)$ . Above, $\tilde{J}_{i}$ , $i\in[N]$ , can be interpreted as the $i$ th player’s cost function in mixed strategies. We can now show that the second term inside the projection in (2) is a sample of the gradient of this cost function $\tilde{J}_{i}$ with respect to the mixed strategies. Let $\boldsymbol{\mu}(t)=(\boldsymbol{\mu}^{1}(t),\ldots,\boldsymbol{\mu}^{N}(t))$ .

Lemma 1

Under Assumptions 1 and 5, $\forall i\in[N],k\in[d]$

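$$\frac{\partial\tilde{J}_{i}(\boldsymbol{\mu}(t),\sigma(t))}{\partial\mu^{i}_{k}}=\mathrm{E}_{\mathbf{x}(t)}\left\{\hat{J}_{i}(t)\frac{x^{i}_{k}(t)-\mu^{i}_{k}(t)}{\sigma^{2}(t)}\right\}=\mathrm{E}\left\{J_{i}(\mathbf{x}^{1}(t),\ldots,\mathbf{x}^{N}(t))\frac{x^{i}_{k}(t)-\mu^{i}_{k}(t)}{\sigma^{2}(t)}\,\Big|\,x^{i}_{k}(t)\sim\mathcal{N}(\mu^{i}_{k}(t),\sigma(t)),\ i\in[N],\ k\in[d]\right\}.$$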

Proof:

We verify that the differentiation under the integral sign in (4) is justified. It can then readily be verified that the identity in Lemma 1 holds, by taking the differentiation inside the integral. A sufficient condition for differentiation under the integral is that the integral of the formally differentiated function with respect to $\mu^{i}_{k}$ converges uniformly and that the differentiated function is continuous (see [17], Chapter 17). By formally differentiating the function under the integral sign and omitting the argument $t$, we obtain

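$$\frac{1}{\sigma^{2}}\int_{\mathbb{R}^{Nd}}J_{i}(\boldsymbol{x})(x^{i}_{k}-\mu^{i}_{k})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}.\tag{6}$$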

Given Assumption 1, $J_{i}(\boldsymbol{x})(x^{i}_{k}-\mu^{i}_{k})p(\boldsymbol{\mu},\boldsymbol{x},\sigma)$ is continuous. Thus, it remains to check that the integral of this function converges uniformly with respect to any $\boldsymbol{\mu}\in\mathbb{R}^{Nd}$ . To this end, we can write the Taylor expansion of the function $J_{i}$ around the point $\boldsymbol{\mu}(i,k)\in\mathbb{R}^{Nd}$ with the coordinates $\mu(i,k)^{i}_{k}=\mu^{i}_{k}$ and $\mu(i,k)^{j}_{m}=x^{j}_{m}$ for any $j\neq i$ , $m\neq k$ , in the integral (6):

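$$\begin{aligned}\int_{\mathbb{R}^{Nd}}J_{i}(\boldsymbol{x})(x^{i}_{k}-\mu^{i}_{k})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}&=\int_{\mathbb{R}^{Nd}}\left[J_{i}(\boldsymbol{\mu}(i,k))+\frac{\partial J_{i}(\boldsymbol{\eta}(\boldsymbol{x},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(x^{i}_{k}-\mu^{i}_{k})\right](x^{i}_{k}-\mu^{i}_{k})\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}\\&=\int_{\mathbb{R}^{Nd}}\frac{\partial J_{i}(\boldsymbol{\eta}(\boldsymbol{x},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(x^{i}_{k}-\mu^{i}_{k})^{2}\,p(\boldsymbol{\mu},\boldsymbol{x},\sigma)\,d\boldsymbol{x}\\&=\int_{\mathbb{R}^{Nd}}\frac{\partial J_{i}(\boldsymbol{\eta}_{1}(\boldsymbol{y},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(y^{i}_{k})^{2}\,p(\boldsymbol{0},\boldsymbol{y},\sigma)\,d\boldsymbol{y},\end{aligned}$$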

where $\boldsymbol{\eta}(\boldsymbol{x},\boldsymbol{\mu})=\boldsymbol{\mu}(i,k)+\theta(\boldsymbol{x}-\boldsymbol{\mu}(i,k))$, $\theta\in(0,1)$, $\boldsymbol{y}=\boldsymbol{x}-\boldsymbol{\mu}(i,k)$, and ${\boldsymbol{\eta}_{1}}(\boldsymbol{y},\boldsymbol{\mu})=\boldsymbol{\mu}(i,k)+\theta\boldsymbol{y}$. The uniform convergence of the integral above follows from the fact (see the basic sufficient condition using a majorant in [17], Chapter 17.2.3) that, under Assumption 5, $\left|\frac{\partial J_{i}(\boldsymbol{\eta}_{1}(\boldsymbol{y},\boldsymbol{\mu}))}{\partial x^{i}_{k}}\right|\leq l^{i}_{k}$ for some positive constant $l^{i}_{k}$ and for all $i\in[N]$ and $k\in[d]$. Hence,

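$$\left|\frac{\partial J_{i}(\boldsymbol{\eta}_{1}(\boldsymbol{y},\boldsymbol{\mu}))}{\partial x^{i}_{k}}(y^{i}_{k})^{2}\,p(\boldsymbol{0},\boldsymbol{y},\sigma)\right|\leq h(\boldsymbol{y})=l^{i}_{k}(y^{i}_{k})^{2}\,p(\boldsymbol{0},\boldsymbol{y},\sigma),$$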

where $\int_{\mathbb{R}^{Nd}}h(\boldsymbol{y})d\boldsymbol{y}<\infty$ .∎

Lemma 1 shows that the second term inside the projection in (2) is a sample of the gradient of the cost function in mixed strategies. Hence, algorithm (2) can be interpreted as a regularized stochastic projection algorithm. To bound the bias and variance terms of the stochastic projection, and consequently establish convergence of the iterates $\boldsymbol{\mu}(t)$, the parameters $\gamma(t)$, $\sigma(t)$, $\epsilon(t)$ need to satisfy certain assumptions.
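The gradient-sample interpretation can be checked by simulation on a toy instance. The sketch below assumes a hypothetical two-player game with scalar actions and the linear-growth cost $J_{1}(x^{1},x^{2})=2x^{1}+x^{2}+1$, for which $\tilde{J}_{1}(\boldsymbol{\mu},\sigma)=2\mu^{1}+\mu^{2}+1$ and hence $\partial\tilde{J}_{1}/\partial\mu^{1}=2$. The cost, the sample size, and the function names are illustrative assumptions.

```python
import random

def cost_player1(x1, x2):
    """Hypothetical cost J_1 with linear growth (illustration only)."""
    return 2.0 * x1 + x2 + 1.0

def estimate_partial(mu1, mu2, sigma, n_samples=200_000, seed=0):
    """Average J_1(x) (x^1 - mu^1) / sigma^2 over joint Gaussian samples.

    Only cost values are used; no derivative of J_1 is evaluated.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x1 = rng.gauss(mu1, sigma)
        x2 = rng.gauss(mu2, sigma)
        total += cost_player1(x1, x2) * (x1 - mu1) / sigma ** 2
    return total / n_samples
```

With $\mu=(1,-1)$ and $\sigma=0.5$, the sample average should be close to the exact value $2$. Each individual sample is noisy, with variance growing like $1/\sigma^{2}$, which is one reason the schedules of $\gamma(t)$ and $\sigma(t)$ have to be balanced.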

Assumption 6

Let $\beta(t)=\gamma(t)\sigma^{2}(t)$ and choose $\gamma(t)=\frac{1}{t^{a}}$ , $\sigma(t)=\frac{1}{t^{b}}$ and $\epsilon(t)=\frac{1}{t^{c}}$ , $a,b,c>0$ respectively, such that

a) $\sum_{t=0}^{\infty}\beta(t)=\infty,\quad\lim_{t\to\infty}\epsilon(t)=0,$

b) $\sum_{t=0}^{\infty}\left(1+\frac{1}{\beta(t)\epsilon(t)}\right)\frac{|\epsilon(t-1)-\epsilon(t)|^{2}}{\epsilon^{2}(t)}<\infty,$

c) $\sum_{t=0}^{\infty}\gamma^{2}(t)<\infty,$ $\sum_{t=0}^{\infty}\beta(t)\sigma(t)<\infty$ ,

d) $\lim_{t\to\infty}\sigma(t)=0$ , $\sum_{t=0}^{\infty}\beta(t)\epsilon(t)=\infty.$

Theorem 2

Let the players in game $\Gamma(N,\{A_{i}\},\{J_{i}\})$ choose the states $\{\mathbf{x}^{i}(t)\}$ at time $t$ according to the normal distribution $\mathcal{N}(\boldsymbol{\mu}^{i}(t),\sigma(t))$ , where the mean $\boldsymbol{\mu}^{i}(0)$ is arbitrary and $\boldsymbol{\mu}^{i}(t)$ is updated as in (2). Under Assumptions 1-6, as $t\to\infty$ , the mean vector $\boldsymbol{\mu}(t)$ converges almost surely to a Nash equilibrium $\boldsymbol{\mu}^{*}=\boldsymbol{a}^{*}$ of the game $\Gamma$ and the joint state $\mathbf{x}(t)$ converges in probability to $\boldsymbol{a}^{*}$ .

Remark 1

As an example for existence of parameters to satisfy Assumption 6, let $a=\frac{5}{9}$ , $b=\frac{5}{27}$ , $c=\frac{1}{27}$ .
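For the power-law schedules of Assumption 6, the four conditions reduce to inequalities on the exponents. The reduction used below is our own asymptotic reading, not a statement from the paper: $\beta(t)=t^{-(a+2b)}$, and the summand in b) behaves like $t^{a+2b+c-2}$ because $|\epsilon(t-1)-\epsilon(t)|/\epsilon(t)$ decays like $c/t$. The function name is a hypothetical helper.

```python
from fractions import Fraction

def satisfies_assumption_6(a, b, c):
    """Check exponent conditions for gamma = t^-a, sigma = t^-b, eps = t^-c.

    Uses the standard fact that sum_t t^-p converges iff p > 1.
    """
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    if min(a, b, c) <= 0:
        return False
    beta_exp = a + 2 * b                       # beta(t) = t^-(a + 2b)
    cond_a = beta_exp <= 1                     # sum beta diverges; eps -> 0 since c > 0
    cond_b = 2 - (beta_exp + c) > 1            # summand in b) decays like t^(a+2b+c-2)
    cond_c = 2 * a > 1 and beta_exp + b > 1    # sum gamma^2 and sum beta*sigma converge
    cond_d = beta_exp + c <= 1                 # sum beta*eps diverges; sigma -> 0 since b > 0
    return cond_a and cond_b and cond_c and cond_d
```

Under this reading the conditions collapse to $a>\frac{1}{2}$, $a+3b>1$, and $a+2b+c<1$, and the values of Remark 1 satisfy all three: $a+2b+c=\frac{26}{27}$, $2a=\frac{10}{9}$, and $a+3b=\frac{30}{27}$.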

IV Analysis of the Algorithm

To prove Theorem 2 we first prove boundedness of the iterates $\boldsymbol{\mu}(t)$ . Due to the regularization term $\epsilon(t)$ , this is done by analyzing distance of $\boldsymbol{\mu}(t)$ from the so-called Tikhonov trajectory. Having established this boundedness, we can readily show that the limit of the iterates $\boldsymbol{\mu}(t)$ exists and satisfies the conditions of a Nash equilibrium of the game $\Gamma(N,\{A_{i}\},\{J_{i}\})$ . For the boundedness and the convergence proofs, we use established results on boundedness ([8], Theorem 2.5.2) and convergence of a sequence of stochastic processes (Lemma 10 (page 49) in [10]), respectively. For ease of reference, we provide the statement of ([8], Theorem 2.5.2) and (Lemma 10 (page 49) in [10] ) in the appendix.

IV-A Boundedness of the Algorithm Iterates

We first show that algorithm (2) falls under the framework of the well-studied Robbins-Monro stochastic approximation procedures [1] with an additional regularization $\epsilon(t)$. Next, leveraging this analogy and results on stability of discrete-time Markov processes ([8], Theorem 2.5.2) applied to the sequence $\boldsymbol{\mu}(t)$, we prove boundedness of the iterates.

Using the notation $\boldsymbol{M}_{i}(\cdot)=(M_{i,1}(\cdot),\ldots,M_{i,d}(\cdot))$ , we can rewrite the algorithm step in (2) in the following form:

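$$\boldsymbol{\mu}^{i}(t+1)=\mbox{Proj}_{A_{i}}\big[\boldsymbol{\mu}^{i}(t)-\gamma(t)\sigma^{2}(t)\big(\boldsymbol{M}_{i}(\boldsymbol{\mu}(t))+\boldsymbol{Q}_{i}(\boldsymbol{\mu}(t),\sigma(t))+\boldsymbol{R}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))+\epsilon(t)\boldsymbol{\mu}^{i}(t)\big)\big],\tag{7}$$

for all $i\in[N]$, where

$$\begin{aligned}\boldsymbol{Q}_{i}(\boldsymbol{\mu}(t),\sigma(t))&=\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t),\sigma(t))-\boldsymbol{M}_{i}(\boldsymbol{\mu}(t)),\\\boldsymbol{R}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))&=\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))-\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t),\sigma(t)),\\\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))&=\hat{J}_{i}(t)\frac{\mathbf{x}^{i}(t)-\boldsymbol{\mu}^{i}(t)}{\sigma^{2}(t)},\end{aligned}$$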

and $\tilde{\boldsymbol{M}}_{i}(\cdot)=(\tilde{M}_{i,1}(\cdot),\ldots,\tilde{M}_{i,d}(\cdot))^{\top}$ is the $d$ -dimensional mapping with the following elements:

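$$\tilde{M}_{i,k}(\boldsymbol{\mu}(t),\sigma(t))=\frac{\partial\tilde{J}_{i}(\boldsymbol{\mu}(t),\sigma(t))}{\partial\mu^{i}_{k}},\mbox{ for }k\in[d].$$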

The vector $\boldsymbol{M}(\boldsymbol{\mu}(t))=(\boldsymbol{M}_{1}(\boldsymbol{\mu}(t)),\ldots,\boldsymbol{M}_{N}(\boldsymbol{\mu}(t)))$ corresponds to the gradient term in stochastic approximation procedures, whereas

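$$\boldsymbol{Q}(\boldsymbol{\mu}(t),\sigma(t))=(\boldsymbol{Q}_{1}(\boldsymbol{\mu}(t),\sigma(t)),\ldots,\boldsymbol{Q}_{N}(\boldsymbol{\mu}(t),\sigma(t)))$$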

is a disturbance of the gradient term. Finally,

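$$\boldsymbol{R}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))=(\boldsymbol{R}_{1}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t)),\ldots,\boldsymbol{R}_{N}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t)))$$

is a martingale difference. Namely, according to Lemma 1,

$$\boldsymbol{R}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))=\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))-\mathrm{E}_{\mathbf{x}(t)}\{\boldsymbol{F}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t),\sigma(t))\},\quad i\in[N].\tag{11}$$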

To ensure boundedness of $\boldsymbol{\mu}(t)$ (Lemma 3) we bound the martingale term above (see Inequality (IV-A)). To bound the disturbance of the gradients $\boldsymbol{Q}(\boldsymbol{\mu}(t),\sigma(t))$ (see Equation (42)), we observe that the mapping $\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t))$ evaluated at $\boldsymbol{\mu}(t)$ is equivalent to the game mapping in mixed strategies (please see Appendix for the proof of this observation). That is,

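$$\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t))=\int_{\mathbb{R}^{Nd}}\boldsymbol{M}_{i}(\boldsymbol{x})\,p(\boldsymbol{\mu}(t),\boldsymbol{x},\sigma(t))\,d\boldsymbol{x}.\tag{12}$$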

In contrast to stochastic approximation algorithms and the proof in [15], we have an additional term $\epsilon(t)\boldsymbol{\mu}(t)$ in order to address merely monotone game mappings. As such, to bound $\boldsymbol{\mu}(t)$ we also relate the variations of the sequence $\boldsymbol{\mu}(t)$ to those of the Tikhonov sequence defined below. Let $\boldsymbol{y}(t)=(\boldsymbol{y}^{1}(t),\ldots,\boldsymbol{y}^{N}(t))$ denote the solution of the variational inequality $VI(\boldsymbol{A},\boldsymbol{M}(\boldsymbol{y})+\epsilon(t)\boldsymbol{y})$, namely

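$$\boldsymbol{y}(t)\in SOL(\boldsymbol{A},\boldsymbol{M}(\boldsymbol{y})+\epsilon(t)\boldsymbol{y}).\tag{13}$$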

The sequence $\{\boldsymbol{y}(t)\}$ is known as the Tikhonov sequence and enjoys the following two important properties.

Theorem 3

(Theorem 12.2.3 in [9]) Under Assumptions 2, 3, and 4, $\boldsymbol{y}(t)$ defined in (13) exists and is unique for each $t$ . Moreover, for $\epsilon(t)\downarrow 0$ , $\boldsymbol{y}(t)$ is uniformly bounded and converges to the least norm solution of $VI(\boldsymbol{A},\boldsymbol{M})$ .

Lemma 2

(Lemma 3 in [6]) Under Assumption 2

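$$\|\boldsymbol{y}(t)-\boldsymbol{y}(t-1)\|\leq M_{\boldsymbol{y}}\frac{|\epsilon(t-1)-\epsilon(t)|}{\epsilon(t)},\quad\forall t\geq 1,$$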

where $M_{\boldsymbol{y}}$ is a uniform bound on the norm of the Tikhonov sequence, i.e. $\|\boldsymbol{y}(t)\|\leq M_{\boldsymbol{y}}$ for all $t\geq 0$ .
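The Tikhonov sequence and the bound of Lemma 2 can be illustrated on a small example that admits a closed form. The sketch assumes an unconstrained problem $\boldsymbol{A}=\mathbb{R}^{2}$ with the hypothetical monotone mapping $\boldsymbol{M}(\boldsymbol{y})=B\boldsymbol{y}-\boldsymbol{q}$, $B=\begin{pmatrix}1&1\\1&1\end{pmatrix}$, $\boldsymbol{q}=(2,2)^{\top}$. Its solution set is the line $y_{1}+y_{2}=2$, whose least norm element is $(1,1)$, and the regularized problem reduces to the linear system $(B+\epsilon I)\boldsymbol{y}=\boldsymbol{q}$. The example, the schedule exponent, and the function names are illustrative assumptions.

```python
def tikhonov_point(eps):
    """Solve (B + eps*I) y = q for B = [[1, 1], [1, 1]], q = (2, 2) by Cramer's rule."""
    b11, b12, b21, b22 = 1.0 + eps, 1.0, 1.0, 1.0 + eps
    q1, q2 = 2.0, 2.0
    det = b11 * b22 - b12 * b21
    return [(q1 * b22 - b12 * q2) / det, (b11 * q2 - b21 * q1) / det]

def norm(v):
    """Euclidean norm of a list of floats."""
    return sum(vk * vk for vk in v) ** 0.5

def eps_schedule(t, c=1.0 / 27.0):
    """Regularization eps(t) = t^(-c) for t >= 1, as in Assumption 6."""
    return t ** (-c)

def lemma2_holds(t, M_y=2.0 ** 0.5):
    """Check ||y(t) - y(t-1)|| <= M_y |eps(t-1) - eps(t)| / eps(t) for t >= 2."""
    e_prev, e_cur = eps_schedule(t - 1), eps_schedule(t)
    y_prev, y_cur = tikhonov_point(e_prev), tikhonov_point(e_cur)
    lhs = norm([u - v for u, v in zip(y_cur, y_prev)])
    rhs = M_y * abs(e_prev - e_cur) / e_cur
    return lhs <= rhs
```

Here $\boldsymbol{y}(t)=\frac{2}{2+\epsilon(t)}(1,1)^{\top}$, so $\|\boldsymbol{y}(t)\|\leq\sqrt{2}$ and $\boldsymbol{y}(t)\to(1,1)$ as $\epsilon(t)\downarrow 0$. Among the infinitely many solutions of the unregularized problem, the regularization selects the least norm one, in line with Theorem 3.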

With the results above in place, we connect the squared distance $\|\boldsymbol{\mu}-\boldsymbol{y}(t)\|^{2}$ to the squared distance $\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|^{2}$ for any $\boldsymbol{\mu}\in\boldsymbol{A}$ and $t\geq 1$ . Due to the triangle inequality,

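$$\begin{aligned}\|\boldsymbol{\mu}-\boldsymbol{y}(t)\|&\leq\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|+\|\boldsymbol{y}(t-1)-\boldsymbol{y}(t)\|\\&\leq\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|+M_{\boldsymbol{y}}\frac{|\epsilon(t-1)-\epsilon(t)|}{\epsilon(t)},\end{aligned}\tag{14}$$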

where in the last inequality we used Lemma 2. Hence, by taking into account that for any $a,b\in\mathbb{R}$ and $\theta>0$

$$2ab\leq\theta a^{2}+\frac{b^{2}}{\theta},$$

we conclude from (14) that for any $\theta>0$

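$$\|\boldsymbol{\mu}-\boldsymbol{y}(t)\|^{2}\leq(1+\theta)\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|^{2}+\left(1+\frac{1}{\theta}\right)M_{\boldsymbol{y}}^{2}\frac{|\epsilon(t-1)-\epsilon(t)|^{2}}{\epsilon^{2}(t)}.\tag{15}$$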

The above bound serves as the main new inequality in order to show almost-sure boundedness of $\|\boldsymbol{\mu}(t)\|$ in comparison to non-regularized stochastic gradient procedures.
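For completeness, the step from (14) to (15) can be spelled out. The shorthand $a=\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|$ and $b=M_{\boldsymbol{y}}\frac{|\epsilon(t-1)-\epsilon(t)|}{\epsilon(t)}$ is introduced here only for this derivation. Squaring (14) and applying $2ab\leq\theta a^{2}+b^{2}/\theta$ gives:

```latex
\begin{align*}
\|\boldsymbol{\mu}-\boldsymbol{y}(t)\|^{2}
  &\leq (a+b)^{2} = a^{2} + 2ab + b^{2} \\
  &\leq a^{2} + \theta a^{2} + \frac{b^{2}}{\theta} + b^{2}
   = (1+\theta)\,a^{2} + \Big(1+\frac{1}{\theta}\Big)\,b^{2}.
\end{align*}
```

Choosing a small $\theta$ keeps the factor in front of $a^{2}$ close to one at the price of a large factor $1+1/\theta$ in front of $b^{2}$, which is why the proof of Lemma 3 ties $\theta$ to the vanishing product $\beta(t)\epsilon(t)$.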

Lemma 3

Let Assumptions 2-6 hold in $\Gamma(N,\{A_{i}\},\{J_{i}\})$ and $\boldsymbol{\mu}(t)$ be the vector updated in the run of the payoff-based algorithm (7). Then, $\Pr\{\sup_{t\geq 0}\|\boldsymbol{\mu}(t)\|<\infty\}=1$ .

In the following, for simplicity in notation, we omit the argument $\sigma(t)$ in the terms $\tilde{\boldsymbol{M}}$ , $\boldsymbol{Q}$ , and $\boldsymbol{R}$ . In certain derivations, for the same reason we omit the time parameter $t$ as well.

Proof:

Define $V(t,\boldsymbol{\mu})=\|\boldsymbol{\mu}-\boldsymbol{y}(t-1)\|^{2}$ , where $\boldsymbol{y}(t)$ is the Tikhonov sequence defined by (13). We consider the generating operator of the Markov process $\boldsymbol{\mu}(t)$

[TABLE]

and aim to show that $LV(t,\boldsymbol{\mu})$ satisfies the following decay

[TABLE]

where $\psi\geq 0$ on $\mathbb{R}^{Nd}$ , $\phi(t)>0$ , $\forall t$ , $\sum_{t=0}^{\infty}\phi(t)<\infty$ , $\alpha(t)>0$ , $\sum_{t=0}^{\infty}\alpha(t)=\infty$ . This enables us to apply Theorem 2.5.2 in [8] to directly conclude almost sure boundedness of $\boldsymbol{\mu}(t)$ .

Let us bound the growth of $V(t+1,\boldsymbol{\mu})$ in terms of $V(t,\boldsymbol{\mu})$ . Let $\theta=\beta(t)\epsilon(t)$ in (15). From Assumption 6 b), $\left(1+\frac{1}{\beta(t)\epsilon(t)}\right)\frac{|\epsilon(t-1)-\epsilon(t)|^{2}}{\epsilon^{2}(t)}\rightarrow 0$ as $t\to\infty$ . Hence, $\forall\boldsymbol{\mu}\in\boldsymbol{A}$

[TABLE]

From the procedure for the update of $\boldsymbol{\mu}(t)$ , the non-expansion property of the projection operator, the fact that $\boldsymbol{y}(t)$ belongs to $SOL(\boldsymbol{A},\boldsymbol{M}(\boldsymbol{y})+\epsilon(t)\boldsymbol{y})$ , namely, that $\forall i\in[N]$

[TABLE]

we obtain that for any $i\in[N]$

[TABLE]

where, for ease of notation, we have defined

[TABLE]

Our goal is to bound $\mathrm{E}\{\|\boldsymbol{\mu}^{i}(t+1)-\boldsymbol{y}^{i}(t)\|^{2}|\boldsymbol{\mu}(t)=\boldsymbol{\mu}\}$ above, and use this bound in constructing Inequality (17). As such, we expand $\boldsymbol{G}_{i}$ as below and bound the terms in the expansion.

[TABLE]

Due to Assumption 4, we conclude that

[TABLE]

where in the last inequalities in (IV-A)-(39) we used (18). Let us analyze the terms containing the disturbance of gradient, namely $\boldsymbol{Q}_{i}$ , in Equation (31). Since $\boldsymbol{Q}_{i}(\boldsymbol{\mu}(t))=\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t))-\boldsymbol{M}_{i}(\boldsymbol{\mu}(t))$ , due to Assumption 2 and Equation (12), we obtain

[TABLE]

where the last equality is due to the fact that the first central absolute moment of a random variable with a normal distribution $\mathcal{N}(\mu,\sigma)$ is $O(\sigma)$ . The estimation above and (18) imply, in particular, that for any $\boldsymbol{\mu}\in\boldsymbol{A}$

[TABLE]

Finally, we bound the martingale term $\|\boldsymbol{R}_{i}(\mathbf{x}(t),\boldsymbol{\mu}(t))\|^{2}$ .

[TABLE]

where the first inequality is due to the fact that $\mathrm{E}(\xi-\mathrm{E}\xi)^{2}\leq\mathrm{E}\xi^{2}$ and taking into account (11), and the second inequality is due to Assumption 5, with $f_{i}(\boldsymbol{\mu},\sigma(t))$ being a quadratic function of $\boldsymbol{\mu}$ and $\sigma(t)$, $i\in[N]$. Bringing the inequalities (IV-A)-(IV-A) into the inequality (20), and taking into account (18), the Cauchy-Schwarz inequality, and the martingale properties in (11) of $\boldsymbol{R}_{i}$, $i\in[N]$, we get

[TABLE]

where in the last inequality we used the fact that $\epsilon(t)\to 0$ (Assumption 6 a)), $\gamma(t)\to 0$ , and $\sigma(t)\to 0$ for all $i\in[N]$ as $t\to\infty$ (Assumption 6 c), d)). Thus, taking into account Assumption 6 c), d) and (49), we obtain

[TABLE]

Using the first inequality in (18), we get

[TABLE]

We conclude from (63) and (58) that

[TABLE]

where

[TABLE]

and the second inequality above is due to the fact that

[TABLE]

According to Assumption 6 b)-c), $\sum_{t=0}^{\infty}h(t)<\infty$ . Furthermore, from Assumption 6 a) $\sum_{t=0}^{\infty}\beta(t)=\infty$ . Taking into account this, (IV-A), and monotonicity of $\boldsymbol{M}$ implying

[TABLE]

we conclude that $LV(t,\boldsymbol{\mu})$ satisfies the decay needed for the application of Theorem 2.5.2 in [8] and consequently, $\boldsymbol{\mu}(t)$ is finite almost surely for any $t\in\mathbb{Z}_{+}$ irrespective of $\boldsymbol{\mu}(0)$ . ∎

IV-B Convergence of the Algorithm

Fortunately, the derivations in the previous section in proving boundedness of the iterates can be used to also prove convergence of the algorithm. In particular, we use Inequality (58), which bounds the decay of the sequence $\mathrm{E}[\|\boldsymbol{\mu}(t+1)-\boldsymbol{y}(t)\|^{2}|\boldsymbol{\mu}(t)]$ in terms of $\|\boldsymbol{\mu}-\boldsymbol{y}(t)\|^{2}$ . We can show that this decay satisfies the conditions for applying Lemma 10 in [10]. From this, it can readily be inferred that random variables $\|\boldsymbol{\mu}(t)-\boldsymbol{y}(t-1)\|$ converge to zero. In essence, the approach is similar to showing that $V(t,\mu)$ serves as a stochastic Lyapunov function for the sequence of random variables.

Proof:

(of Theorem 2) First, rewrite (58) as follows:

[TABLE]

where $\mathcal{F}_{t}$ is the $\sigma$ -algebra generated by the random variables $\{\mathbf{x}(k),\boldsymbol{\mu}(k)\}_{k=0}^{t}$ and $h(t)$ is defined in (68). In (71) to get the first inequality we used (70), to get the second inequality we used Lemma 3, namely the fact that $\boldsymbol{\mu}(t)$ is almost surely bounded for all $t\in\mathbb{Z}_{+}$ , to get the third inequality we used (18), and to get the last inequality we used the fact that $(1-2\epsilon(t)\beta(t))(1+\epsilon(t)\beta(t))<(1-\epsilon(t)\beta(t))$ .

From Assumption 6 and the choices of $\gamma(t)$, $\sigma(t)$, $\epsilon(t)$, we get $h(t)=O\left(\frac{1}{t^{l}}\right)$ and $\epsilon(t)\beta(t)=\frac{1}{t^{m}}$, with $l>1$, $m\leq 1$. Thus,

$$\lim_{t\to\infty}\frac{h(t)}{\epsilon(t)\beta(t)}=0.$$

Assumption 6 d), the fact that $\sum_{t=0}^{\infty}h(t)<\infty$, and the above result in the decay (71) imply that we can apply Lemma 10 in [10] to the sequence $\|\boldsymbol{\mu}(t+1)-\boldsymbol{y}(t)\|^{2}$ to conclude its almost sure convergence to $0$ as $t\to\infty$. Next, by taking into account Theorem 3 and Theorem 1, we obtain that

[TABLE]

where $\boldsymbol{a}^{*}$ is the least norm Nash equilibrium in the game $\Gamma(N,\{A_{i}\},\{J_{i}\})$. Finally, Assumption 6 implies that $\lim_{t\to\infty}\sigma(t)=0$. Taking into account that $\mathbf{x}(t)\sim\mathcal{N}(\boldsymbol{\mu}(t),\sigma(t))$, we conclude that $\mathbf{x}(t)$ converges weakly to a Nash equilibrium $\boldsymbol{a}^{*}=\boldsymbol{\mu}^{*}$. Moreover, according to the Portmanteau Lemma [5], this convergence is also in probability.

∎

V Simulation Results

As noted in the introduction, the work [4] provides a counterexample showing that the class of gradient-based procedures proposed in [16] and [15] fails to converge to a Nash equilibrium if the game mapping is merely monotone. In this section, we demonstrate that including the Tikhonov regularization term in the algorithm of [15] rectifies this issue. In particular, the payoff-based algorithm proposed here converges to the Nash equilibrium in the game under consideration.

Following the discussion in [4], we consider the game with two players, whose action sets are the one-dimensional sets $A_{1}=A_{2}=[-1,1]$ and whose cost functions are $J_{1}(a_{1},a_{2})=a_{1}a_{2}$ and $J_{2}(a_{1},a_{2})=-a_{1}a_{2}$, respectively. It can be verified that the game mapping $M(a_{1},a_{2})=(a_{2},-a_{1})$ is monotone and that the unique Nash equilibrium in this game is $\boldsymbol{a}^{*}=(0,0)$. By implementing the payoff-based algorithm (7) with randomly chosen initial values $\mu^{1}(0)$ and $\mu^{2}(0)$ and the parameters $\gamma(t)$, $\sigma(t)$, and $\epsilon(t)$ set up according to Remark 1, we obtain the updates for the mean values $\mu^{1}(t)$ and $\mu^{2}(t)$ of the players, presented in Figure 1. As we can see, the means reach a sufficiently small neighborhood of the Nash equilibrium after approximately $900$ iterations and continue to approach it as the procedure runs further.
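The experiment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `run_payoff_based` and the step-size exponents are hypothetical choices made here for illustration and are not the values prescribed by Remark 1. The update follows the form of the mean update with the factor $\sigma^{2}(t)$ cancelled in the first term, and the projection onto $[-1,1]$ is implemented by clipping.

```python
import numpy as np


def run_payoff_based(steps=5000, seed=0):
    """Payoff-based learning sketch for J1 = a1*a2, J2 = -a1*a2 on [-1, 1]^2."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-1.0, 1.0, size=2)  # randomly chosen initial means
    history = np.empty((steps + 1, 2))
    history[0] = mu
    for t in range(1, steps + 1):
        # Hypothetical schedules, assumed for illustration only
        # (NOT the parameter choices of Remark 1).
        gamma = t ** -0.5
        sigma = t ** -0.2
        eps = t ** -0.1
        # Each agent samples her action from N(mu_i, sigma^2).
        x = mu + sigma * rng.standard_normal(2)
        # Agents observe only the values of their own costs.
        cost = np.array([x[0] * x[1], -x[0] * x[1]])
        # gamma*sigma^2*(J*(x - mu)/sigma^2 + eps*mu), with sigma^2
        # cancelled in the payoff-based term.
        direction = cost * (x - mu) + sigma ** 2 * eps * mu
        # Projection onto A_i = [-1, 1].
        mu = np.clip(mu - gamma * direction, -1.0, 1.0)
        history[t] = mu
    return history


if __name__ == "__main__":
    trajectory = run_payoff_based()
    print("final means:", trajectory[-1])
```

Plotting the two columns of `trajectory` against the iteration index gives a picture of the same kind as Figure 1; how quickly the means approach $(0,0)$ depends on the assumed schedules and on the random seed.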

VI Conclusions

We proposed a payoff-based algorithm for learning Nash equilibria in convex games with monotone game mappings. Our algorithm relies on a suitable regularization to handle mere monotonicity. The convergence proof relies on the analysis of the Tikhonov sequence related to the regularization and on well-established results on boundedness and convergence of stochastic processes. Our current work addresses establishing the convergence rate of the algorithm under suitable assumptions.

-A Supporting Theorems

Let $\{\mathbf{X}(t)\}_{t}$ , $t\in\mathbb{Z}_{+}$ , be a discrete-time Markov process on some state space $E\subseteq\mathbb{R}^{d}$ , namely $\mathbf{X}(t)=\mathbf{X}(t,\omega):\mathbb{Z}_{+}\times\Omega\to E$ , where $\Omega$ is the sample space of the probability space on which the process $\mathbf{X}(t)$ is defined. The transition function of this chain, namely $\Pr\{\mathbf{X}(t+1)\in\Gamma|\mathbf{X}(t)=\mathbf{X}\}$ , is denoted by $P(t,\mathbf{X},t+1,\Gamma)$ , $\Gamma\subseteq E$ .

Definition 4

The operator $L$ defined on the set of measurable functions $V:\mathbb{Z}_{+}\times E\to\mathbb{R}$ , $\mathbf{X}\in E$ , by

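$$LV(t,\mathbf{X})=\int P(t,\mathbf{X},t+1,dy)\left[V(t+1,y)-V(t,\mathbf{X})\right]=\mathrm{E}\left[V(t+1,\mathbf{X}(t+1))\mid\mathbf{X}(t)=\mathbf{X}\right]-V(t,\mathbf{X})$$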

is called a generating operator of a Markov process $\{\mathbf{X}(t)\}_{t}$ .

Next, we formulate the following theorem for discrete-time Markov processes, which is proven in [8], Theorem 2.5.2.

Theorem 4

Consider a Markov process $\{\mathbf{X}(t)\}_{t}$ and suppose that there exists a function $V(t,\mathbf{X})\geq 0$ such that $\inf_{t\geq 0}V(t,\mathbf{X})\to\infty$ as $\|\mathbf{X}\|\to\infty$ and

[TABLE]

where $\psi\geq 0$ on $\mathbb{R}\times\mathbb{R}^{d}$ , $f(t)>0$ , $\sum_{t=0}^{\infty}f(t)<\infty$ . Let $\alpha(t)$ be such that $\alpha(t)>0$ , $\sum_{t=0}^{\infty}\alpha(t)=\infty$ . Then, almost surely $\sup_{t\geq 0}\|\mathbf{X}(t,\omega)\|=R(\omega)<\infty$ .

The following result on the convergence of stochastic processes is proven as Lemma 10 (page 49) in [10].

Theorem 5

Let $v_{0},\ldots,v_{k}$ be a sequence of random variables, $v_{k}\geq 0$ , $\mathrm{E}v_{0}<\infty$ and let

$$\mathrm{E}(v_{k+1}\mid\mathcal{F}_{k})\leq(1-\alpha_{k})v_{k}+\beta_{k},$$

where $\mathcal{F}_{k}$ is the $\sigma$ -algebra generated by the random variables $\{v_{0},\ldots,v_{k}\}$ , $0<\alpha_{k}<1$ , $\sum_{k=0}^{\infty}\alpha_{k}=\infty$ , $\beta_{k}\geq 0$ , $\sum_{k=0}^{\infty}\beta_{k}<\infty$ , $\lim_{k\to\infty}\frac{\beta_{k}}{\alpha_{k}}=0$ . Then $v_{k}\to 0$ almost surely, $\mathrm{E}v_{k}\to 0$ as $k\to\infty$ .
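To see why the conditions of Theorem 5 force convergence, it helps to iterate the deterministic analogue of the recursion, assuming it takes the standard form $v_{k+1}=(1-\alpha_{k})v_{k}+\beta_{k}$ with equality. The sketch below uses the hypothetical choices $\alpha_{k}=(k+2)^{-m}$ and $\beta_{k}=(k+2)^{-l}$ with $m\leq 1<l$, mirroring the rates of $\epsilon(t)\beta(t)$ and $h(t)$ used in the proof of Theorem 2; the function name and default values are illustrative assumptions.

```python
def iterate_recursion(steps=10000, v0=1.0, m=0.5, l=1.5):
    """Iterate v_{k+1} = (1 - alpha_k) * v_k + beta_k (deterministic analogue)."""
    v = v0
    values = [v]
    for k in range(steps):
        alpha = (k + 2) ** -m  # 0 < alpha_k < 1, sum of alpha_k diverges (m <= 1)
        beta = (k + 2) ** -l   # beta_k >= 0, sum of beta_k converges (l > 1)
        # beta_k / alpha_k = (k + 2) ** (m - l) -> 0 because l > m
        v = (1.0 - alpha) * v + beta
        values.append(v)
    return values


if __name__ == "__main__":
    vals = iterate_recursion()
    print("v_0 =", vals[0], " v_final =", vals[-1])
```

The contraction factor $(1-\alpha_{k})$ removes the initial condition because $\sum_{k}\alpha_{k}=\infty$, while the perturbation $\beta_{k}$ is eventually negligible relative to $\alpha_{k}$, so the iterates are driven toward zero.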

-B Verification of Equation (12)

We will show that the mapping $\tilde{\boldsymbol{M}}_{i}(\boldsymbol{\mu}(t),\sigma(t))$ (see (10)) evaluated at $\boldsymbol{\mu}(t)$ is equivalent to the extended game mapping:

[TABLE]

Note that for simplicity in notation, we drop the dependence on $\sigma(t)$ and on $t$ . Now, using the notations

[TABLE]

we have that for any $i\in[N]$ , $k\in[d]$ , $\tilde{M}_{i,k}(\boldsymbol{\eta})$

[TABLE]

In the above, for the second equality, we used Lemma 1 to enable differentiation under the integral, and for the last equality, we used the fact that, according to Assumption 5,

[TABLE]

for any fixed $\mu_{k}^{i}$ , $\boldsymbol{x}^{-i}$ . Now, by definition of $\boldsymbol{M}_{i}(\boldsymbol{x})$ , we have that

[TABLE]

as desired.
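As a concrete illustration of this identity (not part of the original proof), consider the two-player game of Section V with $J_{1}(a_{1},a_{2})=a_{1}a_{2}$ and $M(a_{1},a_{2})=(a_{2},-a_{1})$. Assume, as in the sampling scheme of the algorithm, that the played actions $x_{1}$ and $x_{2}$ are independent Gaussian random variables with means $\mu_{1}$, $\mu_{2}$ and common variance $\sigma^{2}$; player indices are written as subscripts here to avoid confusion with powers. Since $\mathrm{E}[x_{1}(x_{1}-\mu_{1})]=\mathrm{E}[(x_{1}-\mu_{1})^{2}]+\mu_{1}\mathrm{E}[x_{1}-\mu_{1}]=\sigma^{2}$, the expected value of the payoff-based term for player 1 is

$$\mathrm{E}\left[J_{1}(x_{1},x_{2})\frac{x_{1}-\mu_{1}}{\sigma^{2}}\right]=\frac{1}{\sigma^{2}}\,\mathrm{E}[x_{2}]\,\mathrm{E}[x_{1}(x_{1}-\mu_{1})]=\frac{\mu_{2}\sigma^{2}}{\sigma^{2}}=\mu_{2}.$$

On the other hand, averaging the first component of the game mapping over the same distribution gives $\mathrm{E}[M_{1}(x_{1},x_{2})]=\mathrm{E}[x_{2}]=\mu_{2}$, so the two quantities coincide in this example. Because $M$ is linear here, both also equal $M_{1}(\mu_{1},\mu_{2})$.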

Bibliography

[1] B. Bharath and V. S. Borkar. Stochastic approximation algorithms: Overview and recent trends. Sadhana, 24(4):425–452, 1999.
[2] L. E. Blume. The statistical mechanics of strategic interaction. Games and Economic Behavior, 5(3):387–424, 1993.
[3] P. Frihauf, M. Krstic, and T. Basar. Nash equilibrium seeking in noncooperative games. IEEE Transactions on Automatic Control, 57(5):1192–1207, 2012.
[4] S. Grammatico. Comments on “Distributed robust adaptive equilibrium computation for generalized convex games” [Automatica 63 (2016) 82–91]. Automatica, 97:186–188, 2018.
[5] A. Klenke. Probability theory: a comprehensive course. Springer, London, 2008.
[6] J. Koshal, A. Nedić, and U. Shanbhag. Single timescale regularized stochastic approximation schemes for monotone Nash games under uncertainty. In IEEE Conference on Decision and Control, pages 231–236, 2010.
[7] J. R. Marden and J. S. Shamma. Revisiting log-linear learning: Asynchrony, completeness and payoff-based implementation. Games and Economic Behavior, 75(2):788–808, 2012.
[8] M. B. Nevelson and R. Z. Khasminskii. Stochastic approximation and recursive estimation. American Mathematical Society, 1973.