Decentralized Learning for Optimality in Stochastic Dynamic Teams and Games with Local Control and Global State Information

Bora Yongacoglu; Gürdal Arslan; Serdar Yüksel

arXiv:1903.05812·math.OC·March 28, 2024·IEEE Trans. Autom. Control.

TL;DR

This paper introduces a novel decentralized learning algorithm that guarantees convergence to team optimal policies in stochastic dynamic teams and games, using only local information and independent learning agents.

Contribution

It presents the first formal convergence guarantees for independent learners achieving team optimality in stochastic dynamic teams and common interest games.

Findings

1. Algorithm guarantees convergence to team optimal policies.

2. Agents only use local controls, costs, and global state information.

3. First formal proof of independent learners achieving team optimality.

Equations

$$I_{0}=\{x_{0}\},\quad I_{t+1}=I_{t}\cup\{x_{t+1},u_{t},c(x_{t},u_{t})\},\quad t\in\mathbb{N}.$$

$$J_{x}(\theta):=E^{\theta}\left(\sum_{t\in\mathbb{N}}\beta^{t}c(x_{t},u_{t})\,\Big|\,x_{0}=x\right),\quad\forall x\in\mathbb{X}$$

$$\begin{aligned}Q_{t+1}(x_{t},u_{t})&=\big(1-\alpha_{t}(x_{t},u_{t})\big)Q_{t}(x_{t},u_{t})+\alpha_{t}(x_{t},u_{t})\Big(c(x_{t},u_{t})+\beta\min_{v\in\mathbb{U}}Q_{t}(x_{t+1},v)\Big),\\ Q_{t+1}(x,u)&=Q_{t}(x,u)\quad\text{for all }(x,u)\neq(x_{t},u_{t}).\end{aligned}$$

$$I^{i}_{0}=\{x_{0}\},\quad I^{i}_{t+1}=I^{i}_{t}\cup\{x_{t+1},u^{i}_{t},c^{i}(x_{t},\textbf{u}_{t})\},\quad t\in\mathbb{N}.$$

$$J^{i}_{x}(\bm{\theta}):=E^{\bm{\theta}}\bigg(\sum_{t\in\mathbb{N}}(\beta^{i})^{t}c^{i}(x_{t},\textbf{u}_{t})\,\Big|\,x_{0}=x\bigg),\quad\forall x\in\mathbb{X}$$

$$J^{i}_{x}(\pi^{*i},\theta^{-i})=\min_{\pi^{i}\in\Delta^{i}}J^{i}_{x}(\pi^{i},\theta^{-i}),\quad\forall x\in\mathbb{X}.$$

$$J^{i}_{x}(\pi^{*i},\theta^{-i})<J^{i}_{x}(\pi^{i},\theta^{-i}),\quad\text{for some }x\in\mathbb{X}.$$

$$\text{BR}^{i}(\theta^{-i}):=\{\pi^{*i}\in\Pi^{i}:\pi^{*i}\text{ is a best reply to }\theta^{-i}\}.$$

$$\text{BR}^{i}(\theta^{-i})=\Big\{\pi^{i}\in\Pi^{i}:Q^{*i}_{\theta^{-i}}(x,\pi^{i}(x))=\min_{v^{i}\in\mathbb{U}^{i}}Q^{*i}_{\theta^{-i}}(x,v^{i}),\ \forall x\in\mathbb{X}\Big\}.$$

$$c^{i}=c,\quad\beta^{i}=\beta,\quad\forall\,\text{DM}^{i}.$$

$$J^{i}_{x}(\pi^{*})=\inf_{\pi\in\Pi}J^{i}_{x}(\pi)\quad\forall i,\ x\in\mathbb{X}.$$

$$\inf_{\pi\in\Pi}\sum_{x\in\mathbb{X}}J^{i}_{x}(\pi)<\sum_{x\in\mathbb{X}}J^{i}_{x}(\tilde{\pi}),\quad\forall\,\text{DM}^{i}.$$

$$\sum_{x\in\mathbb{X}}Q^{i}_{\pi^{*-i}}(x,\pi^{*i}(x))<\sum_{x\in\mathbb{X}}Q^{i}_{\tilde{\pi}^{-i}}(x,\tilde{\pi}^{i}(x)).$$

$$\sum_{x\in\mathbb{X}}Q^{i}_{\pi^{*-i}}(x,\pi^{*i}(x))=\sum_{x\in\mathbb{X}}J^{i}_{x}(\pi^{*})\leq\sum_{x\in\mathbb{X}}\min_{u^{i}\in\mathbb{U}^{i}}Q^{i}_{\tilde{\pi}^{-i}}(x,u^{i})<\sum_{x\in\mathbb{X}}Q^{i}_{\tilde{\pi}^{-i}}(x,\tilde{\pi}^{i}(x)).$$

$$\pi^{i}_{k+1}\sim(1-\lambda^{i})\,\text{Unif}\big(\text{BR}^{i}(\pi^{-i}_{k})\big)+\lambda^{i}\,\mathbb{I}_{\pi^{i}_{k}},$$

$$R^{i,\lambda^{i}}(\tilde{\pi}^{i}|\pi^{i},B^{i}):=\begin{cases}1,&\text{if }\pi^{i}\in B^{i},\ \tilde{\pi}^{i}=\pi^{i}\\ \lambda^{i},&\text{if }\pi^{i}\notin B^{i},\ \tilde{\pi}^{i}=\pi^{i}\\ \frac{1-\lambda^{i}}{|B^{i}|},&\text{if }\pi^{i}\notin B^{i},\ \tilde{\pi}^{i}\in B^{i}\\ 0,&\text{otherwise}\end{cases},$$

$$\mu^{*}_{\gamma,\kappa,h}(\Pi_{\text{opt}})\geq 1-\epsilon/2.$$

$$\lim_{n\to\infty}\mu_{0}A^{n}_{\gamma,\kappa,h}=\mu^{*}_{\gamma,\kappa,h}.$$

$$\begin{aligned}\sum_{\pi^{*}\in\Pi_{\text{opt}}}\mu^{*}_{\gamma,\kappa,h}(\pi^{*})&=\sum_{\pi^{*}\in\Pi_{\text{opt}}}\sum_{\pi\in\Pi_{\text{opt}}}\mu^{*}_{\gamma,\kappa,h}(\pi)A_{\gamma,\kappa,h}(\pi,\pi^{*})\\ &\quad+\sum_{\pi^{*}\in\Pi_{\text{opt}}}\sum_{\pi\notin\Pi_{\text{opt}}}\mu^{*}_{\gamma,\kappa,h}(\pi)A_{\gamma,\kappa,h}(\pi,\pi^{*})\\ &\geq\sum_{\pi\in\Pi_{\text{opt}}}\mu^{*}_{\gamma,\kappa,h}(\pi)\prod_{i}(1-\gamma^{i})\\ &\quad+\sum_{\pi\notin\Pi_{\text{opt}}}\mu^{*}_{\gamma,\kappa,h}(\pi)\prod_{i}\big(\kappa^{i}/|\Pi^{i}|\big)\end{aligned}$$

$$\sum_{\pi^{*}\in\Pi_{\text{opt}}}\mu^{*}_{\gamma,\kappa,h}(\pi^{*})\geq 1-\frac{\sum_{i}\gamma^{i}}{\sum_{i}\gamma^{i}+\prod_{i}\big(\kappa^{i}/|\Pi^{i}|\big)}$$

$$\text{Pr}\big(x_{H+1}=x^{\prime}\,\big|\,x_{0}=x,\ u_{j}=\tilde{u}_{j},\ \forall j\in\{0,1,\dots,H\}\big)>0.$$

$$\bar{\gamma}_{\epsilon}(\kappa)\in(0,1),\quad\bar{W}_{\epsilon}(\gamma,\kappa)\in\mathbb{N}_{+},\quad\bar{T}_{\epsilon}(\gamma,\kappa,W_{max})\in\mathbb{N}_{+}$$

$$\gamma^{i}\in\big(0,\bar{\gamma}_{\epsilon}(\kappa)\big),\quad W^{i}\geq\bar{W}_{\epsilon}(\gamma,\kappa),\quad T_{k}\geq\bar{T}_{\epsilon}(\gamma,\kappa,W_{max})$$

$$\liminf_{k\in\mathbb{N}}\text{Pr}(\pi_{k}\in\Pi_{\text{opt}})\geq 1-\epsilon$$

$$\mu^{*}_{\gamma,\kappa}(\Pi_{\text{eq}})\geq 1-\epsilon/4.$$

$$\inf_{m\geq\bar{m},\ \mu_{0}\in\mathcal{P}(\Pi)}\big(\mu_{0}A^{m}_{\gamma,\kappa}\big)(\Pi_{\text{eq}})\geq 1-\epsilon/2.$$

Full text

Decentralized Learning for Optimality in Stochastic Dynamic Teams and Games with Local Control and Global State Information

A conference version [61] was presented at the 2019 Conference on Decision and Control and serves as an announcement of the partial results presented here without details. The conference version [61] does not contain the results on weakly acyclic games or any of the proofs presented here.

This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Bora Yongacoglu, Gürdal Arslan, and Serdar Yüksel

B. Yongacoglu and S. Yüksel are with the Department of Mathematics and Statistics, Queen’s University, Kingston, ON, Canada, email: {1bmy,[email protected]}. G. Arslan is with the University of Hawaii. Email: {[email protected]}.

Abstract

Stochastic dynamic teams and games are rich models for decentralized systems and challenging testing grounds for multi-agent learning. Previous work that guaranteed team optimality assumed stateless dynamics, or an explicit coordination mechanism, or joint-control sharing. In this paper, we present an algorithm with guarantees of convergence to team optimal policies in teams and common interest games. The algorithm is a two-timescale method that uses a variant of Q-learning on the finer timescale to perform policy evaluation while exploring the policy space on the coarser timescale. Agents following this algorithm are “independent learners”: they use only local controls, local cost realizations, and global state information, without access to controls of other agents. The results presented here are the first, to our knowledge, to give formal guarantees of convergence to team optimality using independent learners in stochastic dynamic teams and common interest games.

Index Terms:

Stochastic games; Stochastic optimal control; Cooperative control; Game Theory; Machine learning.

I Introduction

In modern control engineering applications, two challenges are becoming increasingly common: online problems and decentralization. In online problems, the system to be controlled is not initially known by the agent and must be learned. In decentralized systems, several autonomous decision-makers act in a shared environment. This paper is concerned with multi-agent reinforcement learning (MARL), which is at the intersection of these two challenges. We use stochastic games to model the shared environment, and we present algorithms suitable for stochastic dynamic teams under a particular decentralized information structure.

In online problems, important knowledge of the system to be controlled is initially unavailable to the controller. Classical methods for solving control problems, such as linear programming, dynamic programming, and convex analytic methods, cannot be implemented without access to the system model. Instead, the control agent must use observed feedback to learn control policies. Reinforcement learning has had considerable success in single-agent control problems, both in applications and in theory, where methods such as Q-learning [57, 58, 54] recover optimal policies when used in a stationary environment.

A second challenge comes from decentralization. Decentralized systems are characterized by multiple agents acting in a common environment with some local information available to each. The costs incurred by one agent in a decentralized system depend, in general, on its own actions, the actions of other agents, and the history of the system. Such coupled interactions are common in complex, real-world engineering applications. Some examples of systems that are inherently decentralized are sensor networks, stochastic networked control systems, Internet of Things, and energy systems.

Compared to the success of reinforcement learning in stationary single-agent problems, there are relatively few formal results on MARL. This is partly explained by the loss of stationarity: when multiple learning agents interact, a given agent will change its behaviour to exploit learned information. From the point-of-view of the remaining agents, this agent is a part of the environment, and so the environment is non-stationary [18]. Consequently, one of the fundamental assumptions made for single-agent theory does not hold in MARL, and theoretical guarantees do not carry over.

Stochastic games [13, 1, 10, 28] generalize both repeated games [14] and Markov decision problems (MDPs). Like repeated games, players in stochastic games must be strategic and respond to the policies used by other agents. Unlike repeated games, in which the same stage game is played at every time step, the stage game played at a given time in a stochastic game depends on the history of play, which is summarized by a state. As in MDPs, agents in stochastic games must select actions with the state process and its long-term cost implications in mind. As stochastic games provide a rich model for dynamic, strategic decision making, they are a popular framework for studying MARL [30].

Stochastic dynamic teams [20, 65] and common-interest games [3, 50] model cooperative systems and so are of special interest to decentralized control. In teams, all players incur the same stage costs and interests are perfectly aligned. Common interest games generalize teams in a natural way: in common interest games, agents do not necessarily incur identical stage costs, but there is a subset of joint policies which each agent strictly prefers to all other policies. Despite the incentive to coordinate behavior in common interest games, coordination is generally difficult in online problems when information is decentralized.

As we will outline in detail in Section II, there are relatively few theoretical results for stochastic games without control-sharing. Even when assuming full state observability at each agent rather than the more general assumption of partial state observability, there are no rigorous results that guarantee team optimality in truly stochastic teams without relying on control-sharing.

I-A Contributions

In this paper, we present a decentralized learning algorithm for playing stochastic common interest games, a class of games which model decentralized control problems and contain stochastic teams as a special case. We give formal guarantees of convergence to team optimal policies without use of control sharing among agents.

(i) In Theorem 1, we consider stochastic common interest games and introduce an algorithm (that only uses local cost and local action history and the common state of the system) that provably converges to a team optimal policy in a probabilistic sense that is made precise in the theorem. What makes this algorithm different from our prior work [2], which guaranteed convergence to equilibrium but not team optimal policies, is the utilization of a finite window of the most recent (noisy) aggregate cost scores to adaptively estimate the lowest possible cost for each decision maker.

(ii) Theorem 2 considers a specific implementation of our main algorithm in the context of weakly acyclic games. We show this algorithm leads to equilibrium policies in weakly acyclic games and, furthermore, if the game is also a common interest game then play will settle to a team optimal policy. This theorem strengthens one of the main results from [2].

(iii) In Theorem 3, we obtain convergence to team optimality in the stronger sense of almost sure convergence by using constant, preset aspiration levels. This result requires a stronger assumption that the preset aspiration levels separate the team optimal policies from the other policies. Theorem 3 also describes the long-run behaviour of the algorithm with constant aspirations when used in a general stochastic game.

These contributions are the first formal guarantees of achieving team optimality in stochastic common interest games under full state observability but no action sharing.

The remainder of the paper is organized as follows: Section II surveys related literature. In Section III, we specify the stochastic game model and provide relevant background. In Section IV, we present our main algorithm and state Theorem 1. Section V presents Theorem 2, which strengthens a result from [2]. Section VI considers a variant of the main algorithm and presents Theorem 3, which studies the variant algorithm’s long-term behaviour in general sum games. Section VII contains numerical results from a simulation study, and Section VIII concludes the paper. The proofs of our main technical results are contained in the appendices.

I-B Notation

$\mathbb{R}$ denotes the real numbers, $\mathbb{N}$ and $\mathbb{N}_{+}$ denote the nonnegative and positive integers, respectively. $\text{Pr}(\cdot)$ and $E(\cdot)$ denote the probability and the expectation, respectively. For a finite set $S$, $\mathcal{P}(S)$ denotes the set of probability distributions over $S$. For finite sets $S,S^{\prime}$, we let $\mathcal{P}(S^{\prime}|S)$ denote the set of stochastic kernels on $S^{\prime}$ given $S$. An element $\mathcal{T}\in\mathcal{P}(S^{\prime}|S)$ is a collection of probability distributions on $S^{\prime}$, with one distribution for each $s\in S$, and we write $\mathcal{T}(\cdot|s)$ for $s\in S$ to make this distributional dependence on $s$ explicit. We write $Y\sim f$ to denote that the random variable $Y$ has distribution $f$. If the distribution of $Y$ is a mixture of other distributions, say with mixture components $f_{i}$ and weights $p_{i}$ for $1\leq i\leq n$, we write $Y\sim\sum_{i=1}^{n}p_{i}f_{i}$. The Dirac distribution concentrated at $x\in\mathbb{R}$ is denoted $\mathbb{I}_{x}$. For a finite set $S$, $\text{Unif}(S)$ denotes the uniform distribution over $S$ and $2^{S}$ denotes the set of subsets of $S$. $(x)^{+}:=\max\{x,0\}$, for $x\in\mathbb{R}$.
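As a small illustration of the mixture notation $Y\sim\sum_{i}p_{i}f_{i}$, the sketch below draws from a mixture of a uniform distribution $\text{Unif}(S)$ and a Dirac distribution $\mathbb{I}_{x}$. The helper names `sample_mixture`, `unif`, and `dirac` are hypothetical and chosen here only for illustration; they do not come from the paper.

```python
import random

def unif(S):
    """Sampler for Unif(S), the uniform distribution over a finite set S."""
    items = sorted(S)
    return lambda rng: rng.choice(items)

def dirac(x):
    """Sampler for the Dirac distribution concentrated at x."""
    return lambda rng: x

def sample_mixture(weights, samplers, rng):
    """Draw Y ~ sum_i p_i f_i: pick component i with probability p_i, then sample from f_i."""
    (sampler,) = rng.choices(samplers, weights=weights)
    return sampler(rng)

rng = random.Random(0)
# Y ~ 0.7 * Unif({1, 2, 3}) + 0.3 * Dirac(7)
draws = [sample_mixture([0.7, 0.3], [unif({1, 2, 3}), dirac(7)], rng) for _ in range(1000)]
```

Every draw lies in $\{1,2,3\}\cup\{7\}$, and roughly 30% of the draws are expected to equal 7.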

II Literature Review

Interest in using single-agent reinforcement learning in multi-agent environments dates back at least as far as [51], in which Q-learning is studied in a cooperative predator-prey simulation. In [45], multiple agents run Q-learning in a block-pushing task without sharing actions with one another, and the authors suggest that cooperative behaviour may emerge even without explicit communication between agents.

In addition to presenting empirical results and formal conjectures, an important terminological distinction was popularized in [8], where the authors distinguish between joint action learners and independent learners: joint action learners use the past actions of all agents in their learning, while independent learners use only local action histories.

Early rigorous results on MARL in games were concerned mostly with joint-action learners. In [30], Littman proposed stochastic games as a framework for studying MARL and presented the Minimax Q-learning algorithm, a joint-action learner designed for two-player zero-sum games. Convergence results for this method were proved in [33]. The main idea from [30] was extended in [21] and [22], which present Nash Q-learning, another joint-action learner with convergence guarantees under certain restrictive assumptions. Further contributions in this line include Friend-or-Foe Q-learning [31], Team Q-Learning [32], and several others, e.g. [16, 52]. A considerably different approach is taken in [56], which presents Optimal Adaptive Play (OAP), a joint-action learner based on adaptive play [64] rather than on Q-learning. OAP is shown to converge to a team optimal policy when used in a stochastic team.

Though early rigorous work focused on joint action learners, there has also been persistent interest in independent learners. As the number of joint actions is exponential in the number of agents, the computational burden of a joint action learner at any one agent becomes intractable for problems of even a moderate size. Scalability, robustness, and faster convergence are potential advantages of independent learners over joint action learners [39, 48]. The applicability of the set-up considered here and other advantages are covered in greater detail in [39] and [26]. For a recent survey of MARL that discusses other decentralized set-ups, see [66].

Distributed Q-learning, an independent learner designed for teams, was presented in [25], along with a guarantee of convergence to team optimality in teams with deterministic state dynamics and costs. When using this algorithm, an agent only updates its Q-factors when an improvement is observed, attributing unfavourable feedback to its teammates’ experimenting with other actions. This optimistic approach leads to poor performance in problems with random state transitions or cost readings [39].

An algorithm called Win or Learn Fast Policy Hill Climbing (WoLF-PHC) was introduced in [6]. An agent using WoLF-PHC selects actions according to an exploration policy and iteratively improves its exploration policy using its learned Q-factors by updating toward a best-response. Although no formal results are presented for stochastic games, the key innovation of [6] is its policy update: the agent compares the performance of the current exploration policy to that of a distinguished “average policy.” When the current policy outperforms the average policy, the agent changes its policy relatively slowly; when the current policy is underperforming, the agent changes its policy more rapidly.

Following [25] and [6], a number of algorithms based on Q-learning were proposed for stochastic games. Some of these algorithms, such as Hysteretic Q-learning [37], modify the Q-factor update. Other methods, including the Frequency Maximum Q heuristic presented in [24] and its extensions to stochastic games [38], modify action selection. Still other methods, such as lenient learning [43, 60], modify both the Q-factor update as well as the action selection mechanism in an attempt to achieve optimality in cooperative games. With the exception of [25] described above, these works offer only empirical support for their algorithms, rather than formal guarantees of convergence to team optimality. For a survey on this line of research and a description of obstacles in MARL, see [39].

While researchers in the machine learning community sought empirically successful algorithms for stochastic games, a parallel line of research in the control and operations research communities sought rigorous results in the more restricted class of stateless repeated games.

Among the literature on MARL for repeated games, [7, 35] and [36] are most relevant to this paper. Although the algorithms and analysis presented in these works differ from one another, each operates using the principle of exploring an agent’s set of actions more aggressively when the agent perceives it is underperforming.

Reference [35] presents three algorithms, including Safe Experimentation Dynamics, which is shown to lead to team optimality in repeated teams with high probability. Using this method, an agent maintains a baseline action and baseline cost while experimenting with other actions. Each time an action is taken, its immediate cost is compared with the baseline cost; the baseline cost is adjusted when a lower cost is observed, and the action achieving this lower cost becomes the new baseline.

Agents using the algorithm from [36] maintain a binary “mood” variable, which is meant to capture whether the agent is content with its current performance. It is shown that all stochastically stable outcomes maximize the sum of joint payoffs across all agents.

Aspiration learning for repeated coordination games is presented in [7], along with formal results on the stochastic stability of efficient outcomes. An agent using this algorithm iteratively sets its aspiration level, a scalar threshold value that represents the highest cost (or lowest reward) that the agent finds acceptable. When receiving costs higher than its aspiration level, the agent is unsatisfied and explores alternative actions more aggressively.

Other work in this area includes [34, 44] and [23]. Variants of log-linear learning for repeated games were studied in [34, 44] and come with guarantees on the stochastic stability of efficient outcomes. The stochastic imitation dynamics introduced in [23] assign probability one to efficient outcomes in a large class of repeated games.

One explanation for the greater number of rigorous results on independent learners in a repeated game setting is the lack of state dynamics. In repeated games, the same stage game is played in each period and there is no tradeoff between short- and long-term costs. As such, the scalar cost realizations can be used directly when setting aspiration levels (as in [7]) or baseline costs (as in [35] and [36]). In contrast, policy evaluation is inherently slow (due to delayed rewards), noisy, and algorithm dependent in games with random state dynamics, and this is only exacerbated by the presence of other learning agents. Consequently, extending the preceding methods is a significant challenge.

In this paper, we study stochastic teams and common interest games with full state information at each agent but no action sharing between agents. This set-up arises naturally in problems where the state can be sensed by a global sensor and broadcast to agents. In [40], a (physically) distributed array of micro-electro-valves producing controlled and directed micro-air-jets is used to steer the motion of a small object on a smart surface. The state of this system is the current and previous positions of the object which is sensed by an overhead camera and accessed by all control units each controlling a separate valve. Each control unit implements a standard Q-learning algorithm based on the global state and its own control observations (by ignoring the other control units) for reasons stated as follows: “A fully centralized control architecture is not suitable due to processing complexity and the number of communication channels required”. In [26], robotics problems involving multi-dimensional action spaces are considered. The authors observe that centralized approaches in problems with multiple actuators are often intractable due to a combinatorial explosion of the joint state-action space. Among other decentralization schemes, the authors consider the case with full state but only local actions, wherein the actuators are able to sense the global state variable (e.g. two dimensional position and velocity in a vehicle navigation problem; three dimensional position in a joint manipulation task) but do not attempt to sense one another’s actions for computational tractability. Other applications for which this information structure is appropriate include problems where the state variable is a commonly observed price as well as problems in traffic networks, where link latencies can be broadcast using a mobile application.

Another motivation for studying this set-up is that algorithms designed for problems with full state information but no action-sharing have been successful even when used in problems possessing a different information structure, such as partial state observability. Examples of studies that use partial state observations as a surrogate for complete state observations and then use methods designed for our information structure include interference control in wireless networks [15, 42] and cache placement in wireless networks [29]. Many further examples can be found in the area of cognitive radio; see [55] and the references therein.

In [2], we introduced an independent learner that provably leads to equilibrium in weakly acyclic stochastic games in general and in teams in particular. However, stochastic teams generally have both team optimal equilibrium policies and suboptimal equilibrium policies, and suboptimal equilibria can perform arbitrarily worse than an optimal equilibrium. A simple but illustrative example is offered in Section III. Thus, guarantees of finding an equilibrium joint policy are not satisfactory in the context of decentralized control when cost minimization is a design goal. In this paper, we modify the main algorithm from [2] to guarantee convergence to team optimality when possible.

III Background

III-A Stationary Markov Decision Problems and Q-learning

A stationary Markov Decision Problem (MDP) with a discounted cost criterion is a discrete time process characterized by the following:

1. A finite set of states $\mathbb{X}$
2. A random initial state $x_{0}\in\mathbb{X}$
3. A finite set of control actions $\mathbb{U}$
4. A discount factor $\beta\in(0,1)$
5. A cost function $c:\mathbb{X}\times\mathbb{U}\to\mathbb{R}$
6. A transition probability kernel $P\in\mathcal{P}(\mathbb{X}|\mathbb{X}\times\mathbb{U})$ for determining the next state given the current state-action.

At time $t\in\mathbb{N}$, the system is in state $x_{t}\in\mathbb{X}$ and the decision maker (DM) selects a control action $u_{t}\in\mathbb{U}$. (We use the terms agent, decision maker, and player interchangeably.) The DM then incurs a stage cost $c(x_{t},u_{t})$, and the system randomly transitions to the next state, $x_{t+1}$, according to the probability distribution $P(\cdot|x_{t},u_{t})$. We assume that, prior to selecting $u_{t}$ at time $t\in\mathbb{N}$, the DM has access to the information $I_{t}$ defined by

$$I_{0}=\{x_{0}\},\quad I_{t+1}=I_{t}\cup\{x_{t+1},u_{t},c(x_{t},u_{t})\},\quad t\in\mathbb{N}.$$

A policy is a rule for selecting control actions based on the information available. In principle, the DM may use any function of $I_{t}$ to choose $u_{t}$ , possibly with randomization. Fixing a policy $\theta$ induces a probability distribution on the sequence of state-actions $\{(x_{t},u_{t})\}_{t\in\mathbb{N}}$ . This induced probability measure is used to define the cost criterion:

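$$J_{x}(\theta):=E^{\theta}\left(\sum_{t\in\mathbb{N}}\beta^{t}c(x_{t},u_{t})\,\Big|\,x_{0}=x\right),\quad\forall x\in\mathbb{X},$$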

where $E^{\theta}$ denotes that the stochastic process $\{(x_{t},u_{t})\}_{t\in\mathbb{N}}$ is determined by the policy $\theta$.

The DM’s goal is to select a policy that minimizes the cost functional $J_{x}$ in every initial state $x\in\mathbb{X}$ . Although the agent can use an arbitrarily complicated, history dependent policy, it is well-known (see, for example, [19]) that this minimum can be achieved within the simpler set of stationary randomized policies, which we identify with the set $\Delta=\mathcal{P}(\mathbb{U}|\mathbb{X})$ . A stationary randomized policy $\theta\in\Delta$ uses only the most recent state $x_{t}$ to (randomly) select an action $u_{t}$ in a time-invariant manner; that is, when the agent follows a policy $\theta\in\Delta$ , we have $u_{t}\sim\theta(\cdot|x_{t})$ . Within $\Delta$ , we can further restrict our attention (without loss of optimality [19]) to the set of stationary deterministic policies $\Pi$ , which we identify as $\Pi=\{\pi:\mathbb{X}\to\mathbb{U}\}$ . An agent following a policy $\pi\in\Pi$ selects its action as a deterministic function of the state, and we write $u_{t}=\pi(x_{t})$ or $u_{t}\sim\mathbb{I}_{\pi(x_{t})}$ .
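For a stationary deterministic policy $\pi\in\Pi$ and a known model, the discounted cost satisfies the fixed-point relation $J_{x}(\pi)=c(x,\pi(x))+\beta\sum_{y}P(y|x,\pi(x))J_{y}(\pi)$, which can be solved by repeated substitution because $\beta<1$. The following is a minimal sketch of that computation; the function name `evaluate_policy`, the stopping tolerance, and the two-state example are assumptions made here for illustration and are not taken from the paper.

```python
def evaluate_policy(P, c, policy, beta, tol=1e-10):
    """Approximate J[x], the expected discounted cost of a stationary deterministic policy.

    P[x][u][y] : probability of moving from state x to state y under action u
    c[x][u]    : stage cost of action u in state x
    policy[x]  : action selected in state x
    """
    n = len(P)
    J = [0.0] * n
    while True:
        # One application of the policy's Bellman operator.
        J_new = [
            c[x][policy[x]] + beta * sum(P[x][policy[x]][y] * J[y] for y in range(n))
            for x in range(n)
        ]
        if max(abs(a - b) for a, b in zip(J_new, J)) < tol:
            return J_new
        J = J_new

# Hypothetical two-state, two-action MDP, used only for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
c = [[1.0, 2.0],
     [0.0, 3.0]]
J = evaluate_policy(P, c, policy=[0, 0], beta=0.9)
```

Since the iteration is a contraction with modulus $\beta$, it terminates for any $\beta\in(0,1)$; a direct linear solve would work equally well on small problems.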

When the cost function and transition kernel are known, iterative methods such as value iteration can be used to obtain an optimal policy. Otherwise, model-free reinforcement learning techniques such as Q-learning [57] can be used to recover an optimal policy. In standard Q-learning, the DM begins with arbitrary Q-factors $Q_{0}\in\mathbb{R}^{\mathbb{X}\times\mathbb{U}}$ and updates its Q-factors as follows:

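$$Q_{t+1}(x_{t},u_{t})=\big(1-\alpha_{t}(x_{t},u_{t})\big)Q_{t}(x_{t},u_{t})+\alpha_{t}(x_{t},u_{t})\Big(c(x_{t},u_{t})+\beta\min_{v\in\mathbb{U}}Q_{t}(x_{t+1},v)\Big),$$

$$Q_{t+1}(x,u)=Q_{t}(x,u)\quad\text{for all }(x,u)\neq(x_{t},u_{t}),$$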

where $\alpha_{t}(x_{t},u_{t})\in[0,1]$ is the step-size at time $t\in\mathbb{N}$ . If all state-action pairs are visited infinitely often and the step-sizes vanish properly, then $\text{Pr}(Q_{t}\rightarrow Q^{*})=1$ , where $Q^{*}$ is the vector of optimal Q-factors, the unique solution of a Bellman fixed point equation [58], [54].
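The standard Q-learning update, in which the visited entry moves toward $c(x_{t},u_{t})+\beta\min_{v}Q_{t}(x_{t+1},v)$ with weight $\alpha_{t}$ and all other entries are left unchanged, can be sketched for a small tabular problem. This is a minimal illustration, not the authors' implementation: the function name `q_learning`, the uniform exploration rule, the step size $1/n$ (with $n$ the number of visits to the current state-action pair), and the two-state example are all assumptions made here.

```python
import random

def q_learning(P, c, beta, n_steps=100_000, seed=0):
    """Tabular Q-learning for a cost-minimizing MDP.

    P[x][u] is the next-state distribution and c[x][u] the stage cost.
    The learner only samples from P; it never reads the model directly.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]    # arbitrary initial Q-factors
    visits = [[0] * n_actions for _ in range(n_states)]
    x = rng.randrange(n_states)                         # random initial state
    for _ in range(n_steps):
        u = rng.randrange(n_actions)                    # assumed exploration: uniform over actions
        x_next = rng.choices(range(n_states), weights=P[x][u])[0]
        visits[x][u] += 1
        alpha = 1.0 / visits[x][u]                      # assumed step size, vanishing in the visit count
        target = c[x][u] + beta * min(Q[x_next])        # observed cost plus discounted best continuation
        Q[x][u] = (1.0 - alpha) * Q[x][u] + alpha * target
        x = x_next
    return Q

# Hypothetical two-state, two-action MDP, used only for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
c = [[1.0, 2.0],
     [0.0, 3.0]]
Q = q_learning(P, c, beta=0.5)
```

Uniform exploration is used here only because it visits every state-action pair of this example infinitely often, which is one of the conditions in the convergence statement; any exploration rule with that property would serve.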

Once $Q^{*}$ is attained, one can recover the value function $V^{*}$ , using $V^{*}(x)=\min_{u\in\mathbb{U}}Q^{*}(x,u)$ , or an optimal policy $\pi^{*}$ , using $\pi^{*}(x)\in\arg\min_{u\in\mathbb{U}}Q^{*}(x,u)$ . Moreover, learned Q-factors can be exploited during play: [47] presents a Q-learning algorithm in which the DM’s action selection converges to that of an optimal policy.
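Recovering $V^{*}$ and a greedy policy from a table of Q-factors is a row-wise minimization. The sketch below assumes a hypothetical table `Q_star` whose numbers are chosen purely for illustration; since $\arg\min$ may contain several actions, ties are broken here in favour of the lowest action index, which is one arbitrary choice among the optimal ones.

```python
def greedy_from_q(Q):
    """Return (V, policy) with V[x] = min_u Q[x][u] and policy[x] in argmin_u Q[x][u]."""
    V = [min(row) for row in Q]
    # min over indices returns the first minimizer, so ties go to the lowest action index.
    policy = [min(range(len(row)), key=row.__getitem__) for row in Q]
    return V, policy

# Hypothetical Q-factors for a two-state, two-action problem.
Q_star = [[1.875, 2.4375],
          [0.625, 3.3125]]
V, pi = greedy_from_q(Q_star)
```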

The popularity of Q-learning in stationary MDPs is justified: it is easy to implement and asymptotically recovers an optimal policy. However, this theoretical guarantee is predicated on the stationarity of the system. When a state-action $(x_{t},u_{t})$ is visited, the feedback received (in the form of a cost $c(x_{t},u_{t})$ and next state $x_{t+1}$ ) is always generated by the same Markovian source. If the system is not stationary, then convergence to the Q-factors $Q^{*}$ is not guaranteed.

III-B Stochastic Games and Decentralized Q-learning

A finite (discounted) stochastic game is a multi-agent generalization of a stationary MDP, and is characterized by

1. $N\in\mathbb{N}_{+}$ decision makers, the $i^{th}$ denoted by DMi
2. A finite set of states $\mathbb{X}$
3. A random initial state $x_{0}\in\mathbb{X}$
4. For each DMi:
   • A finite set of control actions $\mathbb{U}^{i}$
   • A discount factor $\beta^{i}\in(0,1)$
   • A cost function $c^{i}:\mathbb{X}\times\mathbb{U}\to\mathbb{R}$ , where $\mathbb{U}:=\times_{i=1}^{N}\mathbb{U}^{i}$
5. A transition probability kernel $P\in\mathcal{P}(\mathbb{X}|\mathbb{X}\times\mathbb{U})$ for determining the next state given the current state and joint action.

At time $t\in\mathbb{N}$ , the system is in state $x_{t}$ , and each DMi chooses a control action $u^{i}_{t}$ . While DMi only selects $u^{i}_{t}$ , its incurred cost is given by $c^{i}(x_{t},\textbf{u}_{t})$ , where $\textbf{u}_{t}:=(u^{1}_{t},\dots,u^{N}_{t})$ . Following the play of this stage game, the system randomly transitions to state $x_{t+1}$ according to $P(\cdot|x_{t},\textbf{u}_{t})$ . We consider the situation in which DMi observes only the state variable, its own actions, and its own cost realizations (DMi need not know the functional form of its cost). More precisely, prior to selecting $u_{t}^{i}$ at time $t\in\mathbb{N}$ , DMi has access to the information $I_{t}^{i}$ defined by

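$$I^{i}_{0}=\{x_{0}\},\qquad I^{i}_{t+1}=I^{i}_{t}\cup\{x_{t+1},u^{i}_{t},c^{i}(x_{t},\textbf{u}_{t})\},\quad t\in\mathbb{N}.$$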

In particular, DMi cannot see the past actions of the other DMs, $u^{j}_{s}$ , for any $j\not=i$ , $s\in\mathbb{N}$ . This is in contrast to previous works such as [56], [21], [31] and [67].

A policy for DMi is a rule for selecting the sequence of local actions given the information available to DMi. As in MDPs, DMi’s goal is to minimize its long-term expected discounted cost. Unlike MDPs, however, DMi’s cost is affected by the control actions of the other agents. We again restrict our attention to stationary randomized policies, which will be justified below. We denote the set of stationary randomized policies for DMi by $\Delta^{i}:=\mathcal{P}(\mathbb{U}^{i}|\mathbb{X})$ , and similarly we use $\Pi^{i}=\{\pi^{i}:\mathbb{X}\to\mathbb{U}^{i}\}$ to denote the set of stationary deterministic policies for DMi.

We use boldface symbols to denote joint objects, i.e. lists of objects with one entry per agent, and we omit the agent superscript. The set of stationary joint policies is thus denoted by $\bm{\Delta}:=\times_{i=1}^{N}\Delta^{i}$ and the set of stationary deterministic joint policies is denoted $\bm{\Pi}:=\times_{i=1}^{N}\Pi^{i}$ .

For notational convenience, we will use the agent superscript $-i$ to refer to a joint quantity for which DMi’s position has been removed. Using this standard convention, the set of stationary joint policies for all agents except DMi is denoted $\bm{\Delta}^{-i}:=\times_{j\not=i}\Delta^{j}$ . Similarly, $\bm{\Pi}^{-i}=\times_{j\not=i}\Pi^{j}$ and $\mathbb{U}^{-i}=\times_{j\not=i}\mathbb{U}^{j}$ . By convention, we may write $\mathbb{U}=\mathbb{U}^{i}\times\mathbb{U}^{-i}$ for any DMi, and similarly for the sets $\bm{\Delta}$ and $\bm{\Pi}$ . This allows us to re-write joint objects while isolating DMi’s role: for instance, a joint action $\textbf{u}\in\mathbb{U}$ can be re-written as $\textbf{u}=(u^{i},\textbf{u}^{-i})$ and a joint policy $\bm{\theta}\in\bm{\Delta}$ can be re-written as $\bm{\theta}=(\theta^{i},\bm{\theta}^{-i})$ .

A joint policy $\bm{\theta}\in\bm{\Delta}$ induces a probability measure on sequences of states and joint actions, which we use in defining DMi’s cost

[TABLE]

where $E^{\bm{\theta}}$ denotes that the stochastic process $\{(x_{t},\textbf{u}_{t})\}_{t\in\mathbb{N}}$ is determined by the policy $\bm{\theta}$ . Then, each DMi’s goal is to select a policy $\pi^{i}\in\Delta^{i}$ to minimize this cost.

Definition 1.

A policy $\pi^{*i}\in\Delta^{i}$ is called a best reply to $\bm{\theta}^{-i}\in\bm{\Delta}^{-i}$ (for DMi) if

[TABLE]

Any best reply $\pi^{*i}\in\Delta^{i}$ to $\bm{\theta}^{-i}\in\bm{\Delta}^{-i}$ is called a strict best reply with respect to $(\pi^{i},\bm{\theta}^{-i})$ if

[TABLE]

For any fixed $\bm{\theta}^{-i}\in\bm{\Delta}^{-i}$ , DMi faces a stationary MDP; hence, DMi always has a deterministic best reply to any $\bm{\theta}^{-i}\in\bm{\Delta}^{-i}$ . We denote the set of deterministic best replies by

[TABLE]

We can describe the set $\text{BR}^{i}(\bm{\theta}^{-i})$ using the optimal Q-factors for the MDP faced by DMi when playing against the policy $\bm{\theta}^{-i}$ . The vector of optimal Q-factors for this environment is denoted $Q^{*i}_{\bm{\theta}^{-i}}\in\mathbb{R}^{\mathbb{X}\times\mathbb{U}^{i}}$ . We include the policy $\bm{\theta}^{-i}$ in this notation as a reminder that the MDP and optimal Q-factors both depend on the policy used by all other players. Then, $\text{BR}^{i}(\bm{\theta}^{-i})$ can be expressed as

[TABLE]
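To make this characterization concrete, the sketch below enumerates the deterministic policies that pick a minimizing action in every state, given a table of Q-factors for the MDP that DMi faces. It is a hedged illustration: the name `best_reply_set` and the tolerance parameter `tol` are assumptions introduced here (with `tol=0.0` giving exact minimizers), and policies are represented as plain dictionaries from states to actions.

```python
import itertools


def best_reply_set(Q, states, actions, tol=0.0):
    """All deterministic policies that are greedy w.r.t. Q in every state."""
    minimizers = []
    for x in states:
        best = min(Q[(x, u)] for u in actions)
        # Actions whose Q-factor is within `tol` of the minimum at state x.
        minimizers.append([u for u in actions if Q[(x, u)] <= best + tol])
    # The set is a Cartesian product of the per-state minimizer lists.
    return [dict(zip(states, choice)) for choice in itertools.product(*minimizers)]


# Usage: a tie in state 0 yields two best replies.
Q = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0}
replies = best_reply_set(Q, states=[0, 1], actions=[0, 1])
```

The product structure means the size of the set can be computed state by state, without enumerating all $|\mathbb{U}^{i}|^{|\mathbb{X}|}$ deterministic policies.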

Definition 2.

A joint policy $\bm{\theta}^{*}\in\bm{\Delta}$ is called a (Markov perfect) equilibrium if $\theta^{*i}$ is a best reply to $\bm{\theta}^{*-i}$ , for all $i$ .

We denote the set of all Markov perfect equilibrium policies by $\bm{\Delta}_{\rm eq}$ and we denote the set of stationary deterministic equilibrium policies by $\bm{\Pi}_{\rm eq}:=\bm{\Delta}_{\rm eq}\cap\bm{\Pi}$ . In any finite discounted stochastic game, the set $\bm{\Delta}_{\rm eq}$ is non-empty [14]. Note, however, that the set $\bm{\Pi}_{\rm eq}$ may be empty in general stochastic games.

Definition 3.

A stochastic game is called a stochastic team (or simply a team) if there exists $c:\mathbb{X}\times\mathbb{U}\to\mathbb{R}$ and $\beta\in(0,1)$ such that

$$c^{i}=c\quad\text{and}\quad\beta^{i}=\beta,\quad\text{for all }i.$$

Definition 4.

A joint policy $\bm{\pi}^{*}\in\bm{\Pi}$ is called team-optimal if

[TABLE]

We use $\bm{\Pi}_{\rm opt}$ to denote the set of team optimal policies, which are stationary deterministic policies by definition. It is easy to see that $\bm{\Pi}_{\rm opt}$ may be empty in a general stochastic game but that $\bm{\Pi}_{\rm opt}$ is non-empty in any stochastic team.

Definition 5.

A stochastic game is called a common interest game if (i) $\bm{\Pi}_{\rm opt}$ is non-empty, and (ii) for any $\tilde{\bm{\pi}}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm opt}$ , we have

[TABLE]

This definition is consistent with the definition of a common interest game introduced in [3] and used in other literature, e.g., [50]. Teams are a proper subclass of common interest games. The repeated game ( $|\mathbb{X}|=1$ ) with the stage cost functions shown in Figure 1 is a common interest game for $a$ , $b>0$ but not a team unless $a=b$ and $\beta^{1}=\beta^{2}$ .

It is immediate that a team-optimal policy is an equilibrium; however, the converse need not be true. For an illustration of how poorly an equilibrium policy can perform with respect to team-optimality, consider again the repeated game presented in Figure 1 with $a=b>0$ and $\beta^{1}=\beta^{2}=\beta\in(0,1)$ . Clearly, the joint policy $\bm{\pi}_{\rm sub}:=(1,1)$ is an equilibrium policy, and so is the team-optimal policy $\bm{\pi}^{*}:=(2,2)$ . We have that $J^{i}(\bm{\pi}_{\rm sub})-J^{i}(\bm{\pi}^{*})=\frac{2a}{1-\beta}$ , for each agent $i\in\{1,2\}$ , which shows that the performance gap between an equilibrium policy and a team-optimal policy can be arbitrarily large. This provides the motivation for designing decentralized algorithms that allow agents to learn team-optimal policies, when they exist.
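To see where a gap of this form comes from, recall that in a repeated game each stationary deterministic joint policy incurs the same stage cost at every stage. As a sketch, write $d$ for the per-stage cost difference between the two policies; the symbol $d$ is introduced here for illustration only. Then

$$J^{i}(\bm{\pi}_{\rm sub})-J^{i}(\bm{\pi}^{*})=\sum_{t=0}^{\infty}\beta^{t}d=\frac{d}{1-\beta}.$$

The gap stated above corresponds to $d=2a$, and it grows without bound as $a\to\infty$ or as $\beta\to 1$.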

Our objective is the following: given a common interest game, we wish to provide each DM with a decentralized learning algorithm that does not use control sharing and that provably leads, in some appropriate sense, to a team optimal policy.

In [2], we presented an algorithm that leads to equilibrium policies in weakly acyclic games, another class of games (different from common interest games) that generalizes teams. This algorithm instructs each DM to use the same stationary policy, called its baseline policy, for a large number of consecutive stages, the collection of which is called an exploration phase. At the end of an exploration phase, DMs update their baseline policies in a synchronized manner. In this way, the system is stationary for long enough for Q-learning to return meaningful Q-factors. The Q-factors acquired during an exploration phase are used to construct best replies; Q-factors are then reset for the next exploration phase. The DMs use inertial best-responding to update their baseline policies, and it is shown that this process leads to equilibrium policies in weakly acyclic games.

In the next section, we present a decentralized learning algorithm that leads to team-optimal policies, when they exist. The algorithm here uses the exploration phase technique from [2], but modifies the baseline policy update in order to exploit the following structural result on Q-factors in teams and common interest games.

Lemma 1.

In a common interest game, for all $i$ , $\bm{\pi}^{*}\in\bm{\Pi}_{\rm opt}$ , $\tilde{\bm{\pi}}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm opt}$ , we have

[TABLE]

This fact gives us an avenue for separating team-optimal policies from the other policies by focusing on Q-factors.

Proof.

For all $i$ , $\bm{\pi}^{*}\in\bm{\Pi}_{\rm opt}$ , $\tilde{\bm{\pi}}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm opt}$ , we have

[TABLE]

If $\tilde{\pi}^{i}\in\text{BR}^{i}(\tilde{\bm{\pi}}^{-i})$ , then $J_{x}^{i}(\tilde{\bm{\pi}})=Q^{i}_{\tilde{\bm{\pi}}^{-i}}(x,\tilde{\pi}^{i}(x))$ ; otherwise,

[TABLE]

∎

IV Learning Team Optimality

In this section, we introduce a learning algorithm for achieving team optimality in teams and common interest games. To motivate our algorithm, we first study a time-homogenous Markov chain $\{\bm{\pi}_{k}\}_{k\geq 0}$ , taking values in the set of joint stationary deterministic policies $\bm{\Pi}$ . The dynamics of this Markov chain will be determined by the Idealized Update Procedure (IUP), detailed in Algorithm 1. While the IUP cannot be implemented in a stochastic common interest game under the information structure of interest, the resulting Markov chain will be used in approximation arguments in the proofs of our main results.

Under inertial best-responding with inertia parameter $\lambda^{i}\in(0,1)$ , at time $k\in\mathbb{N}$ , DMi checks whether its current policy $\pi^{i}_{k}$ is a best-reply to the policy being used by other players, i.e. it checks if $\pi^{i}_{k}\in\text{BR}^{i}(\bm{\pi}^{-i}_{k})$ ; if it is, then $\pi^{i}_{k+1}=\pi^{i}_{k}$ . Otherwise DMi is not best-replying and selects

[TABLE]

that is, DMi switches to a random best-reply with probability $1-\lambda^{i}$ or is inert (does not change away from $\pi^{i}_{k}$ ) with probability $\lambda^{i}$ . Including inertia in one’s policy update helps avoid cycling in best-reply dynamics. For example, in the game in Figure 1, if play starts at either joint policy $(1,2)$ or at $(2,1)$ and both players switch to a best-reply at each step, the joint policy will cycle between $(1,2)$ and $(2,1)$ perpetually. Such cycling can be avoided by using explicit coordination mechanisms for determining which DM should change its policy and at what time, but such mechanisms may not be feasible in decentralized settings. Simple decentralized mechanisms such as inertia can be used to the same effect [34, 35].

The condition-dependent nature of inertial best-responding can be captured using a stochastic kernel $R^{i,\lambda^{i}}\in\mathcal{P}(\Pi^{i}|\Pi^{i}\times 2^{\Pi^{i}})$ , where $R^{i,\lambda^{i}}$ selects a successor policy randomly, conditioning on the current policy and the current (perhaps estimated) best-reply set. To allow for uncertainty of $\text{BR}^{i}(\bm{\pi}^{-i}_{k})$ , we define $R^{i,\lambda^{i}}$ as follows:

[TABLE]

for any $\pi^{i}\in\Pi^{i},B^{i}\in 2^{\Pi^{i}}$ and $\tilde{\pi}^{i}\in\Pi^{i}$ .

Note that selecting $\pi^{i}_{k+1}\sim R^{i,\lambda^{i}}(\cdot|\pi^{i}_{k},\text{BR}^{i}(\bm{\pi}_{k}^{-i}))$ is equivalent to selecting $\pi^{i}_{k+1}$ according to inertial best-responding with parameter $\lambda^{i}$ .
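Inertial best-responding can be sketched as a sampling routine. The function below follows the verbal description: stay put if already best-replying, otherwise stay with probability $\lambda^{i}$ and switch to a uniformly chosen best reply with probability $1-\lambda^{i}$. The name `inertial_best_reply`, the representation of policies as tuples, and the choice to stay put when the supplied best-reply set is empty are assumptions made for this illustration.

```python
import random


def inertial_best_reply(current, best_replies, lam, rng=random):
    """Sample a successor policy given the current policy and a best-reply set."""
    # Already best-replying, or no usable estimate (assumed convention): stay.
    if not best_replies or current in best_replies:
        return current
    # Inert with probability lam.
    if rng.random() < lam:
        return current
    # Otherwise switch to a best reply chosen uniformly at random.
    return rng.choice(sorted(best_replies))


# Usage: a policy on two states is a tuple of actions, one per state.
successor = inertial_best_reply((0, 0), {(1, 0), (1, 1)}, lam=0.3)
```

Passing an estimated set in place of the true best-reply set is what lets the same routine be reused when best replies are only learned approximately.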

Under the Idealized Update Procedure (IUP), presented in Algorithm 1, DMi chooses $\pi^{i}_{k+1}$ according to a mixture of uniform random experimentation and inertial best-responding when the joint policy is team optimal, i.e. $\bm{\pi}_{k}\in\bm{\Pi}_{\rm opt}$ . When $\bm{\pi}_{k}\notin\bm{\Pi}_{\rm opt}$ , DMi uses a mixture of uniform random experimentation and a player-selected stochastic kernel $h^{i}\in\mathcal{P}(\Pi^{i}|\Pi^{i}\times 2^{\Pi^{i}})$ to choose $\pi^{i}_{k+1}$ .

We will require that DMi randomly explores $\Pi^{i}$ more when the joint policy is not team optimal, i.e. $\kappa^{i}\gg\gamma^{i}$ . Qualitatively, this results in shifting away from suboptimal joint policies more quickly than team optimal policies, and as a result the process spends a large fraction of time in $\bm{\Pi}_{\rm opt}$ . We formalize this intuition below, and note that the guarantee of Lemma 2, on attaining team-optimality in common interest games, holds for arbitrary $\{h^{i}\}_{i=1}^{N}$ . That is, DMi has some flexibility in how it updates its policies when not experimenting and when the current joint policy is not team optimal.

Lemma 2.

Consider a common interest game, and suppose each DMi updates its policies according to the IUP in Algorithm 1. Let $A_{\bm{\gamma},\bm{\kappa},\textbf{h}}$ denote the matrix of the transition probabilities for the induced time-homogenous Markov chain on $\bm{\Pi}$ , where $\bm{\gamma}:=\{\gamma^{i}\}_{i=1}^{N}$ , $\bm{\kappa}:=\{\kappa^{i}\}_{i=1}^{N}$ , $\textbf{h}=\{h^{i}\}_{i=1}^{N}$ . We denote the associated unique stationary distribution by $\mu^{*}_{\bm{\gamma},\bm{\kappa},\textbf{h}}$ . For any $\epsilon\in(0,1)$ , $\bm{\kappa}\in(0,1)^{N}$ , there exists $\bar{\gamma}_{\epsilon}(\bm{\kappa})>0$ such that if $\gamma^{i}\in(0,\bar{\gamma}_{\epsilon}(\bm{\kappa}))$ for all $i$ , then

[TABLE]

Moreover, for all $\mu_{0}\in\mathcal{P}(\bm{\Pi})$ , we have

[TABLE]

Proof.

Since $\gamma^{i},\kappa^{i}>0$ for all $i$ , the induced Markov chain is irreducible, hence there exists a unique $\mu_{\bm{\gamma,\kappa,h}}^{*}$ such that $\mu_{\bm{\gamma,\kappa,h}}^{*}=\mu_{\bm{\gamma,\kappa,h}}^{*}A_{\bm{\gamma,\kappa,h}}$ . We have

[TABLE]

This leads to

[TABLE]

which implies (3). The last part follows from the aperiodicity of the Markov chain. ∎

Lemma 2 shows that if DMs follow the IUP, then they would choose a team-optimal policy in the long run with arbitrarily high probability provided the experimentation probabilities of $\bm{\gamma}$ are positive but sufficiently small relative to $\bm{\kappa}$ .
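The effect described by Lemma 2 can be checked numerically on a toy example. The sketch below builds the IUP transition probabilities for a hypothetical 2x2 repeated team game and approximates the stationary distribution by power iteration. Several things here are assumptions rather than the paper's construction: the cost table `COST`, the choice $h^{i}=R^{i,\lambda^{i}}$, and the exact mixture form (experiment uniformly with probability $\gamma^{i}$ at team-optimal joint policies and $\kappa^{i}$ elsewhere), which is read off the prose description rather than from Algorithm 1 itself.

```python
import itertools

# Hypothetical common stage cost c(u1, u2) of a 2x2 repeated team game (|X| = 1),
# so each stationary deterministic policy is just an action.
COST = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.0}
ACTIONS = (0, 1)
JOINT = list(itertools.product(ACTIONS, repeat=2))
OPT = {(1, 1)}  # the unique minimizer of COST


def best_replies(i, joint):
    """Best replies of DM i to the other DM's current action."""
    def cost_of(u):
        trial = list(joint)
        trial[i] = u
        return COST[tuple(trial)]
    best = min(cost_of(u) for u in ACTIONS)
    return {u for u in ACTIONS if cost_of(u) == best}


def inertial_probs(current, br, lam):
    """Probabilities of inertial best-responding over ACTIONS."""
    if current in br:
        return {u: float(u == current) for u in ACTIONS}
    return {u: lam * (u == current) + (1.0 - lam) * (u in br) / len(br)
            for u in ACTIONS}


def iup_row(joint, gamma, kappa, lam):
    """Transition probabilities out of `joint`; DMs update independently."""
    explore = gamma if joint in OPT else kappa
    per_dm = []
    for i in range(2):
        base = inertial_probs(joint[i], best_replies(i, joint), lam)
        per_dm.append({u: (1.0 - explore) * base[u] + explore / len(ACTIONS)
                       for u in ACTIONS})
    return {nxt: per_dm[0][nxt[0]] * per_dm[1][nxt[1]] for nxt in JOINT}


def stationary_distribution(gamma, kappa, lam=0.5, iters=2000):
    """Approximate mu = mu A by power iteration from the uniform distribution."""
    rows = {j: iup_row(j, gamma, kappa, lam) for j in JOINT}
    mu = {j: 1.0 / len(JOINT) for j in JOINT}
    for _ in range(iters):
        mu = {nxt: sum(mu[j] * rows[j][nxt] for j in JOINT) for nxt in JOINT}
    return mu


mu = stationary_distribution(gamma=0.001, kappa=0.5)
```

In this toy game, taking `gamma` much smaller than `kappa` concentrates the stationary mass on the team-optimal joint policy `(1, 1)`, whereas setting `gamma == kappa` splits the mass symmetrically between the two equilibria.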

It is clear that the IUP cannot be directly implemented in our study of decentralized, online teams. The first issue relates to decentralization: DMi cannot observe the policy $\bm{\pi}^{-i}_{k}$ . The second issue relates to the online nature of the problem: even if $\bm{\pi}^{-i}_{k}$ were known, DMi may not know its best-reply set or the set of team optimal policies. Nevertheless, the IUP motivates our decentralized learning algorithm, Algorithm 2, which can be viewed as a two timescale approximation of the IUP. We expand on this point below, after presenting the main result of this section.

We emphasize that Algorithm 2 is decentralized in the sense that it can be implemented by “independent learners,” in the terminology of [39, 66]. That is, each DMi can run a separate copy of this algorithm without reference to the joint actions or policies of the remaining players. We recall that each DMi’s interaction with its environment at any time $t$ consists of sending its control decision $u_{t}^{i}$ and receiving its cost realization $c^{i}(x_{t},u_{t}^{1},\dots,u_{t}^{N})$ as well as the next state $x_{t+1}$ without observing any information about the other DMs, in particular, without observing the control decisions $\textbf{u}_{t}^{-i}$ of the other DMs. In fact, each DMi need not even be aware of the presence of the other DMs or of the fact that it is engaged in learning in a multi-player game. Put simply, each DM runs a single-agent algorithm similar to standard Q-learning (one that is re-initialized after its baseline policy is updated at the end of each exploration phase). As such, all quantities computed by DMi’s copy of Algorithm 2 are indexed by $i$ . These remarks also apply verbatim to Algorithm 3 introduced in Section VI.

Assumption 1.

For all $x,x^{\prime}\in\mathbb{X}$ , there exists $H\in\mathbb{N}$ and $\tilde{\textbf{u}}_{0},\dots,\tilde{\textbf{u}}_{H}\in\mathbb{U}$ such that

[TABLE]

Assumption 2.

Assume, for all $i$ , $\delta^{i}\in(0,\bar{\delta})$ , $d^{i}\in(0,\bar{d})$ , $\rho^{i}\in(0,\bar{\rho})$ , where $\bar{\delta}$ , $\bar{d}$ , $\bar{\rho}$ are constants defined in Appendix A that depend only on the game.

Theorem 1.

Consider a common interest game in which each DMi uses Algorithm 2, and let Assumptions 1-2 hold. For any $\epsilon>0$ , there exist

[TABLE]

where $W_{\max}:=\max_{i}W^{i}$ such that if, for all $i$ , $k\in\mathbb{N}$

[TABLE]

then

[TABLE]

Proof.

See Appendix A. ∎

Discussion

Algorithm 2 can be viewed as a two timescale approximation to the IUP in Algorithm 1 (see footnote 2). The faster timescale is that where time is indexed by the stage games, comprising lines 16-23 of Algorithm 2. The selection of actions, observation of costs and state transitions, and Q-factor updates all occur on this faster timescale. In contrast, the slower timescale is where time is indexed by the exploration phase. Decisions on the slower timescale involve processing learned Q-factors to estimate one’s best-reply set (line 24), computing a “cost score” and comparing it to historical cost scores (lines 25-27), and updating one’s baseline policy (lines 27-31).

Footnote 2: In two timescale algorithms in the literature (e.g. [5]), both the Q-factors and the policies would be updated incrementally at each time $t=1,2,\dots$ . The step size sequences for Q-learning and policy updating would be selected so that policies are effectively fixed while Q-factors are learned. In our algorithms, the policies are updated without using any step sizes but only at $t=t_{1}-1,t_{2}-1,\dots$ , whereas the Q-factors are updated at each time $t=1,2,\dots$ using step sizes that are re-initialized at $t=t_{1}-1,t_{2}-1,\dots$ and are reduced during $t\in[t_{k},t_{k+1}-1)$ at a rate satisfying the assumptions of the standard (i.e., one time scale) stochastic approximation theory.

Crucially, the baseline policies are fixed within an exploration phase and only change between exploration phases. This means that for any $k\geq 0$ , from the point of view of any DMi, the environment is stationary within the $k^{th}$ exploration phase (EP) and equivalent to an MDP determined by $\bm{\pi}^{-i}_{k}$ . It was shown in [2] that, under certain conditions (satisfied here by Assumptions 1 and 2), Q-learning within an EP leads to informative Q-factors that can be used, among other things, to recover one’s best-reply set $\text{BR}^{i}(\bm{\pi}^{-i}_{k})$ .

The analogy between Algorithm 2 and the IUP can be seen by comparing the if-suite (lines 6-9) in Algorithm 1 with its counterpart (lines 27-30) in Algorithm 2. The unobservable condition $\bm{\pi}_{k}\in\bm{\Pi}_{\rm opt}$ of the IUP has been replaced by a surrogate condition $S^{i}_{k}\leq\Lambda^{i}_{k}$ . Here, $S_{k}^{i}$ is a “cost score,” which aggregates DMi’s policy performance across all states for the $k^{th}$ exploration phase, and $\Lambda_{k}^{i}$ is a measure of DMi’s best performance during the preceding $W^{i}$ exploration phases. Importantly, the condition $S^{i}_{k}\leq\Lambda^{i}_{k}$ can be verified by independent learners.

Algorithm 2 is in the spirit of aspiration learning algorithms [7], where $\Lambda_{k}^{i}$ plays the role of DMi’s aspiration level, a scalar quantity against which DMi compares the performance of its policy $\pi_{k}^{i}$ during the $k^{th}$ exploration phase. Each DMi aspires to perform at least as well as its aspiration level, which is updated at the end of each exploration phase and may be thought of as a maximum tolerable cost, i.e., if the baseline policy yields higher cost, then it is viewed as unsatisfactory.

Unlike the aspiration learning methods in the literature, which focus on repeated games with no state dynamics and players with no look ahead, Algorithm 2 is designed for stochastic dynamic games with nontrivial state dynamics and far-sighted players. Due to the long-run cost considerations in dynamic stochastic games, evaluating the cost of a policy is a slow and noisy process, which leads to additional difficulties in setting the aspiration levels.

In light of Lemma 1, a viable approach is to use the learned Q-factors to produce cost scores and to set the aspiration levels to the minimum cost score over some window of the past. However, scores obtained from the (random) Q-factors are noisy estimates of the scores corresponding to the true cost of the policies. In particular, setting the aspiration levels to the minimum of the cost scores over the entire past based on the learned Q-factors can result in unattainable aspiration levels. Hence, to mitigate the effects of the noise present in the learned Q-factors, we set the aspiration levels of each DMi to the minimum cost score obtained over a finite window of the most recent past within some tolerance. This allows DMs to discard unattainable cost scores in finite time.
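The finite-window idea can be sketched as follows. This is an illustrative assumption about how such a rule might be coded, not a transcription of Algorithm 2: the class name `AspirationTracker`, the additive `tolerance`, and the convention that every score is satisfactory before any history exists are all hypothetical choices.

```python
from collections import deque


class AspirationTracker:
    """Aspiration = minimum of the last `window` cost scores, plus a tolerance."""

    def __init__(self, window, tolerance):
        # A bounded deque drops the oldest score once `window` scores are stored.
        self.scores = deque(maxlen=window)
        self.tolerance = tolerance

    def aspiration(self):
        # Assumed convention: with no history, any score counts as satisfactory.
        if not self.scores:
            return float("inf")
        return min(self.scores) + self.tolerance

    def satisfied(self, score):
        """Check the surrogate condition score <= aspiration, then record the score."""
        ok = score <= self.aspiration()
        self.scores.append(score)
        return ok


# Usage: a noisy, unattainably low score of 1.0 is forgotten after two phases.
tracker = AspirationTracker(window=2, tolerance=0.0)
outcomes = [tracker.satisfied(s) for s in (1.0, 3.0, 3.0, 3.0)]
```

In the usage example the spurious score of 1.0 makes the next two phases look unsatisfactory, but once it leaves the window the attainable score of 3.0 is accepted again, which is the behavior the finite window is meant to provide.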

Another aspect of Algorithm 2 is the persistent experimentation in the policy space. Experimentation when DMs feel that they meet their aspiration levels ( $S_{k}^{i}\leq\Lambda_{k}^{i}$ ) is required to prevent DMs from settling on a policy that is not team-optimal. This is due to the finite window approach used for setting the aspiration levels and the possibility of setting suboptimal aspiration levels. Experimentation when $S_{k}^{i}>\Lambda_{k}^{i}$ is also necessary to aid DMs in searching for team-optimal policies.

Finally, we note that the set of approximate best responses $\text{BR}_{k}^{i}$ computed by each DMi within each exploration phase $k$ is a subset of $\Pi^{i}$ , the set of stationary and deterministic policies of DMi. Therefore, $|\text{BR}_{k}^{i}|\leq|\Pi^{i}|=|\mathbb{U}^{i}|^{|\mathbb{X}|}$ . We note that $\text{BR}_{k}^{i}$ is computed via the Q-factors $Q^{i}_{t_{k}+1}\in\mathbb{R}^{\mathbb{X}\times\mathbb{U}^{i}}$ , which is of size $|\mathbb{X}||\mathbb{U}^{i}|$ .

V Beyond Team Optimality: Application to Weakly Acyclic Games

In this section, we consider a special case of Algorithm 2 that has desirable convergence properties in weakly acyclic games, in addition to providing team-optimality in the sense of Theorem 1.

Definition 6.

A (possibly finite) sequence $\bm{\pi}_{0},\bm{\pi}_{1},\dots$ in $\bm{\Pi}$ is called a multi-DM strict best reply path if, for each $k$ , $\bm{\pi}_{k}$ and $\bm{\pi}_{k+1}$ differ for at least one DM and, for each deviating DMi, $\bm{\pi}_{k+1}^{i}$ is a strict best reply with respect to $\bm{\pi}_{k}$ .

Definition 7.

A stochastic game is called weakly acyclic under multi-DM strict best replies (or simply weakly acyclic) if there is a multi-DM strict best reply path starting from each deterministic joint policy and ending at a deterministic equilibrium policy.

The notion of weak acyclicity used here is with respect to stationary deterministic policies for stochastic games, and generalizes the notion of weak acyclicity introduced in [63] for single-stage games. All teams are weakly acyclic; however, a common interest game need not be. See [12] for other examples of single-stage weakly acyclic games.
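For a repeated game ( $|\mathbb{X}|=1$ ), where a stationary deterministic policy is just an action, weak acyclicity can be checked by a graph search over joint policies. The sketch below is a hedged illustration under that single-state assumption: a strict best reply is taken to be a best reply that strictly lowers the DM's stage cost, and the function names and cost tables are hypothetical.

```python
import itertools


def strict_best_replies(costs, i, joint, actions):
    """Best replies of DM i that strictly improve on its current action."""
    def cost_of(u):
        trial = list(joint)
        trial[i] = u
        return costs[i][tuple(trial)]
    best = min(cost_of(u) for u in actions)
    if cost_of(joint[i]) == best:
        return []  # already best-replying, so no strict improvement exists
    return [u for u in actions if cost_of(u) == best]


def successors(costs, joint, actions):
    """Joint policies reachable in one multi-DM strict best reply step."""
    options = [[joint[i]] + strict_best_replies(costs, i, joint, actions)
               for i in range(len(joint))]
    # Every DM either stays or switches; at least one DM must switch.
    return {nxt for nxt in itertools.product(*options) if nxt != joint}


def is_weakly_acyclic(costs, actions, n_players=2):
    """True if every joint policy has such a path to an equilibrium."""
    for start in itertools.product(actions, repeat=n_players):
        frontier, seen, reached = [start], {start}, False
        while frontier and not reached:
            joint = frontier.pop()
            nxt = successors(costs, joint, actions)
            if not nxt:  # no DM can strictly improve: an equilibrium
                reached = True
            for j in nxt - seen:
                seen.add(j)
                frontier.append(j)
        if not reached:
            return False
    return True


# Usage: a team (identical costs) versus a matching-pennies-style game.
team = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.0}
match = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 1.0}
mismatch = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.0, (1, 0): 0.0}
team_ok = is_weakly_acyclic([team, team], actions=(0, 1))
pennies_ok = is_weakly_acyclic([match, mismatch], actions=(0, 1))
```

The team example is weakly acyclic, consistent with the statement that all teams are. The matching-pennies-style game has no deterministic equilibrium, so the search never terminates at one and the check fails.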

In weakly acyclic games, the inertial best reply dynamics [2] lead to equilibrium policies. If the policy update functions satisfy $h^{i}=R^{i,\lambda^{i}}$ for each DMi, the IUP introduced in the previous section can be regarded as a perturbed inertial best reply dynamics, where $\{\bm{\pi}_{k}\in\bm{\Pi}_{\rm opt}\}$ can be replaced with any arbitrary event if the game is not a common interest game, provided the induced Markov chain is time-homogenous.

Assumption 3.

For every DMi, $h^{i}=R^{i,\lambda^{i}}$ , where $\lambda^{i}\in(0,1)$ .

Under Assumption 3, each DMi always best replies with inertia when not experimenting.

Lemma 3.

Consider a weakly acyclic game. Suppose that each DMi updates its policy according to the IUP of Algorithm 1, and let Assumption 3 hold. Let $A_{\bm{\gamma,\kappa}}$ denote the matrix of the transition probabilities for the induced time homogenous Markov chain on $\bm{\Pi}$ . Denote the unique stationary distribution associated to this Markov chain by $\mu^{*}_{\bm{\gamma,\kappa}}$ . For any $\epsilon>0$ , there exists $\bar{\kappa}_{\epsilon}\in(0,1)$ such that $\max\{\gamma^{i},\kappa^{i}\}\in(0,\bar{\kappa}_{\epsilon})$ , for all $i$ , implies

[TABLE]

Moreover, uniformly over all such $\bm{\gamma,\kappa}$ , there exists $\bar{m}\in\mathbb{N}$ such that

[TABLE]

Proof.

For all $\bm{\pi}^{*}\in\bm{\Pi}_{\rm eq}$ ,

[TABLE]

Let $L_{\bm{\pi}}<|\bm{\Pi}|$ be the length of a multi-DM strict best reply path of minimal length from $\bm{\pi}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm eq}$ to some $\tilde{\bm{\pi}}\in\bm{\Pi}_{\rm eq}$ , and $L:=\max_{\bm{\pi}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm eq}}L_{\bm{\pi}}$ . For any $\bm{\pi}\not\in\bm{\Pi}_{\rm eq}$ , consider a path $\bm{\pi}=\bm{\pi}_{0},\bm{\pi}_{1},\dots,\bm{\pi}_{L}$ where $\bm{\pi}_{0},\bm{\pi}_{1},\dots,\bm{\pi}_{L_{\bm{\pi}}}$ is a multi-DM strict best reply path and $\bm{\pi}_{L_{\bm{\pi}}}=\cdots=\bm{\pi}_{L}=\tilde{\bm{\pi}}\in\bm{\Pi}_{\rm eq}$ . In each transition $\bm{\pi}_{k}\rightarrow\bm{\pi}_{k+1}$ , some DMs switch to one of their strict best replies and the others stay put. Therefore, from any $\bm{\pi}\not\in\bm{\Pi}_{\rm eq}$ , the IUP with $\bm{\gamma}=\bm{\kappa}\equiv 0$ generates such a path $\bm{\pi}_{0},\bm{\pi}_{1},\dots,\bm{\pi}_{L}$ with probability at least $p_{\min}:=\prod_{i=1}^{N}\min\{\lambda^{i},(1-\lambda^{i})/|\Pi^{i}|\}^{L}\in(0,1)$ . By taking $\gamma^{i}>0$ , $\kappa^{i}>0$ into account, this leads to

[TABLE]

for all $\bm{\pi}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm eq}$ . Writing $A=A_{\bm{\gamma,\kappa}}$ , from (4)-(5), we have, for all $k\in\mathbb{N}$ ,

[TABLE]

This leads to, for all $j$ , $k\in\mathbb{N}$ ,

[TABLE]

Since $|1-p_{\min}|<1$ , the desired result follows. ∎

For small experimentation probabilities, the IUP under Assumption 3 leads to equilibrium policies in the long run. We will use this to show that Algorithm 2 under Assumptions 1-3 has the same long run behavior.

For weakly acyclic games, decentralized learning algorithms which assign arbitrarily high probabilities to equilibrium policies in the long run are presented in [2]. However, these algorithms do not provide any guarantee on achieving team-optimality when implemented in teams or common interest games. We now strengthen a result of [2] with respect to team-optimality.

Theorem 2.

Consider a weakly acyclic game in which each DMi uses Algorithm 2, and let Assumptions 1-3 hold. For any $\epsilon>0$ , there exist

[TABLE]

where $W_{\rm max}=\max_{i}W^{i}$ , such that if, for all $i$ , $k\in\mathbb{N}$ ,

[TABLE]

then

[TABLE]

Moreover, if the game is a common interest game, then $\bm{\Pi}_{\rm eq}$ can be replaced by $\bm{\Pi}_{\rm opt}$ in (6).

Proof.

See Appendix B. ∎

VI Learning with Constant Aspirations

In this section, we introduce Algorithm 3, a variant of Algorithm 2 in which every DMi employs a constant aspiration level $\Lambda^{i}\in\mathbb{R}$ throughout, i.e., $\Lambda^{i}_{k}=\Lambda^{i}$ for every exploration phase $k\in\mathbb{N}$ . Pre-setting the aspiration levels is motivated by applications where each DM has prior knowledge of a conservative estimate of its achievable cost. Such prior knowledge may be available to DMs, for example, from previous experience or through an initial phase of experimentation, and can be used to heuristically discern “good” from “bad” performance.

Suppose there is a set of joint policies, each of which simultaneously outperforms all pre-set aspiration levels (i.e., the cost estimates), while every other joint policy fails to satisfy any DM. We show that DMs using Algorithm 3 will then almost surely outperform their aspiration levels in the long run (parts (1)-(2) of Theorem 3). This is the case, for example, in a common interest game when the aspiration levels are between the dominant costs and the other costs. In contrast, DMs using Algorithm 2 adaptively adjust their aspiration levels and achieve optimal performance, but only in common interest games and in the weaker sense of eventually assigning arbitrarily high probability to the set of optimal policies (Theorem 1).

In addition, unlike for Algorithm 2, we characterize the long-term behavior of Algorithm 3 in all games, regardless of whether or not the pre-set aspiration levels are achievable. Loosely speaking, DMs using Algorithm 3 in any game are likely to use a certain minimal set of policies in the long run, which is closed under multi-DM strict best replies (parts (3)-(4) of Theorem 3). This minimal set of policies reduces to the set $\bm{\Pi}_{\rm eq}$ of equilibrium policies in any weakly acyclic game. Thus, in parts (3)-(4) of Theorem 3, we characterize the long-term behavior of Algorithm 3 in a manner analogous to and, in fact, more general than Theorem 2, which characterizes the long-term behavior of Algorithm 2 as $\bm{\Pi}_{\rm eq}$ in weakly acyclic games.

The following definition is introduced to describe the long-term behavior of Algorithm 3.

Definition 8.

For any $i$ , $\bm{\eta}\in\bm{\Delta}$ , $\bm{\pi}\in\bm{\Pi}$ , $\bm{\Lambda}\in\mathbb{R}^{N}$ , let

[TABLE]

(i)

Let $\widetilde{BR}(\bm{\pi}):=$

[TABLE]

A nonempty set of policies $\tilde{\bm{\Pi}}\subset\bm{\Pi}$ is closed under multi-DM strict best replies, or a cumber set, if

[TABLE]

A cumber set is minimal if it does not properly contain another cumber set.

(iii)

Let

[TABLE]

A nonempty set of policies $\tilde{\bm{\Pi}}\subset\bm{\Pi}$ is closed under multi-DM strict best replies with aspiration levels $\bm{\Lambda}=\{\Lambda^{i}\}_{i=1}^{N}$ , or a $\bm{\Lambda}$ -cumber set, if

[TABLE]

A $\bm{\Lambda}$ -cumber set is minimal if it does not properly contain another $\bm{\Lambda}$ -cumber set.

Let $\bm{\Pi}_{\rm cumber}$ and $\bm{\Pi}_{\rm cumber}^{\bm{\Lambda}}$ denote the union of minimal cumber sets and the union of minimal $\bm{\Lambda}$-cumber sets, respectively.

The repeated game ( $|\mathbb{X}|=1$ ) with the stage cost functions shown in Figure 2 is a common interest game for $\beta^{1}=\beta^{2}$ . The minimal cumber sets are $\{(1,1),(2,1),(2,2),(1,2)\}$ (which is also a strict best-reply path) and $\{(3,3)\}$ , which are also the minimal $\bm{\Lambda}$-cumber sets for $\Lambda^{1}=\Lambda^{2}<7$ . For $\Lambda^{1}=\Lambda^{2}\in[7,10)$ , there are three minimal $\bm{\Lambda}$-cumber sets: $\{(2,1)\}$ , $\{(1,2)\}$ , and $\{(3,3)\}$ . For $\Lambda^{1}=\Lambda^{2}\in[10,20)$ , there are five minimal $\bm{\Lambda}$-cumber sets: $\{(1,1)\}$ , $\{(2,1)\}$ , $\{(2,2)\}$ , $\{(1,2)\}$ , and $\{(3,3)\}$ . For $\Lambda^{1}=\Lambda^{2}\geq 20$ , any singleton $\{\bm{\pi}\}$ , where $\bm{\pi}\in\bm{\Pi}$ , is a minimal $\bm{\Lambda}$-cumber set. On the other hand, for $\Lambda^{1}\geq 10$ , $\Lambda^{2}<7$ , the minimal $\bm{\Lambda}$-cumber sets are $\{(1,1)\}$ , $\{(2,2)\}$ , and $\{(3,3)\}$ .

Allowing only single-DM best replies in the definition of a cumber set results in the notion of a cusber set introduced in [23]. The following are true, for any $\bm{\Lambda}\in\mathbb{R}^{N}$ .

• $\bm{\Pi}$ is both a cumber set and a $\bm{\Lambda}$ -cumber set.

• $\bm{\pi}\in\bm{\Pi}_{\rm eq}$ $\Leftrightarrow$ $\{\bm{\pi}\}$ is a (minimal) cumber set.

• $\bm{\pi}\in\bm{\Pi}_{\rm eq}$ $\Rightarrow$ $\{\bm{\pi}\}$ is a (minimal) $\bm{\Lambda}$ -cumber set.

• ( $\bm{\pi}\in\bm{\Pi}$ , $\tilde{S}^{i}(\bm{\pi})\leq\Lambda^{i}$ , $\forall i$ ) $\Rightarrow$ $\{\bm{\pi}\}$ is a (minimal) $\bm{\Lambda}$ -cumber set.

• There is a multi-DM strict best reply path from any $\bm{\pi}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm cumber}$ to $\bm{\Pi}_{\rm cumber}$ .

• There is a multi-DM strict best reply path from any $\bm{\pi}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm cumber}^{\bm{\Lambda}}$ to $\bm{\Pi}_{\rm cumber}^{\bm{\Lambda}}$ .

• $\bm{\Pi}_{\rm cumber}=\bm{\Pi}_{\rm eq}\Leftrightarrow$ the game is weakly acyclic under multi-DM strict best replies.
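The closure and path properties listed above can be illustrated computationally. The following is a minimal sketch, assuming that the multi-DM strict best replies have already been summarized as a directed "successor" relation on joint policies, and reading "closed" as "contains every successor of each of its members". Under that reading, the minimal closed sets are exactly the sets of policies that can all reach one another and nothing else. The relation `succ` below is hypothetical: it is chosen only to mimic the shape of the example (one four-policy cycle and one absorbing policy) and is not derived from Figure 2.

```python
def reachable(succ, start):
    """All joint policies reachable from `start` (including itself) via succ."""
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for q in succ.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def minimal_closed_sets(succ):
    """Minimal sets closed under succ: reach(p) such that every member reaches the same set."""
    found = []
    for p in succ:
        r = reachable(succ, p)  # smallest closed set containing p
        if all(reachable(succ, q) == r for q in r):
            fs = frozenset(r)
            if fs not in found:
                found.append(fs)
    return found

# Hypothetical successor relation on joint policies (NOT taken from Figure 2).
succ = {
    (1, 1): [(2, 1)], (2, 1): [(2, 2)], (2, 2): [(1, 2)], (1, 2): [(1, 1)],
    (3, 3): [],                      # absorbing: no strict improvement available
    (1, 3): [(3, 3)], (3, 1): [(1, 1)],
    (2, 3): [(3, 3), (2, 2)], (3, 2): [(2, 2)],
}
minimal = minimal_closed_sets(succ)
print(minimal)
```

In this toy relation, every policy outside the two minimal closed sets has a path into one of them, which mirrors the path properties stated in the list.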

Let $\bar{L}_{\bm{\pi}}<|\bm{\Pi}|$ be the length of a multi-DM strict best reply path of minimal length from $\bm{\pi}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm cumber}$ to some $\tilde{\bm{\pi}}\in\bm{\Pi}_{\rm cumber}$ , and $\bar{L}:=\max_{\bm{\pi}\in\bm{\Pi}\setminus\bm{\Pi}_{\rm cumber}}\bar{L}_{\bm{\pi}}$ .

Assumption 4.

Assume, for all $i$ , $\delta^{i}\in(0,\bar{\delta})$ , $\rho^{i}\in(0,\rho^{\bm{\Lambda}})$ , where $\bar{\delta}$ , $\rho^{\bm{\Lambda}}$ are constants defined in Appendices A and C, respectively ( $\bar{\delta}$ depends only on the game, whereas $\rho^{\bm{\Lambda}}$ depends on the game and $\bm{\Lambda}$ ). Assume further that, for all $i$ , $n\in\mathbb{N}$ , $\gamma_{n}^{i}\in[0,1]$ , $\sum_{n\in\mathbb{N}}\gamma^{i}_{n}<\infty$ , and $\kappa^{i}\in(0,1)$ .

Theorem 3.

Consider a discounted stochastic game where each DMi updates its policies by Algorithm 3, and let Assumptions 1 and 4 hold.

1. Suppose that $g^{i}=R^{i,1}$ , $\forall i$ , and that there exists a nonempty set $\bm{\Pi}^{\bm{\Lambda}}\subset\bm{\Pi}$ satisfying

[TABLE]

Then, there exist $\tilde{T}_{k}\in\mathbb{N}_{+}$ , $k\in\mathbb{N}$ , such that if $T_{k}\geq\tilde{T}_{k}$ , $\forall k$ , then

[TABLE]

2. Suppose that $g^{i}=R^{i,\lambda^{i}}$ , $\forall i$ , and that there exists a cumber set $\bm{\Pi}^{\bm{\Lambda}}$ satisfying (7). Then there exist $\tilde{T}_{k}\in\mathbb{N}_{+}$ , $k\in\mathbb{N}$ , such that if $T_{k}\geq\tilde{T}_{k}$ , $\forall k$ , then

[TABLE]

3. Suppose that $g^{i}=R^{i,1}$ , $h^{i}=R^{i,\lambda^{i}}$ , $\forall i$ . Then,

[TABLE]

for some $\bar{p}_{\min}\in(0,1)$ which is independent of $\sum_{i}\kappa^{i}$ .

4. Suppose that $g^{i}=h^{i}=R^{i,\lambda^{i}}$ , $\forall i$ . Then,

[TABLE]

for some $\bar{p}_{\min}\in(0,1)$ which is independent of $\sum_{i}\kappa^{i}$ .

Proof.

See Appendix C. ∎

Algorithm 3 prescribes each DMi to update its policy differently (using the policy update kernels $g^{i}$ or $h^{i}$ coupled with different experimentation probabilities $\gamma_{k}^{i}$ or $\kappa^{i}$ ) depending on DMi’s assessment of whether its aspiration is achieved or not. The experimentation probability needs to vanish asymptotically for the former case but be positive throughout for the latter case ( $\kappa^{i}$ can be time-varying as long as it stays uniformly above zero). In practice, the experimentation probabilities for either case are envisioned to be (asymptotically) small so that the policy updates are primarily governed by $g^{i}$ and $h^{i}$ . With this in mind, Theorem 3 can be interpreted as follows.

The first part of Theorem 3 assumes (i) each DMi stays with its policy when it assesses that its aspiration is achieved, (ii) each policy $\bm{\pi}\in\bm{\Pi}$ either simultaneously achieves every DM’s aspiration (i.e., $\bm{\pi}\in\bm{\Pi}^{\bm{\Lambda}}$ ) or not a single DM’s aspiration (i.e., $\bm{\pi}\not\in\bm{\Pi}^{\bm{\Lambda}}$ ). With this (and regardless of $h^{i}$ ), DMs converge almost surely to an aspiration achieving joint policy. Note that this does not rule out convergence to a strictly dominated policy.

The second part assumes that the aspiration achieving policies are closed under multi-DM strict best replies. That is, it assumes $\bm{\Pi}^{\bm{\Lambda}}$ is a cumber set. Under this condition (and regardless of $h^{i}$ ), DMs converge almost surely to a subset of the aspiration achieving joint policies, which is a minimal cumber set. Note that this rules out neither persistent oscillations within a minimal cumber set (inside the aspiration achieving policies) nor convergence to a set of strictly dominated policies. However, in a weakly-acyclic game (under multi-DM strict best replies), convergence to an aspiration achieving equilibrium is guaranteed; in particular, the equilibrium policies not achieving DMs’ aspirations are ruled out. This implies convergence to an optimal policy in teams if the aspiration levels are between the cost of suboptimal and optimal equilibria. If $\bm{\Pi}^{\bm{\Lambda}}$ is not a cumber set, DMs can leave $\bm{\Pi}^{\bm{\Lambda}}$ through multi-DM strict best replies and the result may not hold.

Theorem 3 also predicts the long-term behavior of Algorithm 3 when the joint policies $\bm{\Pi}$ cannot be partitioned into aspiration achieving policies ( $\bm{\Pi}^{\bm{\Lambda}}$ ) and the other policies in the sense of (7). The third part of Theorem 3 assumes that each DMi stays with its policy when its aspiration is achieved, and otherwise best replies with inertia, i.e., $g^{i}=R^{i,1}$ , $h^{i}=R^{i,\lambda^{i}}$ . With this (and regardless of the game), DMs’ long-term probability of choosing a policy in a minimal $\bm{\Lambda}$ -cumber set (a minimal set that DMs cannot exit through the strict best replies of those whose aspirations are not achieved) can be arbitrarily close to one if the experimentation probabilities are sufficiently small. The fourth part assumes that each DMi always best replies with inertia (when it is not experimenting), i.e., $g^{i}=h^{i}=R^{i,\lambda^{i}}$ . With this (and regardless of the game), DMs tend to choose policies in a minimal cumber set (the equilibria and the minimal multi-DM strict best reply cycles) for small experimentation probabilities. Under the conditions of the third or the fourth part, DMs may not consistently achieve their aspirations.
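The two-branch policy update discussed in this section can be sketched in code. This is an illustrative sketch only, not the authors' Algorithm 3: the function name, its arguments, and the exact order of the random draws are assumptions. It assumes that experimentation means a uniform draw from the DM's own policy set, and that a kernel $R^{i,\lambda}$ keeps the current policy if it is already a best reply or, otherwise, with probability $\lambda$ (so that $R^{i,1}$ always stays). The aspiration assessment and the estimated best-reply set are taken as given inputs.

```python
import random

def update_policy(current, best_replies, aspiration_met,
                  gamma_k, kappa, inertia_g, inertia_h,
                  all_policies, rng):
    """One policy update for a single DM (hypothetical sketch).

    aspiration_met: the DM's own assessment (assumed computed elsewhere).
    gamma_k, inertia_g: experimentation prob. and inertia used when it is met.
    kappa, inertia_h:   experimentation prob. and inertia used when it is not.
    """
    if aspiration_met:
        explore_prob, inertia = gamma_k, inertia_g
    else:
        explore_prob, inertia = kappa, inertia_h
    # Experiment: uniform draw from the DM's own policy set.
    if rng.random() < explore_prob:
        return rng.choice(all_policies)
    # Inertial best reply: stay if already best replying, or with prob. `inertia`.
    if current in best_replies or rng.random() < inertia:
        return current
    return rng.choice(best_replies)

rng = random.Random(0)
policies = ["a", "b", "c"]
# Third-part setting: g = R^{i,1} (inertia 1), h = R^{i,lambda} with lambda = 0.3.
print(update_policy("a", ["b"], True, 0.0, 0.1, 1.0, 0.3, policies, rng))
```

Setting `inertia_g = inertia_h` gives the fourth-part setting, in which the DM best replies with inertia regardless of its aspiration assessment.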

VII A Simulation Study

We consider the following two-DM stochastic team with $\mathbb{U}^{1}=\mathbb{U}^{2}=\mathbb{X}=\{1,2\}$ and common discount factor $\beta=0.8$ . The stage cost for each state is presented in Figure 3.

$x_{t}=1$ is the low cost state and $x_{t}=2$ is the high cost state. The transition probabilities, given below, are constructed so that when DMs successfully coordinate their decisions (in a state-dependent manner) the state transitions with high probability to the low cost state. Otherwise, the state transitions with high probability to the high cost state. The transition kernel is fully described by

[TABLE]

In particular, when $x_{t}=2$ , DMs face a choice between, on the one hand, incurring a lower short-term cost of $10$ and likely remaining in the high cost state and, on the other hand, paying a higher short-term cost of $13$ in the hope of transitioning to the low cost state and avoiding sustained high costs.

For sufficiently large discount factors (including $\beta=0.8$ as selected), the unique team optimal policy is for both DMs to coordinate as $u_{t}^{1}=u_{t}^{2}=x_{t}$ , for all $t\in\mathbb{N}$ . However, there are three suboptimal equilibrium policies, namely (i) $u_{t}^{1}=u_{t}^{2}=1$ , for all $t\in\mathbb{N}$ , (ii) $u_{t}^{1}=u_{t}^{2}=2$ , for all $t\in\mathbb{N}$ , (iii) $u_{t}^{1}=u_{t}^{2}\not=x_{t}$ , for all $t\in\mathbb{N}$ .

We simulated Algorithms 2 and 3 with the following parameter choices.

[TABLE]

where the aspiration level $\Lambda^{i}$ used in cases C and D was chosen without extensive tuning.

The algorithms performed generally as expected. The disparity across different cases owes largely to the parameter selections. In each case, the percentage of time during which the joint policies are team optimal, i.e., $\bm{\pi}_{k}\in\bm{\Pi}_{\rm opt}$ , is shown below.

[TABLE]

As the experimentation probability $\gamma$ is reduced, the empirical frequency of the event $\bm{\pi}_{k}\in\bm{\Pi}_{\rm opt}$ increases and, for $\gamma=0.001$ , the joint policies are team optimal for more than $90\%$ of the time. These numerical results are consistent with the theoretical results.

VIII Concluding Remarks

In this paper, we presented learning algorithms for stochastic teams and common interest games under a decentralized information structure in which players do not share actions with one another. While previous studies have focused on repeated games, or otherwise used a large degree of control sharing among decision makers to obtain convergence results, we have provided a method for achieving team optimality in teams and stochastic common interest games without any control sharing during play and with limited prior information about the game.

The proof methods used in this paper center on approximating the true joint policy-valued stochastic process using time-homogeneous Markov chains through a novel analysis based on Dobrushin’s coefficient. The algorithms presented are amenable to further variations and can be modified as needed, and the Markov chain analysis used for the convergence guarantees can likewise be easily modified for more general applications.

We chose to focus on games with full state observations available to each agent since there are few formal results on multi-agent learning even under this simplifying assumption. The partially observed information structure, in which each player has access to only local state information, is an important and challenging direction for future research.

Appendix A: Proof of Theorem 1

Let $\sigma(A)\in[0,1]$ denote the Dobrushin coefficient of an $n\times n$ right stochastic matrix $A$ , defined as [11]

[TABLE]
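As a concrete illustration, the following sketch computes a Dobrushin-type coefficient and checks the associated $\ell_1$ contraction on a small example. It assumes the standard form $\sigma(A)=\min_{j,k}\sum_{i}\min\{A(j,i),A(k,i)\}$ ; this form is an assumption made here for illustration and should be checked against the definition displayed above. The matrix `A` and the vectors `mu`, `nu` are hypothetical.

```python
def dobrushin(A):
    """min over row pairs (j, k) of sum_i min(A[j][i], A[k][i]) (assumed form)."""
    n = len(A)
    return min(sum(min(A[j][i], A[k][i]) for i in range(n))
               for j in range(n) for k in range(n))

def push(mu, A):
    """Row vector times matrix: (mu A)_j = sum_i mu_i A[i][j]."""
    n = len(A)
    return [sum(mu[i] * A[i][j] for i in range(n)) for j in range(n)]

def l1(u, v):
    """l1 distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical 3x3 right stochastic matrix (each row sums to one).
A = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
mu, nu = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]

sigma = dobrushin(A)
lhs = l1(push(mu, A), push(nu, A))   # ||mu A - nu A||_1
rhs = (1 - sigma) * l1(mu, nu)       # (1 - sigma(A)) ||mu - nu||_1
print(sigma, lhs, rhs)
```

A larger coefficient means that any two rows of $A$ overlap more, so one step of the chain pulls any two distributions closer together.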

Lemma 4.

Consider an $n\times n$ right stochastic matrix $A$ with $\sigma(A)>0$ , and a sequence of $n\times n$ right stochastic matrices $\{A_{k}\}_{k\in\mathbb{N}}$ . For any $\epsilon\in(0,1)$ , if

[TABLE]

then, for any probability vector $\mu_{0}$ of dimension $n$ ,

[TABLE]

where $\mu^{*}$ is the unique probability vector satisfying $\mu^{*}=\mu^{*}A$ .

Proof.

Recall that $\|\mu A-\nu A\|_{1}\leq(1-\sigma(A))\|\mu-\nu\|_{1}$ , for all probability vectors $\mu$ , $\nu$ ; see [11]. Since $\sigma(A)>0$ , by Banach’s fixed point theorem, there exists a unique probability vector $\mu^{*}$ satisfying $\mu^{*}=\mu^{*}A$ , and $\lim_{k}\mu_{0}A^{k}=\mu^{*}$ , for any probability vector $\mu_{0}$ .

From (8)-(9), we have $\sup_{k\in\mathbb{N}}|\sigma(A_{k})-\sigma(A)|\leq n\tau$ , which implies $\sup_{k\in\mathbb{N}}(1-\sigma(A_{k}))\leq\xi:=1-\sigma(A)/2.$ Note $\xi\in(0,1)$ . We write

[TABLE]

Repeated application of these inequalities results in

[TABLE]

where $\epsilon=n\tau\frac{1}{1-\xi}$ , which is consistent with (9). As $\lim_{k}\xi^{k}\|\mu_{0}-\mu^{*}\|_{1}=0$ , the lemma follows. ∎
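The conclusion of Lemma 4 can be checked numerically on a toy example. The sketch below assumes the same form of $\sigma(\cdot)$ as above and takes the hypothesis of the lemma to be an entrywise bound $|A_{k}(i,j)-A(i,j)|\leq\tau$ with $n\tau\leq\sigma(A)/2$ ; both are assumptions about the displayed conditions. The matrix `A` and the perturbation scheme are hypothetical. The check is that $\|\mu_{k}-\mu^{*}\|_{1}\leq\xi^{k}\|\mu_{0}-\mu^{*}\|_{1}+\epsilon$ with $\xi=1-\sigma(A)/2$ and $\epsilon=n\tau/(1-\xi)$ .

```python
def dobrushin(A):
    n = len(A)
    return min(sum(min(A[j][i], A[k][i]) for i in range(n))
               for j in range(n) for k in range(n))

def push(mu, A):
    n = len(A)
    return [sum(mu[i] * A[i][j] for i in range(n)) for j in range(n)]

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def stationary(A, iters=500):
    """Approximate mu* = mu* A by power iteration (valid when sigma(A) > 0)."""
    mu = [1.0 / len(A)] * len(A)
    for _ in range(iters):
        mu = push(mu, A)
    return mu

def perturbed(A, tau, k):
    """Hypothetical A_k: shift mass tau between columns 0 and 1 of each row,
    alternating direction, so rows still sum to one and |A_k - A| <= tau."""
    B = [row[:] for row in A]
    for i, row in enumerate(B):
        s = tau if (i + k) % 2 == 0 else -tau
        row[0] += s
        row[1] -= s
    return B

A = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]
n, tau = len(A), 0.01
xi = 1 - dobrushin(A) / 2
eps = n * tau / (1 - xi)
mu_star = stationary(A)

mu = [1.0, 0.0, 0.0]
d0 = l1(mu, mu_star)
bound_ok = True
for k in range(200):
    mu = push(mu, perturbed(A, tau, k))   # mu_{k+1} = mu_k A_k
    bound_ok = bound_ok and l1(mu, mu_star) <= xi ** (k + 1) * d0 + eps + 1e-12
final_gap = l1(mu, mu_star)
print(bound_ok, final_gap, eps)
```

The geometric term $\xi^{k}\|\mu_{0}-\mu^{*}\|_{1}$ vanishes, so the long-run gap is governed by the perturbation size alone.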

Proof of Theorem 1

Let $\epsilon\in(0,1)$ and $\bm{\kappa}\in(0,1)^{N}$ . By Lemma 2, there exists $\bar{\gamma}_{\epsilon}(\bm{\kappa})$ such that $\max_{i}\gamma^{i}\in(0,\bar{\gamma}_{\epsilon}(\bm{\kappa}))$ implies $\mu_{\bm{\gamma,\kappa,h}}^{*}(\bm{\Pi}_{\rm opt})\geq 1-\epsilon/2$ , where $\mu_{\bm{\gamma,\kappa,h}}^{*}$ is the unique invariant measure of the Markov chain induced by the IUP. Assume $\max_{i}\gamma^{i}\in(0,\bar{\gamma}_{\epsilon}(\bm{\kappa}))$ .

For all $k\in\mathbb{N}$ , $\bm{\pi}$ , $\bm{\pi}^{\prime}\in\bm{\Pi}$ , we define

[TABLE]

where $\bm{\pi}_{k}$ is the joint baseline policy during the $k^{th}$ exploration phase of Algorithm 2. Note that $\mu_{k+1}=\mu_{0}A_{0}\cdots A_{k}$ . To prove the theorem, we will show

[TABLE]

Due to Lemma 4 and $\sigma(A_{\bm{\gamma,\kappa,h}})>0$ (indeed, $A_{\bm{\gamma,\kappa,h}}(\bm{\pi},\bm{\pi}^{\prime})\geq\prod_{i=1}^{N}\min\{\gamma^{i},\kappa^{i}\}/|\Pi^{i}|>0$ , $\forall\bm{\pi},\bm{\pi}^{\prime}\in\bm{\Pi}$ , due to uniform experimentation by each DMi with probability $\gamma^{i}>0$ or $\kappa^{i}>0$ ; by (8), this implies $\sigma(A_{\bm{\gamma,\kappa,h}})>0$ ), it is sufficient to show

[TABLE]

for all but finitely many $k\in\mathbb{N}$ . We note that $\sigma(A_{\bm{\gamma,\kappa,h}})>0$ since all entries of $A_{\bm{\gamma,\kappa,h}}$ are strictly positive, as the IUP updates policies using uniform randomization with strictly positive probability owing to $\gamma^{i},\kappa^{i}>0$ for every DMi.
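The positivity of the Dobrushin coefficient can be made quantitative. The following worked step is a sketch under two assumptions: that $\sigma(\cdot)$ has the standard form $\min_{j,k}\sum_{i}\min\{A(j,i),A(k,i)\}$ , and that the joint policy set is the product of the individual policy sets, so that $|\bm{\Pi}|=\prod_{i}|\Pi^{i}|$ . Using the entrywise lower bound on $A_{\bm{\gamma,\kappa,h}}$ noted above:

```latex
\sigma(A_{\bm{\gamma,\kappa,h}})
  = \min_{\bm{\pi},\tilde{\bm{\pi}}\in\bm{\Pi}} \sum_{\bm{\pi}'\in\bm{\Pi}}
      \min\bigl\{ A_{\bm{\gamma,\kappa,h}}(\bm{\pi},\bm{\pi}'),\,
                  A_{\bm{\gamma,\kappa,h}}(\tilde{\bm{\pi}},\bm{\pi}') \bigr\}
  \;\geq\; |\bm{\Pi}| \prod_{i=1}^{N} \frac{\min\{\gamma^{i},\kappa^{i}\}}{|\Pi^{i}|}
  \;=\; \prod_{i=1}^{N} \min\{\gamma^{i},\kappa^{i}\} \;>\; 0 .
```

Under these assumptions the lower bound depends only on the experimentation probabilities and not on the sizes of the policy sets.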

To ensure (12), we will introduce an event $R_{k}$ such that, for all $\bm{\pi}$ , $\bm{\pi}^{\prime}\in\bm{\Pi}$ , and all but finitely many $k\in\mathbb{N}$ ,

[TABLE]

and we will show that

[TABLE]

by choosing the parameters of Algorithm 2 appropriately. Note that (13)-(14) imply (12) as follows:

[TABLE]

where $R_{k}^{c}$ denotes the complement of $R_{k}$ .

Define

[TABLE]

Let $\bar{\pi}_{k}^{i}\in\Delta^{i}$ denote the policy used by DMi in the $k^{th}$ exploration phase, i.e.,

[TABLE]

Let $\bar{\rho}>0$ be such that $\max_{i}\rho^{i}\in(0,\bar{\rho})$ implies

[TABLE]

Such $\bar{\rho}$ exists due to [2, Lemma 3]. Assume that, for all $i$ , $d^{i}\in(0,\bar{d})$ , $\delta^{i}\in(0,\bar{\delta})$ , $\rho^{i}\in(0,\bar{\rho})$ . Assume further that

[TABLE]

where $\phi:=\prod_{i}\min\{\gamma^{i}/|\Pi^{i}|,\kappa^{i}/|\Pi^{i}|\}\in(0,1)$ .

For any time $k\geq W_{\max}$ , we define the event

[TABLE]

where, for any $\ell\in\mathbb{N}$ , we define

[TABLE]

Conditioned on $R_{k}$ , $k\geq W_{\max}$ , we have, for all $i$ ,

[TABLE]

and

[TABLE]

This implies (13), for all $k\geq W_{\max}$ . Intuitively, the event $R_{k}$ guarantees that (i) all players have sufficiently reliable Q-factors during the $k^{th}$ exploration phase, due to $F_{k}$ ; (ii) for every DMi, the estimated cost scores are sufficiently close to the true cost scores during every exploration phase in DMi’s most recent memory window, by $G_{k}$ ; (iii) an optimal baseline policy was visited recently enough that all players remember its cost score, by $H_{k}$ .

We will now show (14) for sufficiently large exploration lengths $\{T_{\ell}\}_{\ell}$ . (Since the events $G_{\ell}$ and $F_{\ell}$ are defined in terms of Q-factors, this is a statement about the long term behavior of the Q-factor iterates within an exploration phase.)

Note that within the $k^{th}$ exploration phase of Algorithm 2 (and Algorithm 3), the environment faced by each DMi, which is determined by $\bm{\pi}^{-i}_{k}$ , is a stationary MDP (with finite state and control spaces) and satisfies the usual conditions of stochastic approximation theory. In such a setting, it is well-known that the sequence of Q-factors produced by the standard Q-learning algorithm from any initial condition is bounded and convergent with probability one [54]. Since each exploration phase in Algorithm 2 (and in Algorithm 3) starts with re-initialized Q-factors (and what we may call J-factors in the case of Algorithm 3) within the compact sets $\{\mathbb{Q}^{i}\}_{i=1}^{N}$ (and $\{\mathbb{J}^{i}\}_{i=1}^{N}$ ), the Q-factors (and J-factors) produced by Algorithm 2 (and Algorithm 3) during any exploration phase remain bounded with probability one; cf. Lemma 1 in [2].

Furthermore, Lemma 1 in [2] shows that, uniformly in the initial conditions within $\{\mathbb{Q}^{i}\}_{i=1}^{N}$ (and $\{\mathbb{J}^{i}\}_{i=1}^{N}$ ), the Q-factors (and J-factors) produced by Algorithm 2 (and Algorithm 3) enter any arbitrarily small neighborhood of their limits with arbitrarily high probability at the end of any sufficiently long exploration phase.
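The boundedness property used here can be seen in a small experiment. The sketch below runs standard tabular Q-learning (cost minimization) on a stationary MDP for one exploration phase of length $T$ . It is not the paper's Algorithm 2 or 3: the MDP (`P`, `c`), the uniform exploration rule, and the step size $1/(\text{visit count})$ are all assumptions chosen for illustration. With costs in $[0,c_{\max}]$ and initial Q-factors in $[0,c_{\max}/(1-\beta)]$ , each update is a convex combination that stays in the same interval.

```python
import random

def q_learning_phase(P, c, beta, T, q_init, rng):
    """Run T steps of tabular Q-learning on a stationary MDP.

    P[x][u] is the next-state distribution, c[x][u] the stage cost.
    Controls are drawn uniformly (an assumption made for this sketch).
    """
    n_x, n_u = len(c), len(c[0])
    Q = [row[:] for row in q_init]
    visits = [[0] * n_u for _ in range(n_x)]
    x = 0
    for _ in range(T):
        u = rng.randrange(n_u)
        x_next = rng.choices(range(n_x), weights=P[x][u])[0]
        visits[x][u] += 1
        alpha = 1.0 / visits[x][u]                  # decreasing step size
        target = c[x][u] + beta * min(Q[x_next])    # one-step cost-to-go estimate
        Q[x][u] = (1 - alpha) * Q[x][u] + alpha * target
        x = x_next
    return Q

# Hypothetical two-state, two-control MDP (not the one in the simulation study).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.3, 0.7], [0.8, 0.2]]]
c = [[1.0, 3.0],
     [4.0, 2.0]]
beta = 0.8
bound = max(max(row) for row in c) / (1 - beta)     # c_max / (1 - beta)

Q = q_learning_phase(P, c, beta, 5000, [[0.0, 0.0], [0.0, 0.0]], random.Random(1))
print(Q, bound)
```

Longer phases bring the Q-factors closer to their limits, which is the role played by the exploration phase lengths $T_{k}$ in the analysis.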

By [2, Lemma 4], there exists $T_{\epsilon}(\bm{\gamma,\kappa},W_{\max})\in\mathbb{N}_{+}$ such that if $\min_{k\in\mathbb{N}}T_{k}\geq T_{\epsilon}(\bm{\gamma,\kappa},W_{\max})$ , we have

[TABLE]

This implies, for $k\geq W_{\max}$ ,

[TABLE]

In addition, we have, for $k\geq W_{\max}$ ,

[TABLE]

All together, the preceding imply, for $k\geq W_{\max}$ ,

[TABLE]

Since $\inf_{k\in\mathbb{N},\bm{\pi}\in\bm{\Pi}}\text{Pr}(\bm{\pi}_{k}=\bm{\pi})\geq\phi>0$ , (16) implies (14), because (16) implies $\text{Pr}(R_{k}\cap\{\bm{\pi}_{k}=\bm{\pi}\})\geq(1-\tau)\text{Pr}(\bm{\pi}_{k}=\bm{\pi})$ for any $k,\bm{\pi}$ .

We have shown that (13) and (14) hold. In turn, this implies (12) holds, and invoking Lemma 4 completes the proof. $\square$

Appendix B: Proof of Theorem 2

Lemma 5.

Consider an $n\times n$ right stochastic matrix $A$ , and a sequence of $n\times n$ right stochastic matrices $\{A_{k}\}_{k\in\mathbb{N}}$ . For any $\epsilon\in(0,1)$ , $m\in\mathbb{N}$ , if

[TABLE]

then

[TABLE]

where $\mu_{0}$ is any probability vector of dimension $n$ .

Proof of Theorem 2

Let $\epsilon\in(0,1)$ . Assume

[TABLE]

where $\bar{\kappa}_{\epsilon}$ and $\bar{m}$ are as in Lemma 3. Then, assume

[TABLE]

where $\bar{\gamma}_{\epsilon}(\bm{\kappa})$ is as in Lemma 2. With these choices of $\bm{\gamma}$ , $\bm{\kappa}$ , Lemma 3 holds, i.e.,

[TABLE]

For any $k\in\mathbb{N}$ , defining $A_{k}$ as in (11), we have

[TABLE]

By [2, Lemma 4], there exists $T_{\epsilon}\in\mathbb{N}_{+}$ such that if $T_{k}\geq T_{\epsilon}$ ,

[TABLE]

Assume that, for all $i$ , $k\in\mathbb{N}$ ,

[TABLE]

where $\bar{W}_{\epsilon}(\bm{\gamma,\kappa})$ , $\bar{T}_{\epsilon}(\bm{\gamma,\kappa},W_{\max})$ are as in Theorem 1. By (19)-(20) and the assumptions on $\bm{\gamma}$ , $\bm{\kappa}$ , $\{T_{k}\}_{k\in\mathbb{N}}$ , we have

[TABLE]

Lemma 5 implies

[TABLE]

The desired result for weakly acyclic games follows from (18)-(21). Note that the parameter choices satisfy the hypothesis of Theorem 1; hence, the results of Theorem 1 also hold.

Appendix C: Proof of Theorem 3

Let $\bar{\delta}^{i}$ be as in (15), and let $\rho^{\bm{\Lambda}}\in(0,1)$ be such that $\rho^{i}\in(0,\rho^{\bm{\Lambda}})$ , for all $i$ , implies

[TABLE]

and

[TABLE]

where $\bar{\pi}^{i}_{k}(\cdot|x_{k})=(1-\rho^{i})\mathbb{I}_{\pi^{i}_{k}(x_{k})}+\rho^{i}\text{Unif}(\mathbb{U}^{i})$ . Such $\rho^{\bm{\Lambda}}\in(0,1)$ exists due to [2, Lemma 3].

Let $\epsilon_{k}\in(0,1)$ , $k\in\mathbb{N}$ , be such that $\sum_{k\in\mathbb{N}}\epsilon_{k}<\infty$ . Due to [2, Lemma 1], there exist finite integers $\tilde{T}_{k}\in\mathbb{N}_{+}$ , $k\in\mathbb{N}$ , such that if $T_{k}\geq\tilde{T}_{k}$ , for all $k\in\mathbb{N}$ ,

[TABLE]

Assume now $T_{k}\geq\tilde{T}_{k}$ , for all $k\in\mathbb{N}$ . Hence, there exists $\tilde{k}\in\mathbb{N}_{+}$ such that, for all $i$ , $k\geq\tilde{k}$ ,

[TABLE]

where

[TABLE]

We have, for all $k\geq\tilde{k}$ ,

[TABLE]

This leads to, with some algebra, for all $k\geq\tilde{k}$ ,

[TABLE]

Since $\Big{|}1-\prod_{i}(\kappa^{i}/|\Pi^{i}|)\Big{|}<1$ , $\sum_{k\in\mathbb{N}}\epsilon_{k}<\infty$ , and $\sum_{i,k\in\mathbb{N}}\gamma_{k}^{i}<\infty$ , we have $\sum_{k\in\mathbb{N}}\text{Pr}(\bm{\pi}_{k}\not\in\bm{\Pi}^{\bm{\Lambda}})<\infty$ . The Borel-Cantelli lemma implies

[TABLE]

Also, $\sum_{k\in\mathbb{N}}\text{Pr}(\bm{\pi}_{k}\in\bm{\Pi}^{\bm{\Lambda}},\tilde{S}_{k}^{i}\geq\Lambda^{i})<\infty$ , hence

[TABLE]

This proves the first part.

2. We have, for all $k\geq\tilde{k}$ ,

[TABLE]

There exists $\bar{p}_{\min}\in(0,1)$ (which depends only on $\lambda^{1},\dots,\lambda^{N}$ , $|\Pi^{1}|,\dots,|\Pi^{N}|$ , $\bar{L}$ ) such that, for all $k\geq\tilde{k}$ ,

[TABLE]

and

[TABLE]

This leads to, for all $k\geq\tilde{k}$ ,

[TABLE]

Since $\Big{|}1-\min\Big{\{}\bar{p}_{\min},\prod_{i}\frac{\kappa^{i}}{|\Pi^{i}|}\Big{\}}\Big{|}<1$ , $\sum_{k\in\mathbb{N}}\epsilon_{k}<\infty$ , and $\sum_{i,k\in\mathbb{N}}\gamma_{k}^{i}<\infty$ , we have $\sum_{j\in\mathbb{N}}\text{Pr}(\bm{\pi}_{k+j\bar{L}}\not\in\bm{\Pi}^{\bm{\Lambda}}\cap\bm{\Pi}_{\rm cumber})<\infty$ , for all $k\in\mathbb{N}$ . This results in $\sum_{k\in\mathbb{N}}\text{Pr}(\bm{\pi}_{k}\not\in\bm{\Pi}^{\bm{\Lambda}}\cap\bm{\Pi}_{\rm cumber})<\infty$ . The Borel-Cantelli lemma implies

[TABLE]

Also $\sum_{k\in\mathbb{N}}\text{Pr}(\text{BR}_{k}^{i}\not=\text{BR}^{i}(\bm{\pi}_{k}^{-i}))<\infty$ , hence,

[TABLE]

This proves the second part.

3. We have, for all $k\geq\tilde{k}$ ,

[TABLE]

We also have, for all $k\geq\tilde{k}$ ,

[TABLE]

where $\bar{p}_{\min}\in(0,1)$ is as in (22). This leads to, for all $k\geq\tilde{k}$ ,

[TABLE]

Since $|1-\bar{p}_{\min}|<1$ and $\lim_{k\to\infty}\sum_{n=k}^{k+\bar{L}-1}\epsilon_{n}=0$ , we have, for all $k\in\mathbb{N}$ ,

[TABLE]

This proves the third part.

4. The proof follows exactly as in the third part, with $\bm{\Pi}_{\rm cumber}^{\bm{\Lambda}}$ replaced by $\bm{\Pi}_{\rm cumber}$ .

Bibliography

[1] Eitan Altman. Non zero-sum stochastic games in admission, service and routing control in queueing systems. Queueing Systems, 23(1-4):259–279, 1996.
[2] Gürdal Arslan and Serdar Yüksel. Decentralized Q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, 2017.
[3] Robert J Aumann and Sylvain Sorin. Cooperation and bounded recall. Games and Economic Behavior, 1(1):5–39, 1989.
[4] Vivek Borkar. Reinforcement learning in Markovian evolutionary games. Advances in Complex Systems, 5(1):55–72, 2002.
[5] Vivek S Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
[6] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250, 2002.
[7] G. C. Chasparis, A. Arapostathis, and J. S. Shamma. Aspiration learning in coordination games. SIAM J. Control and Optimization, 51(1):465–490, 2013.
[8] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Tenth Innovative Applications of Artificial Intelligence Conference, Madison, Wisconsin, pages 746–752, July 1998.
