Rethinking Formal Models of Partially Observable Multiagent Decision Making

Vojtěch Kovařík; Martin Schmid; Neil Burch; Michael Bowling; Viliam Lisý

arXiv:1906.11110·cs.AI·September 29, 2021

TL;DR

This paper introduces factored-observation stochastic games (FOSGs) to clarify the relationship between extensive-form games and partially observable stochastic games, enabling better transfer of ideas and improved decomposition in multiagent decision-making.

Contribution

It proposes FOSGs as a new formalism that connects EFGs and POSGs, and demonstrates how key EFG techniques can be adapted within this framework.

Findings

1. FOSGs simplify decomposition in multiagent models.

2. Unrolling FOSGs yields equivalent extensive-form games.

3. Key EFG algorithms are adaptable to FOSGs.

Abstract

Multiagent decision-making in partially observable environments is usually modelled as either an extensive-form game (EFG) in game theory or a partially observable stochastic game (POSG) in multiagent reinforcement learning (MARL). One issue with the current situation is that while most practical problems can be modelled in both formalisms, the relationship of the two models is unclear, which hinders the transfer of ideas between the two communities. A second issue is that while EFGs have recently seen significant algorithmic progress, their classical formalization is unsuitable for efficient presentation of the underlying ideas, such as those around decomposition. To solve the first issue, we introduce factored-observation stochastic games (FOSGs), a minor modification of the POSG formalism which distinguishes between private and public observation and thereby greatly simplifies…

Tables

Table 1: A “dictionary” of the basic concepts of MDPs (and other models used in RL), EFGs, and FOSGs. The correspondences are on the level of usage; for example, the correspondence between a state in an MDP and a history in an EFG does not imply they are mathematically equivalent, but rather that the role states play in MDPs is analogous to the role histories play in EFGs.

| Role of the concept | RL symbol | RL term | EFG symbol | EFG term | FOSG symbol | FOSG term |
| --- | --- | --- | --- | --- | --- | --- |
| The actor in the task. | $i$ | agent | $i$ | player | $i$ | agent, player |
| The acting rule. | $π$ | policy | $σ \in Σ$ | strategy | $π$ / $σ$ | policy / strategy |
| A measure of how long the task takes. | $T$ | finite horizon | $D$ | game depth | $T$ | finite horizon |
| The basic notion for describing the “state of the world”. | $s \in 𝒮$ | state | $h \in ℋ$ | history | $w \in 𝒲$ | world state |
| A list of everything that has happened so far, from the objective point of view. | – | trajectory | $h \in ℋ$ | history | $h \in ℋ$; $t$ | history; trajectory |
| Information based on which actors select their next actions. | $s$; $b$ | state (MDP); belief (POMDP) | $I \in ℐ_{i}$ | infoset | $s_{i} \in 𝒮_{i}$ | action-obs. history |
| The immediate feedback on how well the actor is currently doing. | $r$ | reward | | doesn’t exist | $𝒪_{i} (\cdot)$ | observation |
| The quantity that actors optimize for. | | cum. reward | $u_{i}$ | utility | $u_{i} (h)$; $ℛ_{i} (h)$ | utility; cum. reward |
| The desirability of a given situation for an actor who uses a given acting rule. | $V^{π} (s)$ | $s$ ’s value for $π$ | – | exp. utility at $h$ | $v_{i}^{σ} (h)$ | history value |
| The cause of non-determinism in the task. | $𝒯 (s, a)$ | transition prob. | $σ_{c} (h, a)$ | chance strategy | $𝒯 (w, a)$ | transition prob. |
| The probability of ending up in a given situation. | | | $π^{σ} (h)$ | reach probability | $P^{π} (h)$ | reach prob. |

Equations

$h^{\prime}=(w^{0},\textnormal{ante},\dots,\textnormal{bet},w^{5})=((\emptyset,\emptyset,2,2,1),\textnormal{ante},\dots,(Q,K,0,1,2))$

$s_{1}(h^{\prime})=(\textnormal{ante},(\emptyset,\textnormal{ante}),(\emptyset,\textnormal{ante}),(Q,?\to 1),(\emptyset,?\to 2),\textnormal{bet},(\emptyset,\textnormal{bet}))$

$v_{i}^{\pi}(h):=\mathbb{E}\left[R_{i}(\tau)\mid\tau\textnormal{ generated from }\pi\textnormal{ and }T\textnormal{, current history is }h\right].$

$v_{i}^{\pi}(s_{i}):=\sum_{s_{i}(h)=s_{i}}P^{\pi}(h\mid s_{i})\,v_{i}^{\pi}(h),\qquad q_{i}^{\pi}(s_{i},a):=\sum_{s_{i}(h)=s_{i}}P^{\pi}(h\mid s_{i})\,q_{i}^{\pi}(h,a_{i}),$

$R_{i}^{t}(s_{i},a_{i})=\sum_{k=0}^{t-1}\left(q_{i,\textnormal{cf}}^{\pi^{k}}(s_{i},a_{i})-v_{i,\textnormal{cf}}^{\pi^{k}}(s_{i})\right)$

$\pi^{t}(s_{i},a_{i}):=\left(R_{i}^{t}(s_{i},a_{i})\right)^{+}\Big/\sum_{a^{\prime}\in A(s)}\left(R_{i}^{t}(s,a^{\prime})\right)^{+},$

$A_{\sigma\tau}=\sum_{z=haw\in Z\,:\,b_{1}(ha)=\sigma,\,b_{2}(ha)=\tau}P_{T}(z)\,u_{1}(z).$

$E_{z}[u_{i}(z)\mid\pi]=x^{\top}Ay.$

$x_{\emptyset}$

$\sum_{a^{\prime}\in A(sao)}x_{saoa^{\prime}}$

$Ex=e,\ x$

$Fy=f,\ y$

$\underset{u,\,y}{\textnormal{minimize}}\ e^{T}u\quad\textnormal{subj. to}\quad Fy=f,\ E^{T}u-Ay\geq 0,\ y\geq 0.$

$I_{\textnormal{pub}}(s^{\prime})=\bigcup\left\{I_{i}(s_{i}^{\prime})\mid s_{i}^{\prime}\in S_{i}(s^{\prime})\right\}.$

Full text

Vojtěch Kovařík (AI Center, FEE, Czech Technical University in Prague, Technická 2, Prague, 166 27, Czech Republic)

Martin Schmid (DeepMind, Alberta, Edmonton)

Neil Burch (DeepMind, Alberta, Edmonton)

Michael Bowling (DeepMind, Alberta, Edmonton)

Viliam Lisý (AI Center, FEE, Czech Technical University in Prague, Technická 2, Prague, 166 27, Czech Republic)

Abstract

Multiagent decision-making in partially observable environments is usually modelled as either an extensive-form game (EFG) in game theory or a partially observable stochastic game (POSG) in multiagent reinforcement learning (MARL). One issue with the current situation is that while most practical problems can be modelled in both formalisms, the relationship of the two models is unclear, which hinders the transfer of ideas between the two communities. A second issue is that while EFGs have recently seen significant algorithmic progress, their classical formalization is unsuitable for efficient presentation of the underlying ideas, such as those around decomposition.

To solve the first issue, we introduce factored-observation stochastic games (FOSGs), a minor modification of the POSG formalism which distinguishes between private and public observation and thereby greatly simplifies decomposition. To remedy the second issue, we show that FOSGs and POSGs are naturally connected to EFGs: by “unrolling” a FOSG into its tree form, we obtain an EFG. Conversely, any perfect-recall timeable EFG corresponds to some underlying FOSG in this manner. Moreover, this relationship justifies several minor modifications to the classical EFG formalization that recently appeared as an implicit response to the model’s issues with decomposition. Finally, we illustrate the transfer of ideas between EFGs and MARL by presenting three key EFG techniques – counterfactual regret minimization, sequence form, and decomposition – in the FOSG framework.

Keywords: Imperfect Information Game, Multiagent Reinforcement Learning, Extensive Form Game, Partially-Observable Stochastic Game, Public Information, Decomposition

Journal: Artificial Intelligence

1 Introduction

Sequential decision-making is one of the core topics of artificial intelligence research. The ability of an artificial agent to perform actions, observe their consequences, and then perform further actions to achieve a goal is instrumental in domains from robotics and autonomous driving to medical diagnosis and automated personal assistants. Recent progress has led to unprecedented results in many large-scale problems of this type. While conceptually simpler problems can be modelled with perfect information or by regarding the other agents as stationary parts of the environment, realistic models of real-world situations require rigorous treatment of imperfect information and multiple independent decision-makers operating in a shared environment. The most popular game-theoretical model in these settings – the extensive-form game (EFG) – dates back to 1953 [33]. EFGs have served the community well, and many impressive results build on top of this particular framework [30, 6, 7].

However, EFGs lose crucial information inherently present in many environments – the notion of observations received by the agents. Observations are essential not only for specifying what information was received, but also to know who received it and when. Since EFGs simply group states indistinguishable to the acting player, the notion of information being public or private is forever lost, and so are the data about the timing. However, these concepts are essential for recent search algorithms [30, 5, 4], where decomposition and reasoning about subgames crucially rely on the notion of public information. While it is common to try to recover the necessary information from the EFG model [10, 22, 47], we show that this is impossible to do in general. Practical implementations thus bypass the model by using algorithms that are built with game-specific concepts (e.g., dealing cards in poker), rather than developing algorithms running purely on top of the information provided by the formal model [30, 6].

We argue that by a minor modification of another popular model – partially-observable stochastic game (POSG) [18] – we can naturally describe domains in a way that preserves this necessary information. While there is a considerable amount of literature on both EFGs and POSGs, the communities using these models do not see much interaction and sharing of results. Therefore, besides slightly augmenting POSGs, we thoroughly analyze the differences and the similarities between the models. We show that we do not need to view POSGs and EFGs as competing models — instead, a POSG can be viewed as the underlying model from which one derives the EFG representation. Since many research questions depend only on those aspects of the game that are preserved by the EFG representation, most of the existing EFG results are directly applicable to POSGs. Moreover, the relationship between POSGs and EFGs provides an opportunity for using novel tools in EFGs, discovering new connections between EFGs and POSGs, and bringing existing results from either community to the attention of a wider audience. However, to fully utilize this relationship, we first need to extend the POSG model with factorized observations.

Outline and Contributions

We introduce the factored-observation stochastic game (FOSG), a generalization of the POSG model that factorizes each player’s observations into public and private ones (Section 2). In Section 3, we show that by “unrolling” a FOSG, we can obtain its tree-shaped extensive-form representation. This representation is similar to the standard formalization of EFGs, except for being augmented by the information about public knowledge and about the players’ knowledge outside of their turn. Note that rather than being “yet another new model”, this formalization of EFGs merely takes several changes to the classical definition that already appeared in various recent works [47, 8, 42] and makes them explicit. While we prefer referring to this formalization simply as “EFG”, this paper shall sometimes use the term “augmented EFG” to distinguish this formalization from the historical one.

We proceed with analyzing the expressive power of the EFG and FOSG models. EFGs are slightly more general, because they enable modelling situations where the environment prevents the agents from tracking time (i.e., non-timeable games [21]) or forces them to forget a particular piece of information they previously learned (i.e., imperfect recall [48]). Importantly, no game played in the real world against people can have these properties. As a result, research in EFGs typically assumes that none of the studied games has these properties (unless the properties are the primary focus of the investigation). From now on, we will, therefore, refer to games without these properties as “well-behaved”.

We show that any well-behaved augmented EFG can be viewed as an extensive-form representation of some FOSG (Theorem 3.6). To connect the result to the historical “non-augmented” formalization of EFGs, we prove that any well-behaved non-augmented EFG can be obtained by stripping the augmented structure from the extensive-form representation of some FOSG (Theorem 3.10). We then explain why this stripping-away process irrecoverably loses some of the information that is necessary for tracking public information in the game (Section 3.3). For a summary of these relationships, see Figure 1. To build more intuitions about the formalism, the paper includes a graphical illustration of EFGs, FOSGs, and of the latter’s better suitability for decomposition (Figures 3 and 3). We also include a “dictionary” of key concepts from reinforcement learning, EFGs, and FOSGs (Table 1).

Section 4 showcases how the model can be used to transfer results between game theory (studied in EFGs) and multiagent reinforcement learning (studied in POSGs). In particular, Section 4.2 re-states in FOSGs the counterfactual regret minimization algorithm, a workhorse that serves as a crucial building block of the recent milestones in EFGs. The FOSG formalism allows this algorithm to be explained naturally in terms of state and state-action values, making it particularly accessible to the reinforcement learning community. Section 4.3 shows that the FOSG framework naturally fits with the notions of subgames, decomposition, and subgame solving, which are key building blocks of the search-based methods that brought recent breakthroughs in large imperfect-information games. Section 4.4 then shows how to apply sequence form – another key technique developed for EFGs – to FOSGs. Finally, Section 5 reviews the most-related work and presents the main takeaways. The appendix contains the proofs of the presented results.

2 Factored-Observation Stochastic Games

In this section, we describe the factored-observation stochastic game model as a variation on partially-observable stochastic games. Since a key feature of the model is its ability to talk about public information, let us first explain what this concept refers to, why it is important, and why incorporating it into our models is natural, useful, and inexpensive.

2.1 Public Information and Decomposition

A piece of information is said to be common knowledge among a group of agents if all the agents know it, they all know that they know it, they all know that they all know that they know it, and so on [16]. Figuring out which information is common knowledge is often difficult, as it requires putting oneself in the shoes of the other agents and accurately reasoning about their thought processes. (The Wikipedia page on common knowledge has a puzzle that illustrates this well [49].) In contrast, a piece of information is public among a group of agents if each agent received it in a way that trivially reveals that the information is common knowledge. Some actions that create public knowledge are (a) speaking out loud in a group, (b) placing a card face-up, or (c) moving a piece on a game board.

To see why public information is important, first note that the essence of game theory is that playing well requires knowing which actions the other players might take, which tends to involve reasoning about the information they have. Knowing that some information $I$ is common knowledge is therefore immensely useful for decomposition. Indeed, each player will only reason about situations compatible with $I$ (and about others’ reasoning about situations compatible with $I$ , others’ reasoning about reasoning about situations compatible with $I$ , etc.). Situations compatible with $I$ can, therefore, be considered mostly independently of those incompatible with it (the limited dependence is explained in, e.g., [30]). However, finding out which information is common knowledge can be costly, so the best approach will often be to decompose games based on the subset of common knowledge that is public. Once we can decompose problems into smaller pieces, we become able to solve large problems that would otherwise be intractable. Indeed, we have recently seen breakthroughs in a number of problems that were made computationally feasible by decomposition based on public knowledge — examples include poker [30, 6, 7], Hanabi [17, 28], general EFGs [47, 29, 12], Dec-POMDPs [45, 50], and one-sided POSGs [20].

However, the existing game-theoretic and MARL models do not have a built-in way of describing public knowledge. This is a serious problem because figuring out whether a piece of information is public or not might be difficult (or impossible) unless one remembers how the information was obtained. (Indeed, suppose I know that a friend would be happy to pick me up from an airport, and they know I would be glad if they did. This alone says nothing about whether the information is public among us, so unless we talked about this, I should probably take a taxi. For more technical examples, see Section 3.3.) Making use of decomposition thus typically required adding some component on top of the employed formal model. In addition to requiring a non-trivial conceptual and technical effort, this was often done in a domain-specific manner [30, 7], which made the devised methods difficult to generalize. If the generalization were straightforward, the situation in imperfect-information games would by now likely be closer to that in perfect-information games or single-agent RL, where many of the state-of-the-art algorithms are very general [44, 41, 1]. This suggests that using a model which keeps track of public information by default would have significant benefits.

Fortunately, as witnessed by the examples (a), (b), and (c), the information about which knowledge is public is inherently a part of the description of many games and real-world situations. In other words, public knowledge is typically already a part of a problem’s natural definition; the model we propose merely preserves this information while models used in the past discard it.

2.2 Description of the Basic Model

Informally, the model we are about to describe captures a situation where multiple actors take actions – possibly simultaneously – which influence how the world’s state changes. The new world-state might be more or less desirable for individual actors, which is captured by corresponding reward functions. Rather than having full knowledge of the world state, the agents perceive it through observations. These are further “factored”, such that if some information is public, each agent will know that everybody else also has access to it. The various footnotes in this section explain the motivation behind the presented definitions and suggest possible extensions of the model. These might be helpful for some readers, but can be otherwise skipped. In the following paragraph, $\hookrightarrow$ indicates a partial function, defined on a subset of the stated domain.

A factored-observation stochastic game (FOSG) is a tuple $G=\left<\mathcal{N},\mathcal{W},p,w^{\textnormal{init}},\mathcal{A},\mathcal{T},\mathcal{R},\mathcal{O}\right>$, where $\mathcal{N}$ is the player set, $\mathcal{W}$ is the set of world states, $w^{\textnormal{init}}$ is a designated initial state, $p:\mathcal{W}\to 2^{\mathcal{N}}$ is a player function, $\mathcal{A}$ is the space of joint actions, $\mathcal{T}:\mathcal{W}\times\mathcal{A}\hookrightarrow\Delta\mathcal{W}$ is the transition function, $\mathcal{R}:\mathcal{W}\times\mathcal{A}\hookrightarrow\mathbb{R}^{\mathcal{N}}$ is the reward function, $\mathcal{O}:\mathcal{W}\times\mathcal{A}\times\mathcal{W}\hookrightarrow\mathbb{O}$ is the observation function [Footnote 6: While deterministic observations are the norm in the EFG literature and much of MARL, researchers from fields like robotics can use the straightforward stochastic modification of this setting where $\mathcal{O}$ maps into $\Delta\mathbb{O}$ instead. This is equivalent to the standard setting, since one can always emulate stochastic observations by adding a hidden variable into $w$.], and we have:

• $\mathcal{N}=\{1,\dots,N\}$ for some $N\in\mathbb{N}$.

• $\mathcal{W}$ is compact. For formal convenience [Footnote 7: An initial state with no player allows us to both emulate any initial belief over $\mathcal{W}$ and assume that each player’s information state starts with an observation (which essentially states “Game started. Legal actions = $\{\ldots\}$.” and is helpful for model-free agents).], we assume that $p(w^{\textnormal{init}})=\emptyset$.

• $\mathcal{A}=\prod_{i\in\mathcal{N}}\mathcal{A}_{i}$, where each $\mathcal{A}_{i}$ is an arbitrary set of $i$’s actions.

  – For each $i\in p(w)$, $\mathcal{A}_{i}(w)\subset\mathcal{A}_{i}$ denotes a non-empty compact set of $i$’s (legal) actions at $w$. We denote $\mathcal{A}(w):=\prod_{i\in p(w)}\mathcal{A}_{i}(w)$.

  – We denote $\mathcal{A}_{i}(w):=\{noop\}$ for $i\notin p(w)$, which allows us to identify each $a\in\mathcal{A}(w)$ with an element of $\prod_{i\in\mathcal{N}}\mathcal{A}_{i}(w)$ by appending to it the appropriate number of noop actions. [Footnote 8: Recall that noop stands for “no operation”, i.e., an action that makes no changes.]

• The transition probabilities $\mathcal{T}(w,a)\in\Delta\mathcal{W}$, $a\in\mathcal{A}(w)$, are defined for all $w\in\mathcal{W}$ with non-empty $p(w)$ and for some $w$ with no active players.

  – A world state with $p(w)=\emptyset$ and undefined $\mathcal{T}(w,a)$ is called terminal.

• $\mathcal{R}(w,a)=(\mathcal{R}_{i}(w,a))_{i\in\mathcal{N}}$ for each non-terminal state $w$ and $a\in\mathcal{A}(w)$. [Footnote 9: As is common in MDPs, we could equivalently consider rewards that depend on $(w,a,w^{\prime})$, or just on $w^{\prime}$.]

• $\mathcal{O}$ is factored into private observations and public observations as $\mathcal{O}=(\mathcal{O}_{\textnormal{priv}(1)},\dots,\mathcal{O}_{\textnormal{priv}(N)},\mathcal{O}_{\textnormal{pub}})$.

  – $\mathbb{O}=\prod_{i\in\mathcal{N}}\mathbb{O}_{\textnormal{priv}(i)}\times\mathbb{O}_{\textnormal{pub}}$, where $\mathbb{O}_{(\cdot)}$ are arbitrary sets (of possible observations).

  – We assume that $\mathcal{O}_{(\cdot)}(w,a,w^{\prime})\in\mathbb{O}_{(\cdot)}$ is defined for every non-terminal $w$, $a\in\mathcal{A}(w)$, and $w^{\prime}$ from the support of $\mathcal{T}(w,a)$. [Footnote 10: For finite $\mathcal{W}$, being in the support of $\mathcal{T}(w,a)$ is equivalent to having a non-zero probability.]

The game proceeds as follows: It starts in the initial state $w^{\textnormal{init}}$. In each state, each active player $i\in p(w)$ learns which actions are legal for them (either by deduction or by being explicitly told) and selects $a_{i}\in\mathcal{A}_{i}(w)$. (While the player obviously knows which action they took, they often will not know the actions of the other players, or possibly not even which of them were active at $w$.) The game then transitions to a new state $w^{\prime}$, drawn from the distribution $\mathcal{T}(w,a)$ that corresponds to the joint action $a=(a_{i})_{i\in p(w)}$. This generates the observation $\mathcal{O}(w,a,w^{\prime})$, from which each player receives $\mathcal{O}_{i}(w,a,w^{\prime}):=\left(\mathcal{O}_{\textnormal{priv}(i)}(w,a,w^{\prime}),{\mathcal{O}}_{\textnormal{pub}}(w,a,w^{\prime})\right)\in\mathbb{O}_{i}$ (i.e., the public observation together with their private observation, in a manner that allows distinguishing between the two). [Footnote 11: There are games – such as Hanabi – where information might be public among a subset of players. Such decomposition can be modelled by factoring $\mathcal{O}$ into $(\mathcal{O}_{P})_{P\subset\mathcal{N}}$, with each player receiving $\mathcal{O}_{i}=((P,\mathcal{O}_{P}))_{i\in P\subset\mathcal{N}}$. The default FOSG model is a special case of this generalization, with $\mathcal{O}_{P}$ being empty for every $P$ other than either $P=\mathcal{N}$ or $P=\{i\}$.] Finally, each player is assigned the reward $\mathcal{R}_{i}(w,a)$. However, a player might not know how much reward they received, unless this is – explicitly or implicitly – a part of $\mathcal{O}_{i}$. [Footnote 12: For example, a thief might steal a wallet without immediately inspecting its contents, or a robot might be optimizing for my preferences while having uncertainty over them [40]. Unobservable rewards also remove incentives to “hide from bad news” or “wirehead” [15].] This process repeats until a terminal state is reached. The goal of each player is to maximize the sum of rewards $\mathcal{R}_{i}(w,a)$ obtained during the game. Finally, Remark 2.1 describes an important extension of the basic FOSG model:

Remark 2.1 (Chance player).

In many domains, it is useful to view the stochasticity in the game as being caused by a chance player $c$ (also called nature). The chance player receives observations and takes actions like any other player, but their policy (formally defined in Section 2.4) is fixed and publicly known. The addition of $c$ causes no loss of generality since their actions can always be merged back into the transition function $\mathcal{T}$ .
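The last claim of Remark 2.1 can be made concrete with a small Python sketch. It assumes, purely for illustration, that the chance policy depends only on the world state and that the successor state is deterministic once the chance action is known; the names `merge_chance`, `chance_policy`, and `successor` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from fractions import Fraction


def merge_chance(chance_policy, successor):
    """Fold a fixed chance policy into a transition function T(w, a).

    chance_policy(w)   -> {chance_action: probability}  (assumed signature)
    successor(w, a, c) -> next world state, deterministic given chance action c
    """
    def transition(w, a):
        dist = defaultdict(Fraction)  # Fraction() equals 0
        for c, prob in chance_policy(w).items():
            # Several chance actions may lead to the same world state.
            dist[successor(w, a, c)] += prob
        return dict(dist)
    return transition


# Toy example: chance rolls a fair die, but the world state records only parity.
def die_policy(w):
    return {face: Fraction(1, 6) for face in range(1, 7)}


def parity_successor(w, a, face):
    return "even" if face % 2 == 0 else "odd"


T = merge_chance(die_policy, parity_successor)
print(T("start", ()))
```

After merging, the six chance actions are no longer visible in the model: only the induced distribution over world states remains, which is all that $\mathcal{T}$ needs to specify.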

We illustrate the above definitions on the example of Kuhn poker:

Example 2.2 (Kuhn poker as a FOSG).

Kuhn poker is a form of poker where the deck includes only three cards – Jack, Queen, and King. First, each player places one chip into the pot as the initial “forced” bet (ante). Each player is then privately dealt one card (the last card isn’t revealed). This is followed by a betting phase (explained below). The game ends either when one player folds (forfeiting all bets made so far to their opponent) or there is a showdown, where the private cards are revealed and the higher card’s owner receives the bets. At the start of the betting, player one can either check (bet no chips) or bet (one chip). If they check, player two can also check — leading to a showdown — or bet. If one of the players bets, their opponent must either call (betting one chip to match the opponent’s bet), followed by a showdown, or fold.

One way of modelling the game [Footnote 13: There will typically be multiple ways of modelling a game. For example, we could omit the (strategically trivial) ante actions, deal both private cards at once, or end the game immediately after the call actions. Such a more compact model is depicted in Figure 3.] as a FOSG is to view world states as tuples $w=(C_{1},C_{2},\textnormal{chips}_{1},\textnormal{chips}_{2},\textnormal{pl})$, where $C_{i}\in\{\emptyset\}\cup\textnormal{Deck}$, $\textnormal{Deck}:=\{J,Q,K\}$, $\textnormal{chips}_{i}\in\{0,1,2\}$, and $\textnormal{pl}\in\{\emptyset,1,2,\textnormal{c}\}$ correspond to the private cards, remaining chips, and the currently-active player. Ignoring the noop actions for brevity and denoting $w^{\prime}=(\dots,\,\textnormal{chips}^{\prime}_{i},\,\dots)$, we set $p(w):=\textnormal{pl}$, $\mathcal{A}_{i}:=\{\textnormal{ante, check, bet, call, fold}\}$, and $\mathcal{R}_{i}(w,a,w^{\prime}):=\textnormal{chips}^{\prime}_{i}-\textnormal{chips}_{i}$. Using $C$, $?\to i$, and $(C_{1},C_{2})$ as formal shorthands for “your private card is $C$”, “unknown card dealt to player $i$”, and “showdown: the private cards were $C_{1}$ and $C_{2}$”, the observation spaces can be modelled as $\mathbb{O}_{\textnormal{priv}(i)}:=\{\emptyset\}\cup\textnormal{Deck}$ and $\mathbb{O}_{\textnormal{pub}}:=\mathcal{A}_{1}\cup\{?\to 1,?\to 2\}\cup\textnormal{Deck}^{2}$. To illustrate the formal definitions of $\mathcal{A}_{i}(w)$, $\mathcal{T}$, and $\mathcal{O}$, we look at one trajectory in detail. The game always starts in $w^{0}=w^{\textnormal{init}}=(\emptyset,\emptyset,2,2,1)$ with $\mathcal{A}_{1}(w^{\textnormal{init}})=\{\textnormal{ante}\}$, from which it transitions to $w^{1}=(\emptyset,\emptyset,1,2,2)$. (Formally, $\mathcal{T}(w^{0},\textnormal{ante},w)$ is $1$ when $w=w^{1}$ and $0$ otherwise.) In $w^{1}$, we have $\mathcal{A}_{2}(w^{1})=\{\textnormal{ante}\}$ and the game transitions to $w^{2}=(\emptyset,\emptyset,1,1,\textnormal{c})$. Private cards are dealt next, via a uniformly random transition; first to $w^{3}=(C_{1},\emptyset,1,1,\textnormal{c})$, $C_{1}\in\textnormal{Deck}$, and then to $w^{4}=(C_{1},C_{2},1,1,1)$, $C_{2}\in\textnormal{Deck}\setminus\{C_{1}\}$. Suppose that $C_{1}=Q$ and $C_{2}=K$. We have $\mathcal{A}_{1}(w^{4})=\{\textnormal{check, bet}\}$ and if player one bets, the game transitions to $w^{5}=(Q,K,0,1,2)$, $\mathcal{A}_{2}(w^{5})=\{\textnormal{call, fold}\}$. If player two responds by calling, the game transitions first to $w^{6}=(Q,K,0,0,\textnormal{c})$ and then to the terminal state $w^{7}=(Q,K,0,4,\emptyset)$ (since $K$ is higher than $Q$). The corresponding observations $\mathcal{O}_{1}(w^{k-1},\,\cdot\,,w^{k})$, $k=1,\dots,7$, are:

$(\emptyset,\textnormal{ante})$
$(\emptyset,\textnormal{ante})$
$(Q,?\to 1)$
$(\emptyset,?\to 2)$
$(\emptyset,\textnormal{bet})$
$(\emptyset,\textnormal{call})$
$(\emptyset,(Q,K))$ .
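The trajectory of Example 2.2 can be replayed with a short Python sketch. It is only an illustration under stated assumptions: it covers just the actions that occur on the ante, deal, bet, call, showdown line (check and fold are omitted), it treats chance as a player whose "action" is the dealt card (in the spirit of Remark 2.1), it writes $\emptyset$ as `None` and $?\to i$ as the string `"?->i"`, and the names `step` and `play` are hypothetical.

```python
RANK = {"J": 0, "Q": 1, "K": 2}


def step(w, action):
    """One transition of the Kuhn poker FOSG along the bet/call line.

    w = (C1, C2, chips1, chips2, pl). `action` is the active player's action,
    or the dealt card when chance ("c") is active (None at the showdown).
    Returns (w_next, (private_obs_1, private_obs_2), public_obs).
    """
    c1, c2, chips1, chips2, pl = w
    no_private = (None, None)
    if action == "ante" and pl == 1:
        return (c1, c2, chips1 - 1, chips2, 2), no_private, "ante"
    if action == "ante" and pl == 2:
        return (c1, c2, chips1, chips2 - 1, "c"), no_private, "ante"
    if pl == "c" and c1 is None:  # deal player one's card
        return (action, c2, chips1, chips2, "c"), (action, None), "?->1"
    if pl == "c" and c2 is None:  # deal player two's card
        return (c1, action, chips1, chips2, 1), (None, action), "?->2"
    if action == "bet" and pl == 1:
        return (c1, c2, chips1 - 1, chips2, 2), no_private, "bet"
    if action == "call" and pl == 2:
        return (c1, c2, chips1, chips2 - 1, "c"), no_private, "call"
    if pl == "c":  # showdown: the higher card takes the pot
        pot = 4 - chips1 - chips2
        if RANK[c1] > RANK[c2]:
            chips1 += pot
        else:
            chips2 += pot
        return (c1, c2, chips1, chips2, None), no_private, (c1, c2)
    raise ValueError(f"transition not covered by this sketch: {w}, {action}")


def play(w, actions):
    """Replay `actions` from w; collect player one's observations and rewards."""
    observations_1, rewards = [], [0, 0]
    for action in actions:
        w_next, private, public = step(w, action)
        observations_1.append((private[0], public))  # O_1 = (private, public)
        rewards[0] += w_next[2] - w[2]  # R_i = chips'_i - chips_i
        rewards[1] += w_next[3] - w[3]
        w = w_next
    return w, observations_1, tuple(rewards)


W_INIT = (None, None, 2, 2, 1)
ACTIONS = ["ante", "ante", "Q", "K", "bet", "call", None]
final_state, observations_1, total_rewards = play(W_INIT, ACTIONS)
```

Under these assumptions, `observations_1` reproduces the seven observations listed above, and the cumulative rewards are the chip differences between $w^{0}$ and $w^{7}$.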

2.3 FOSGs as an Extension of POSGs

There are many formalizations and variants of POSGs in the literature [39, 18, 14, 20]. However, they all behave similarly to the following definition:

Definition 2.3 (POSG).

A partially-observable stochastic game is a FOSG in which $\mathcal{O}_{\textnormal{pub}}$ is constantly equal to $\emptyset$ and $p(w)=\mathcal{N}$ for all $w^{\textnormal{init}}\neq w\in\mathcal{W}$ .

For the purposes of this text, we can thus view POSGs as a special case of FOSGs (i.e., those in which the player function and all public observations are trivial). In particular, this shows that FOSGs can be viewed as a minor extension of POSGs.

The relationship is even stronger since forgetting the factorization of observations yields a POSG that is strategically equivalent to the original FOSG (i.e., each player has the same information as before, makes the same choices, and these lead to the same outcomes). Indeed, for a FOSG $G=\left<\mathcal{N},\mathcal{W},p,w^{\textnormal{init}},\mathcal{A},\mathcal{T},\mathcal{R},\mathcal{O}\right>$ we can denote by $\texttt{ForgetFactorization}(G)$ the FOSG $\left<\mathcal{N},\mathcal{W},p^{\prime},w^{\textnormal{init}},\mathcal{A},\mathcal{T},\mathcal{R},\mathcal{O}^{\prime}\right>$ , where $p^{\prime}(w):=\mathcal{N}$ for $w\neq w^{\textnormal{init}}$ , $\mathcal{O}^{\prime}_{\textnormal{priv}(i)}:=\mathcal{O}_{i}$ , and $\mathcal{O}^{\prime}_{\textnormal{pub}}(\cdot)=\emptyset$ . We immediately get the following:

Proposition 2.4.

For every FOSG $G$ , $G^{\prime}=\texttt{ForgetFactorization}(G)$ is a partially-observable stochastic game.
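The construction of $\texttt{ForgetFactorization}$ can be illustrated by a minimal Python sketch. The `FOSG` container, the `EMPTY` marker, and the toy game are hypothetical and hold only the components that the construction changes. The sketch also assumes that $\mathcal{O}_{i}$ is the pair of private and public observations, and that $p^{\prime}(w^{\textnormal{init}})$ keeps its original value, which the text above does not specify.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, Hashable

EMPTY = ()  # stand-in for the empty observation


@dataclass(frozen=True)
class FOSG:
    """Hypothetical container with only the parts ForgetFactorization touches."""
    players: FrozenSet[int]
    w_init: Hashable
    player_fn: Callable   # p(w) -> frozenset of acting players
    obs_priv: Callable    # (i, w, a, w_next) -> private observation of i
    obs_pub: Callable     # (w, a, w_next) -> public observation


def forget_factorization(g: FOSG) -> FOSG:
    def player_fn(w):
        # p'(w) := N for w != w_init; p'(w_init) is left unchanged (assumption).
        return g.player_fn(w) if w == g.w_init else g.players

    def obs_priv(i, w, a, w_next):
        # O'_priv(i) := O_i, taken here to be the pair (private, public).
        return (g.obs_priv(i, w, a, w_next), g.obs_pub(w, a, w_next))

    def obs_pub(w, a, w_next):
        # The public observation becomes constantly empty.
        return EMPTY

    return replace(g, player_fn=player_fn, obs_priv=obs_priv, obs_pub=obs_pub)


# Toy game: the acting player is always player 1 and the action is public.
toy = FOSG(
    players=frozenset({1, 2}),
    w_init="w0",
    player_fn=lambda w: frozenset({1}),
    obs_priv=lambda i, w, a, w_next: f"card seen by {i}",
    obs_pub=lambda w, a, w_next: a,
)
posg = forget_factorization(toy)
```

In the result, each player still receives everything they received before, but only through the private channel, so nothing in the model marks which part of it is common to all players.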

Nevertheless, the resulting POSG will be different in one potentially important aspect: to reason about public information, we will first need to recover the factorization that we chose to forget in the first place. This implies that the POSG and FOSG frameworks are formally equivalent, and any research done in one will transfer to the other. However, the latter will be more convenient for methods which can take advantage of decomposition.

2.4 Histories, Information States, and Public States

With access to the underlying FOSG model, we can consider several derived objects which allow for its more nuanced analysis.

To describe the real state of affairs in $G$, the basic concept is that of a trajectory, i.e., a finite sequence $\tau=w^{0}a^{0}w^{1}a^{1}\dots w^{t}\in(\mathcal{W}\mathcal{A})^{*}\mathcal{W}$ for which $a^{k}\in\mathcal{A}(w^{k})$ and $w^{k+1}\in\textnormal{supp}\,\mathcal{T}(w^{k},a^{k})$, together with $i$’s cumulative reward $\mathcal{R}_{i}(\tau):=\sum_{k=0}^{t-1}\mathcal{R}_{i}(w^{k},a^{k})$ along $\tau$. By $\mathcal{H}$ we denote the set of all histories in $G$, i.e., all trajectories that start in the initial state $w^{\textnormal{init}}$. Using the symbol $\sqsupset$ to denote the fact that one sequence extends another, we endow $\mathcal{H}$ with the structure of a tree. Using the set of terminal histories $\mathcal{Z}:=\{z\in\mathcal{H}\mid z\textnormal{ is terminal}\}$, we define $i$’s utility function $u_{i}:\mathcal{Z}\to\mathbb{R}$ as $u_{i}(z):=\mathcal{R}_{i}(z)$. (Footnote 14: Since the last state in each $\tau$ is uniquely defined, the notation for world-states can be overloaded to work with trajectories and histories as well. That is, $h=w^{0}\ldots w^{k}$ is said to be terminal if $w^{k}$ is, we set $\mathcal{A}(h):=\mathcal{A}(w^{k})$, etc.)

To describe the point of view of an individual agent $i$, we talk about information states $s_{i}$. To simplify the notation, we assume that $i$ can always deduce which actions are legal from other information available to them. Initially, each player’s information state is an empty sequence $s_{i}=\emptyset$. When $i$ receives an observation $O_{i}=\mathcal{O}_{i}(w,a,w^{\prime})$, resp. takes a non-noop action $a_{i}$, $s_{i}$ changes to $s^{\prime}_{i}=s_{i}O_{i}$, resp. $s^{\prime}_{i}=s_{i}a_{i}$. In light of this definition, information states can also be called action-observation histories. Each history corresponds to some information state $s_{i}(h)$ that ends with an observation and, by our assumption, we have $\mathcal{A}_{i}(h)=\mathcal{A}_{i}(g)=:\mathcal{A}_{i}(s_{i})$ whenever $s_{i}(h)=s_{i}(g)=s_{i}$. When defining FOSGs, we have assumed that whenever the world-state changes, all players receive some observation (i.e., $\mathcal{O}_{i}(w,a,w^{\prime})$ is defined for all legal transitions). This is equivalent to allowing the players to write even noop actions into their $s_{i}$-s and enables them to always deduce the number of transitions that have occurred so far. This assumption has non-trivial (but mostly positive) consequences, as we will see in Theorem 3.10. (Footnote 15: Arguably, it would be more natural to allow observation functions which are only defined on a subset of legal transitions and assume that players for whom $\mathcal{O}_{i}(w,a,w^{\prime})$ is undefined are oblivious to the fact that the transition has happened. Unfortunately, this makes all EFG-related concepts more complicated (by introducing “thick infosets” [42] and “non-timeability” [21]), so we have decided to not include this option in the default FOSG model.)

The information-state tree of $i$ is the set $\mathcal{S}_{i}$ of all $s_{i}$-s that might arise in $G$. The behavioural strategy (also called policy) of $i$ is a mapping $\pi_{i}:\mathcal{S}_{i}\hookrightarrow\Delta\mathcal{A}_{i}$ s.t. $\pi_{i}(s_{i})\in\Delta\mathcal{A}_{i}(s_{i})$ for every $s_{i}$ where $i$ is supposed to submit an action. Any strategy profile $\pi=(\pi_{i})_{i\in\mathcal{N}}$ induces a probability distribution over terminal histories. We will henceforth assume that the game is guaranteed to always end up in a terminal state, which allows us to define the expected utility $u_{i}(\pi):=\mathbf{E}_{z\sim\pi}\,u_{i}(z)$ and talk about various solution concepts in FOSGs (such as Nash equilibria, in which no player can increase their expected utility by deviating unilaterally from the current strategy profile). (Footnote 16: Termination can be ensured by assuming that $\mathcal{W}$ is finite and $\mathcal{T}$ admits no loops, or that $\mathcal{T}$ makes infinite loops or chains vanishingly unlikely, or by enforcing a finite horizon $T$.) (Footnote 17: Note that the computational requirements of many algorithms (e.g., CFR [52]) will scale with the size of $\mathcal{S}_{i}$, which can be exponentially larger than the size of $\mathcal{W}$. Indeed, many algorithms reason about policies and each policy $\pi_{i}$ specifies a probability distribution over $\mathcal{A}_{i}$ for every information state in $\mathcal{S}_{i}$. To see that $|\mathcal{S}_{i}|$ is potentially much larger than $|\mathcal{W}|$, note that (a) trajectories in $G$ can visit states multiple times, so there are potentially (exponentially) more trajectories than world states and (b) depending on how much information about $G$ a player has, they can have up to as many infostates as there are trajectories.)

To describe the point of view of an “external observer”, it is useful to consider the notion of public states. Formally, a public state $s_{\textnormal{pub}}=s_{\textnormal{pub}}(h)$ corresponding to a history $h$ is the sequence of public observations received along $h$. By $\mathcal{S}_{\textnormal{pub}}$, we denote the public tree which consists of all public states that can arise in $G$.

Note that since each player has their observations factored into $\mathcal{O}_{\textnormal{priv}(i)}$ and $\mathcal{O}_{\textnormal{pub}}$ , we can equally-well recover $s_{\textnormal{pub}}$ from any $s_{i}=s_{i}(h)$ . And while this fact might seem essentially trivial, it makes for a very powerful decomposition tool in FOSGs. Indeed, if the current public state is $s_{\textnormal{pub}}$ , each player knows that all other players can condition their strategy on $s_{\textnormal{pub}}$ , so each player can restrict their reasoning to situations that are compatible with $s_{\textnormal{pub}}$ . Decomposition in FOSGs is further discussed in Section 4.3.

For example, the first two transitions described in Example 2.2 generate the history $h=\left((\emptyset,\emptyset,2,2,1),\textnormal{ante},(\emptyset,\emptyset,1,2,2),\textnormal{ante},(\emptyset,\emptyset,1,1,\textnormal{c})\right)$. As this is unwieldy for long histories, we can identify each such $h^{\prime}$ with a shorter sequence of events that uniquely identifies it. For example, the history

[TABLE]

(where $w^{k}$ are as in Example 2.2) could be more succinctly written as $h^{\prime}=(Q,K,\textnormal{bet})$ . The corresponding infostate and public state are

[TABLE]

and $s_{\textnormal{pub}}=(\textnormal{ante},\textnormal{ante},?\to 1,?\to 2,\textnormal{bet})$, which can be identified with just $(Q,\textnormal{bet})$ and $(\textnormal{bet})$. (Footnote 18: Note that omitting the observation $?\to 2$ before taking the bet action would make indistinguishable the situations where player two already has, resp. doesn’t have, their private card.) For a visual illustration, see Figure 3.
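As a minimal sketch of how information states and public states are built from factored observations, the following Python snippet replays player one's view of the Kuhn poker trajectory from Example 2.2 up to $w^{5}$. The helper names (`info_state`, `public_state`), the `NOOP` and `EMPTY` markers, and the tagging of entries as `"act"` or `"obs"` are illustrative assumptions, not part of the formal model.

```python
NOOP = object()   # marker: the player did not act on this transition
EMPTY = None      # the empty private observation


def info_state(steps):
    """Build s_i from (own_action, (private_obs, public_obs)) pairs, one per transition."""
    s = []
    for own_action, observation in steps:
        if own_action is not NOOP:      # noop actions are not recorded
            s.append(("act", own_action))
        s.append(("obs", observation))  # every transition yields an observation
    return tuple(s)


def public_state(s_i):
    """Recover s_pub from s_i by keeping the public part of every observation."""
    return tuple(obs[1] for kind, obs in s_i if kind == "obs")


# Player one's view of w^0 -> ... -> w^5 (C_1 = Q, C_2 = K, player one bets).
steps_p1 = [
    ("ante", (EMPTY, "ante")),   # player one antes
    (NOOP, (EMPTY, "ante")),     # player two antes
    (NOOP, ("Q", "?->1")),       # player one is dealt Q
    (NOOP, (EMPTY, "?->2")),     # player two is dealt a hidden card
    ("bet", (EMPTY, "bet")),     # player one bets
]
s_1 = info_state(steps_p1)
s_pub = public_state(s_1)
```

Here `s_pub` is the tuple `("ante", "ante", "?->1", "?->2", "bet")`, and it is computed from `s_1` alone, which is the property that makes the public state usable by every player for decomposition.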

3 Extensive Form Representation of FOSGs

In this section, we describe and analyze extensive-form representations of FOSGs. Informally, such representations are constructed by “unrolling” the possibly-cyclic space $\mathcal{W}$ into the tree-structured set $\mathcal{H}$ . This makes them less compact than the default model but, as demonstrated in Section 4, more suitable for various modern algorithms.

In Section 3.1, we start with an arbitrary FOSG and construct its extensive-form representation. This construction does not assume prior knowledge of EFGs; in fact, one goal of this section is to show that even if EFGs did not yet exist, they would naturally arise as an object derived from FOSGs. (Readers familiar with the EFG literature will notice that the resulting structure is subtly yet crucially different from the historical formalization of EFGs. However, rather than “introducing yet another new model”, the proposed definition essentially amounts to gathering in one place several modifications that were already present in several existing publications.) We also show that EFGs can be considered even in the absence of the underlying FOSG and prove that while this might introduce certain complications, this never happens with EFGs derived from FOSGs. In Section 3.2, we compare the proposed formalization of EFGs with the historical one. Formally, the two definitions are near-identical, except that the latter lacks the means to describe public information and information available to players outside of their turn. The upshot is that if we start with extensive-form representations of FOSGs and throw the given information away, we obtain precisely all timeable perfect-recall EFGs. (Both of these limitations are desirable: while we explain that imperfect recall can be added back if needed, [21] argues that non-timeable games are pathological and of no practical interest.) In Section 3.3, we demonstrate why the information thrown away by classical EFGs is fundamentally irrecoverable and explain some of the resulting drawbacks.

3.1 Extensive Form Representation of FOSGs

In this section, we project the structure of information- and public-states onto the history tree $\mathcal{H}$, which allows us to discard the original FOSG and only focus on its tree representation. However, before doing so, we show that every FOSG has a version in which only one player acts at any given time (Footnote 19: Technically speaking, this step isn’t necessary for any of the theory. However, it seems useful purely on the grounds of making the resulting representations more similar to classical EFGs.):

Definition 3.1 (Serial FOSG).

A FOSG $G$ is said to be serial if for each $w\in\mathcal{W}$ , either $p(w)=\{i\}$ and all transitions $\mathcal{T}(w,a)$ are deterministic or $p(w)=\emptyset$ . A non-terminal $w$ with $p(w)=\emptyset$ is called a chance node.
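The condition in Definition 3.1 can be checked mechanically for a finite game. The sketch below is only an illustration: the function `is_serial`, its argument conventions (plain callables and dictionaries), and the toy game are assumptions made for this example.

```python
def is_serial(world_states, player_fn, legal_actions, transition):
    """Check Definition 3.1 on a finite FOSG given by plain Python callables.

    player_fn(w)      -> set of acting players p(w)
    legal_actions(w)  -> iterable of actions available in w
    transition(w, a)  -> dict mapping next world-states to probabilities
    """
    for w in world_states:
        acting = player_fn(w)
        if len(acting) == 0:        # chance node or terminal state
            continue
        if len(acting) > 1:         # simultaneous move: not serial
            return False
        for a in legal_actions(w):
            support = [w2 for w2, pr in transition(w, a).items() if pr > 0]
            if len(support) != 1:   # a player's move must be deterministic
                return False
    return True


# Toy game: player 1 moves deterministically, then chance decides what happens.
states = ["w0", "c", "end"]
actions = {"w0": ["a"], "c": ["deal"], "end": []}
trans = {("w0", "a"): {"c": 1.0}, ("c", "deal"): {"end": 0.5, "w0": 0.5}}
p_serial = {"w0": {1}, "c": set(), "end": set()}
p_simultaneous = {"w0": {1, 2}, "c": set(), "end": set()}

serial_ok = is_serial(states, p_serial.get, actions.get, lambda w, a: trans[(w, a)])
serial_bad = is_serial(states, p_simultaneous.get, actions.get, lambda w, a: trans[(w, a)])
```

The first call returns `True`, while the second returns `False` because two players act in `w0`.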

The serialized version of $G$ , denoted $G^{\prime}:=\texttt{Serialize}(G)$ , can be constructed by having all active players select actions one by one — formally transitioning to a new state, but not generating any non-trivial rewards or observations — and then adding a chance node which determines the next world-state and generates rewards and observations. The construction is straightforward, results in a strategically equivalent FOSG, and can be viewed as canonical (up to the order of players). It is formally described in the appendix (Example B.1).

Lemma 3.2.

For every FOSG $G$ , the FOSG $G^{\prime}=\texttt{Serialize}(G)$ is serial.

Information in FOSGs is primarily expressed in terms of information states which “live in the heads of agents”. However, information can also be modelled using information sets which “live” in the history tree $\mathcal{H}$ :

Definition 3.3 (Information sets).

Let $s_{i}\in\mathcal{S}_{i}$ be an information state in $G$. An information set (infoset) corresponding to $s_{i}$ is defined as $I_{i}(s_{i}):=\{h\in\mathcal{H}\mid s_{i}(h)=s_{i}\}$. The collection $\mathcal{I}_{i}$ of all $i$’s non-empty infosets is called $i$’s information partition. (Footnote 20: There are information states of $i$, namely those that end in an action, that are not associated with any particular history, and thus correspond to an empty information set.)

For example, in Kuhn poker, the infostate $s_{1}(Q,?\to 2,\textnormal{bet})$ corresponds to the infoset $I_{1}=\{(Q,J,\textnormal{bet}),(Q,K,\textnormal{bet})\}$.

Since the space of information-states $\mathcal{S}_{i}$ is endowed with the tree structure given by the extension relation $s_{i}\sqsubset t_{i}$, we can likewise turn $\mathcal{I}_{i}$ into a tree by saying that an infoset $J\in\mathcal{I}_{i}$ is an extension of $I\in\mathcal{I}_{i}$ if we have $J=I_{i}(t_{i})$ and $I=I_{i}(s_{i})$ for some $s_{i}\sqsubset t_{i}$. (Footnote 21: Since $i$ is assumed to have perfect recall, this is equivalent to each history from $J$ being an extension of some history from $I$, which is further equivalent to the existence of a single history from $J$ that is an extension of some history from $I$.) The tree of public states can be defined analogously:

Definition 3.4 (Public sets).

Let $s_{\textnormal{pub}}\in\mathcal{S}_{\textnormal{pub}}$ be a public state in $G$. A public set corresponding to $s_{\textnormal{pub}}$ is defined as $I_{\textnormal{pub}}(s_{\textnormal{pub}}):=\{h\in\mathcal{H}\mid s_{\textnormal{pub}}(h)=s_{\textnormal{pub}}\}$. The collection $\mathcal{I}_{\textnormal{pub}}$ of all non-empty public sets is called the public partition.
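Definitions 3.3 and 3.4 both group histories by the state they induce, which can be sketched in a few lines of Python. The example below uses Kuhn poker histories in the compact notation $(C_{1},C_{2},\textnormal{action})$ and only covers the histories right after player one's first action. The helper names `s_1`, `s_pub`, and `partition` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import permutations

DECK = ("J", "Q", "K")

# Compact histories (C_1, C_2, action) right after player one's first action.
histories = [
    (c1, c2, a) for c1, c2 in permutations(DECK, 2) for a in ("check", "bet")
]


def s_1(h):
    """Player one's infostate in compact form: own card plus the public actions."""
    return (h[0],) + h[2:]


def s_pub(h):
    """Public state in compact form: the public actions only."""
    return h[2:]


def partition(histories, state_of):
    """Group histories by induced state, i.e. I(s) = {h | state_of(h) = s}."""
    sets = defaultdict(set)
    for h in histories:
        sets[state_of(h)].add(h)
    return dict(sets)


infosets_1 = partition(histories, s_1)
public_sets = partition(histories, s_pub)
```

With these definitions, `infosets_1[("Q", "bet")]` contains exactly the two histories `("Q", "J", "bet")` and `("Q", "K", "bet")`, while `public_sets[("bet",)]` contains all six deals followed by a bet.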

Extensive form games are obtained by abstracting away the original FOSG and only looking at the corresponding trees of histories, infosets, and public sets:

Definition 3.5 (EFG).

An (augmented) extensive form game is a tuple $E=\left<\mathcal{N},\mathcal{A},\mathcal{H},p,\pi_{c},\mathcal{I},u\right>$ for which the conditions listed below hold. (Footnote 22: Strictly speaking, this definition is novel. However, explicit attempts to augment $\mathcal{I}_{i}$-s to the full $\mathcal{H}$ already appear in [10]. The first definition of public sets in the CFR literature comes from [22].)

• $\mathcal{N}=\{1,\dots,N\}$ for some $N\in\mathbb{N}$,

• $\mathcal{H}$ is a tree on $\mathcal{A}$ (Footnote 23: Recall that a “tree on $X$” refers to a subset of $X^{*}$ that is closed under initial segments.),

• $\mathcal{A}$ and all sets $\mathcal{A}(h):=\{a\in\mathcal{A}\mid ha\in\mathcal{H}\}$, for $h\in\mathcal{H}$, are compact,

• $p:\mathcal{H}\setminus\mathcal{Z}\to\mathcal{N}\cup\{c\}$ (where $\mathcal{Z}$ denotes the leaves of $\mathcal{H}$),

• $\pi_{c}(h)\in\Delta\mathcal{A}(h)$ for $p(h)=c$,

• $u:\mathcal{Z}\to\mathbb{R}^{\mathcal{N}}$, and

• $\mathcal{I}=(\mathcal{I}_{1},\dots,\mathcal{I}_{N},\mathcal{I}_{\textnormal{pub}})$ is a collection of partitions of $\mathcal{H}$, where each $\mathcal{I}_{i}$

– is a refinement of $\mathcal{I}_{\textnormal{pub}}$ (Footnote 24: Recall that $\mathcal{P}$ is a refinement of $\mathcal{Q}$ if each $P\in\mathcal{P}$ is a subset of exactly one $Q\in\mathcal{Q}$.) and

– provides enough information to identify $i$’s legal actions. (Footnote 25: That is, for every $I\in\mathcal{I}_{i}$, $p(h)$ is either equal to $i$ for all $h\in I$ or for no $h\in I$. If for all, then $\mathcal{A}(h)$ doesn’t depend on the choice of $h\in I$.)

To avoid various pathologies, we might wish to additionally require perfect recall and “no thick infosets”: $E$ is with perfect recall if for each $g,h\in I\in\mathcal{I}_{i}$, $i$’s action-infoset histories corresponding to $g$ and $h$ coincide. (Footnote 26: Here, $i$’s action-infoset history corresponding to $h$ is the sequence of infosets encountered by $i$ and actions taken by $i$ on the way to $h$.) $E$ is said to not have thick public sets if no element of $\mathcal{I}_{\textnormal{pub}}$ (and hence of $\mathcal{I}_{i}$) contains both some $h$ and its strict extension.
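Two of the conditions above, the refinement requirement from Definition 3.5 and the absence of thick public sets, are simple set-theoretic checks. A minimal Python sketch follows, assuming that histories are tuples of actions and partitions are lists of sets; the function names and the toy tree are hypothetical.

```python
def is_refinement(fine, coarse):
    """True if every set in `fine` is a subset of exactly one set in `coarse`."""
    return all(sum(1 for Q in coarse if P <= Q) == 1 for P in fine)


def is_strict_prefix(g, h):
    """True if history h is a strict extension of history g."""
    return len(g) < len(h) and h[:len(g)] == g


def has_thick_set(partition):
    """True if some set contains both a history and one of its strict extensions."""
    return any(
        is_strict_prefix(g, h)
        for block in partition
        for g in block
        for h in block
    )


# Toy tree with histories (), ("a",) and ("a", "b").
pub_ok = [{()}, {("a",)}, {("a", "b")}]
pub_thick = [{(), ("a",)}, {("a", "b")}]   # root and its child share a public set
infosets = [{()}, {("a",)}, {("a", "b")}]
```

Both `pub_ok` and `pub_thick` are refined by `infosets`, but only `pub_thick` is flagged by `has_thick_set`, since it puts the root and its extension `("a",)` into the same public set.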

Moreover, any FOSG $G$ canonically (up to the order of players and some formal details) corresponds to some extensive-form representation $\texttt{ExtForm}\left(G\right)$. Informally speaking, $\texttt{ExtForm}\left(G\right)$ is obtained by serializing $G$, only keeping the objects that appear in Definition 3.5, and formally modifying them until it fits the definition. (The formal construction is described in the proof of Theorem 3.6.) Not only does this construction indeed produce an augmented EFG, we even get the two convenient assumptions for free:

Theorem 3.6 (FOSGs as (augmented) EFGs).

Every FOSG $G$ corresponds to an (augmented) EFG $E=\texttt{ExtForm}\left(G\right)$ with perfect-recall and no thick public sets. Moreover, any perfect-recall augmented EFG without thick public sets can be obtained this way.

Note that Theorem 3.6 does not imply that FOSGs are unsuitable for studying imperfect recall abstractions [11, 24]. Indeed, imperfect recall can be introduced either (1) in the underlying FOSG, as a property of a specific agent, or (2) as an abstraction of the EFG representation derived from the underlying FOSG. In particular, the second option essentially makes all prior EFG research on imperfect recall applicable to our setting as well.

3.2 Comparison of FOSGs and Classical EFGs

We are now ready to make a comparison between FOSGs and EFGs. We will show that any “well-behaved” EFG can be obtained by starting with an extensive-form representation of some FOSG and throwing away all information about what players know outside of their turn (and hence also about public knowledge in the game). We will also give several examples which show that once lost, the public information is very difficult – and sometimes even impossible – to put back. We start by describing classical EFGs and explaining what we mean when referring to “well-behaved” EFGs.

In light of the augmented definition of EFGs, classical EFGs can be defined as follows:

Definition 3.7 (Historical definition of EFGs).

A classical extensive-form game $E^{\prime}$ is a tuple $\left<\mathcal{N},\mathcal{A},\mathcal{H},p,\pi_{c},\mathcal{I}^{\prime},u\right>$ , where all the objects are as in Definition 3.5, except that $\mathcal{I}^{\prime}$ is of the form $\mathcal{I}^{\prime}=(\mathcal{I}^{\prime}_{1},\dots,\mathcal{I}^{\prime}_{N})$ , where $\mathcal{I}^{\prime}_{i}$ only covers $\{h\in\mathcal{H}\mid p(h)=i\}$ .

We thus see that the only formal difference between classical and augmented EFGs is the absence of the public partition $\mathcal{I}_{\textnormal{pub}}$ and the fact that classical infosets are only defined for the active player. In particular, any augmented EFG $E$ can be turned into a classical EFG $E^{\prime}=\texttt{ForgetNonActingI}(E)$ by throwing away $E$’s public partition and restricting each $\mathcal{I}_{i}$ to the histories where $i$ acts. For an intuitive comparison between the two formalisms, we invite the reader to compare Figures 3 and 3.

As with augmented EFGs, we can restrict our attention to games with perfect recall. However, the “no thick infosets” condition is both trivial and meaningless (since in a classical EFG, infosets are only defined when the player acts, so they cannot be thick unless the player has imperfect recall). This opens the door to a different pathological property – non-timeability:

Definition 3.8 (Timeability, [21]).

For a [classical] extensive-form game, a deterministic timing is a labelling of the nodes in $\mathcal{H}$ with non-negative real numbers such that the label of any node is at least one higher than the label of its parent. A deterministic timing is exact if any two nodes in the same information set have the same label.

We can then call a game timeable if it admits an exact deterministic timing and 1-timeable if each label is exactly one higher than its parent’s label. The difference between timeability and 1-timeability is, however, insignificant:
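The notions from Definition 3.8 translate directly into checks on a labelled tree. In the sketch below, histories are tuples of actions, `label` is a dictionary from histories to numbers, and infosets are sets of histories. All names and the toy tree are assumptions for illustration.

```python
def parent(h):
    return h[:-1]


def is_deterministic_timing(histories, label):
    """Labels are non-negative and grow by at least one from parent to child."""
    return all(label[h] >= 0 for h in histories) and all(
        label[h] >= label[parent(h)] + 1 for h in histories if h
    )


def is_exact(infosets, label):
    """All histories in the same infoset carry the same label."""
    return all(len({label[h] for h in I}) == 1 for I in infosets)


def is_one_timing(histories, label):
    """Every label is exactly one higher than the label of the parent."""
    return all(label[h] == label[parent(h)] + 1 for h in histories if h)


# Toy tree in which one infoset joins a depth-1 node and a depth-2 node.
histories = [(), ("l",), ("r",), ("r", "x")]
infosets = [{("l",), ("r", "x")}]
label = {(): 0, ("l",): 2, ("r",): 1, ("r", "x"): 2}
depth = {h: len(h) for h in histories}
```

In this toy tree, `label` is an exact deterministic timing, so the game is timeable, but `label` is not a 1-timing. The depth labelling is a 1-timing but is not exact, because the two histories in the infoset have different depths.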

Lemma 3.9 (Equivalence of timeable and 1-timeable EFGs).

Any classical timeable EFG can be made 1-timeable by adding chance nodes with a single noop action.

In Appendix D, we show that this modification will not increase the size of the game more than quadratically.

Proof sketch.

This can be shown by first realizing that any exact deterministic timing can, without loss of generality, be integer-valued. The equivalent 1-timeable game is then obtained by adding a single “dummy” chance node for any integer that the timing skipped on the way between some node $ha$ and its parent $h$ . ∎
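The padding step from the proof sketch can be illustrated as follows. The Python sketch assumes an integer-valued exact timing `label` and represents each dummy chance node by a hypothetical `"noop"` action inserted after the original action; the function name `pad_history` and the toy tree are also assumptions.

```python
def pad_history(h, label, noop="noop"):
    """Insert one dummy chance action for every integer the timing skips."""
    padded = []
    for k in range(1, len(h) + 1):
        node, par = h[:k], h[:k - 1]
        padded.append(h[k - 1])                       # the original action
        gap = label[node] - label[par] - 1            # integers skipped on this edge
        padded.extend([noop] * gap)                   # one dummy chance node each
    return tuple(padded)


# Toy tree: the timing skips one integer between the root and ("l",).
label = {(): 0, ("l",): 2, ("r",): 1, ("r", "x"): 2}
padded = {h: pad_history(h, label) for h in label}
```

After padding, `("l",)` becomes `("l", "noop")`, so its depth equals its label and it sits at the same depth as `("r", "x")`. More generally, the depth of each padded history equals its label minus the label of the root, which is why the depth labelling of the padded tree is a 1-timing that agrees with the original exact timing.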

There are two classes of non-timeable games. The more benign case consists of games where one player might take an arbitrarily high (but finite) number of actions between two actions of their opponent (for example, suppose I could keep eating cookies for as long as I keep rolling a six on a die). (Footnote 28: If we allowed for timings that only require the label to be higher (rather than “at least one higher”) than the parent label, timeability would encompass these games as well.) The truly obscure case consists of games whose non-timeability is caused by cyclic time-dependencies between nodes. (For example, the game depicted in Figure 5 has histories $g,g^{\prime}\in I\in\mathcal{I}^{\prime}_{1}$ and $h,h^{\prime}\in J\in\mathcal{I}^{\prime}_{2}$ for which $g$ is a parent of $h$ but $h^{\prime}$ is a parent of $g^{\prime}$.) The authors of [21] argue for avoiding non-timeable games because rather than corresponding to real-world problems, such games are merely an artefact of the EFG formalism.

With this terminology, we are finally ready to describe the relation between FOSGs and EFGs. Denoting by $E^{\prime}=:\texttt{ClassicalEFG}\left(G\right)$ the classical EFG obtained by turning $E=\texttt{ExtForm}\left(G\right)$ into $E^{\prime}=\texttt{ForgetNonActingI}(E)$, we have the following:

Theorem 3.10 (FOSGs as timeable EFGs).

For any FOSG $G$, the game $E^{\prime}=\texttt{ClassicalEFG}\left(G\right)$ is a classical 1-timeable perfect-recall EFG. Moreover, any classical 1-timeable perfect-recall EFG can be obtained this way. (Footnote 29, on 1-timeability: By Lemma 3.9, 1-timeable EFGs have the same expressive power as timeable ones, which makes the 1-timeability assumption a formality. Moreover, if we were to retract the assumption that players receive observations on every transition, we would be able to represent non-timeable games as well. However, the desirability of doing so is debatable.) (Footnote 30, on perfect recall: Regarding imperfect recall in FOSGs, see the remark below Theorem 3.6.)

In the remainder of this section, we look at some of the downsides of classical EFGs.

3.3 Unsuitability of Classical EFGs for Decomposition

Bibliography

[1] Badia, A.P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, D., Blundell, C., 2020. Agent57: Outperforming the Atari human benchmark. arXiv preprint arXiv:2003.13350.
[2] Billings, D., Burch, N., Davidson, A., Holte, R., Schaeffer, J., Schauenberg, T., Szafron, D., 2003. Approximating game-theoretic optimal strategies for full-scale poker, in: IJCAI, pp. 661–668.
[3] Boutilier, C., Dearden, R., Goldszmidt, M., 2000. Stochastic dynamic programming with factored representations. Artificial Intelligence 121, 49–107.
[4] Brown, N., Bakhtin, A., Lerer, A., Gong, Q., 2020. Combining deep reinforcement learning and search for imperfect-information games. arXiv preprint arXiv:2007.13544.
[5] Brown, N., Sandholm, T., 2017a. Safe and nested subgame solving for imperfect-information games, in: Advances in Neural Information Processing Systems, pp. 689–699.
[6] Brown, N., Sandholm, T., 2017b. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, eaao1733.
[7] Brown, N., Sandholm, T., 2019. Superhuman AI for multiplayer poker. Science 365, 885–890.
[8] Brown, N., Sandholm, T., Amos, B., 2018. Depth-limited solving for imperfect-information games, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7674–7685.
