Solving Simple Stochastic Games with few Random Nodes faster using Bland's Rule

David Auger; Pierre Coucheney; Yann Strozecki

arXiv:1901.05316·cs.DS·January 17, 2019


TL;DR

This paper introduces a faster algorithm for solving simple stochastic games with few random nodes by adapting Bland's rule, reducing randomness and improving expected running time.

Contribution

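It presents a simplified iterative variant of Ludwig's algorithm based on Bland's rule, and adapts the method to the algorithm of Gimbert and Horn, improving its $O(k!)$ worst-case bound to an expected running time of $2^{O(k)}$ for games with $k$ random nodes.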

Findings

1. Expected running time of $2^{O(k)}$ for $k$ random nodes

2. Reduced randomness compared to Ludwig's algorithm

3. Applicable to general random nodes with arbitrary outdegree

Abstract

The best algorithm so far for solving Simple Stochastic Games is Ludwig's randomized algorithm which works in expected $2^{O(\sqrt{n})}$ time. We first give a simpler iterative variant of this algorithm, using Bland's rule from the simplex algorithm, which uses exponentially fewer random bits than Ludwig's version. Then, we show how to adapt this method to the algorithm of Gimbert and Horn whose worst case complexity is $O(k!)$ , where $k$ is the number of random nodes. Our algorithm has an expected running time of $2^{O(k)}$ , and works for general random nodes with arbitrary outdegree and probability distribution on outgoing arcs.


Equations

$f^{\Theta}(\sigma_{0},F)=f^{\Theta}(\sigma_{0},F\cup\{v_{0}\})$

$f^{\Theta}(\sigma_{0},F)=f^{\Theta}(\sigma_{0},F\cup\{v_{0}\})+1+f^{\Theta}(\sigma_{2},F).$

$\sigma^{\prime}\leq\mathrm{opt}(\sigma^{\prime},F\cup\{v\})=\mathrm{opt}(\sigma_{0},F\cup\{v\}).$

$\mathrm{opt}(\sigma_{0},F\cup\{v_{0}\})<\sigma_{2}\leq\sigma^{\prime}.$

$\mathrm{opt}(\sigma_{0},F\cup\{v_{0}\})<\mathrm{opt}(\sigma_{0},F\cup\{v\}).$

$|\{x:a(x)\leq i\}|\leq i.$

$\Phi(n,k)=\sup_{G,\sigma,H}{\mathbb{E}}^{\Theta}\left[f^{\Theta}_{G}(\sigma,H)\right]$

${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H)\right]\leq{\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H\cup\{v_{0}^{t}\})\right]+1+{\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma^{\prime},H\cup B^{t})\right].$

${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H\cup\{v\})\right]\leq\Phi(n,k+1)$

${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H\cup\{v_{0}^{t}\})\right]\leq\Phi(n,k+1).$

${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma^{\prime},H\cup B^{t})\right]\leq\sum_{i=1}^{n-k}{\mathbb{P}}^{t}\left(|B^{t}|=i\right)\cdot\Phi(n,k+i).$

$\sum_{i=1}^{n-k}{\mathbb{P}}^{t}\left(|B^{t}|\leq i\right)\cdot\big(\Phi(n,k+i)-\Phi(n,k+i+1)\big)$

${\mathbb{P}}^{t}\left(|B^{t}|\leq i\right)\leq\frac{i}{n-k}$

${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma^{\prime},H\cup B^{t})\right]\leq\sum_{i=1}^{n-k}\frac{i}{n-k}\cdot\big(\Phi(n,k+i)-\Phi(n,k+i+1)\big)=\sum_{i=1}^{n-k}\frac{1}{n-k}\cdot\Phi(n,k+i).$


Full text

David Auger, Pierre Coucheney, Yann Strozecki: DAVID laboratory, University of Versailles Saint-Quentin-en-Yvelines

Copyright: David Auger, Pierre Coucheney and Yann Strozecki. Subject classification: Theory of computation → Algorithmic game theory.

Published in: 36th International Symposium on Theoretical Aspects of Computer Science (STACS 2019), March 13–16, 2019, Berlin, Germany. Editors: Rolf Niedermeier and Christophe Paul. Series volume 126, article no. 58.


Abstract

The best algorithm so far for solving Simple Stochastic Games is Ludwig’s randomized algorithm [21] which works in expected $2^{O(\sqrt{n})}$ time. We first give a simpler iterative variant of this algorithm, using Bland’s rule from the simplex algorithm, which uses exponentially fewer random bits than Ludwig’s version. Then, we show how to adapt this method to the algorithm of Gimbert and Horn [15] whose worst case complexity is $O(k!)$ , where $k$ is the number of random nodes. Our algorithm has an expected running time of $2^{O(k)}$ , and works for general random nodes with arbitrary outdegree and probability distribution on outgoing arcs.

Keywords: simple stochastic games, randomized algorithm, parametrized complexity, strategy improvement, Bland’s rule

1 Introduction

A simple stochastic game, SSG for short, is a two-player zero-sum game, a turn-based version of stochastic games introduced by Shapley [22]. SSGs were introduced by Condon [11] and provide a simple framework for studying the algorithmic complexity issues underlying reachability objectives. An SSG is played by moving a pebble on a graph. Some nodes are divided between players min and max: if the pebble reaches a node controlled by a player then she has to move the pebble along an arc leading to another node. Some other nodes are ruled by chance, the pebble following one outgoing arc according to some given probability distribution. Finally, there are sink nodes with a rational value, which is the gain that max-player achieves when the pebble reaches this sink.

Player max’s objective is, given a starting node for the pebble, to maximize the expectation of her gain against any strategy of min. One can show that it is enough to consider stationary deterministic strategies for both players [11]. Though seemingly simple since the number of stationary deterministic strategies is finite, the task of finding a pair of optimal strategies, or equivalently, of computing the so-called optimal values of nodes, is in complexity class $\mathrm{PPAD}$ [13] but not known to be in $\mathrm{P}$ .

Simple stochastic games are a powerful model since they can simulate many other games such as parity games, mean or discounted payoff games [2, 7]. However these games are believed to be simpler than SSGs and better algorithms are known for them; in particular, parity games can be solved in quasi-polynomial time [5]. Stochastic versions of the previous games also exist and are computationally equivalent to SSGs [2]. Interestingly, SSGs have many application domains, for instance autonomous urban driving [9], smart energy management [8], model checking of the modal $\mu$ -calculus [23], etc.

There are some restrictions of SSGs under which the problem of finding optimal strategies is tractable. If the game is acyclic, it can be solved in linear time, and in polynomial time for almost acyclic games (few cycles or small feedback arc sets) [3]. If there is no randomness, the game can be solved in almost linear time [1]. Furthermore, Gimbert and Horn were the first to extend this result by giving Fixed Parameter Tractable (FPT) algorithms in the number of random nodes [15]. They indeed show that optimal strategies depend only on the ordering of the values of random nodes, and not on their actual values. Using this idea, they devise two algorithms. The first one exhaustively enumerates these orders until it finds one that actually corresponds to optimal values. The second one is a strategy improvement algorithm based on an iterative refinement of the orders. Both have a complexity of $k!n^{O(1)}$ , where $k$ is the number of random nodes. It has been improved to $\sqrt{k!}n^{O(1)}$ expected time in [12], by randomly selecting a good strategy as a starting point for a strategy improvement algorithm. In fact, as remarked in [6], the distance between the values of two consecutive strategies in any strategy improvement algorithm depends on the number of random nodes. Hence any SSG can be solved in time $4^{k}n^{O(1)}$ (in fact $\sqrt{6}^{k}n^{O(1)}$ using Lemma $1.1$ in [3]). The complexity has been further improved to $2^{k}n^{O(1)}$ in [19], by using a value iteration algorithm. Here a bit of caution is in order; in some papers, random nodes can have an arbitrary outdegree and probability distribution on outgoing arcs, and in others they must be binary with uniform distribution. In the former case, if we denote by $p$ the bit-size of the largest probability distribution on a random node, the first two cited algorithms have a complexity of $p\cdot k!$ and $p\cdot\sqrt{k!}$ . On the other hand, the two algorithms with an exponential complexity in $k$ have an exponential dependency on $p$ when adapted to this context.

Without the previous restrictions, only algorithms running in exponential time are known. Most of them are strategy improvement algorithms, which produce a sequence of strategies of increasing values. These algorithms, such as the classical Hoffman-Karp [18] algorithm, rely on the switch operation, which, by a local best response, produces a strategy with better value. Several ways of choosing the nodes that are switched have been proposed [24], which can be compared to the rules for pivot selection for the simplex algorithm in linear programming. Though efficient in practice, these algorithms fail to run in polynomial time on well-designed inputs [14]. The best algorithm so far, proposed by Ludwig [21, 16], is also a strategy iteration algorithm using a randomized version of Bland’s rule [4] to choose a switch. It solves any SSG in expected time $2^{O(\sqrt{n})}$ . The first analysis of this kind of algorithm is due to Kalai [20] and it has been slightly improved recently [17].

Our contributions

In Sec. 3, we present an iterative variant of Ludwig’s recursive algorithm which uses fewer random bits. In the rest of the paper we adapt the idea of this algorithm to carefully enumerate orders of random nodes in an SSG. First, in Sec. 4, we present a pivot operation yielding a strategy improvement algorithm, which improves the one of [15]. This pivot operation comes from a randomized dichotomy on all orders that we explain in detail in Sec. 5, using an auxiliary game similar to the one of [12]. We prove that our algorithm finds the optimal strategies in expected time polynomial in $2^{k}$ and $p$ , where $k$ is the number of random nodes and $p$ is the maximum bit-length of a distribution on a random node, answering positively a question of Ibsen-Jensen and Miltersen [19].

2 Definitions and classic results on simple stochastic games

We here review definitions and results related to SSGs. We only sketch what we need and refer to longer expositions such as [11, 24] for more details.

Definition 2.1 (SSG)

A simple stochastic game (SSG) is defined by a directed graph $G=(V,A)$ , where $V$ is the set of nodes and $A$ the set of arcs, together with a partition of $V$ in four parts $V_{\text{\sc max}}$ , $V_{\text{\sc min}}$ , $V_{\text{\sc ran}}$ and $V_{\text{\sc sink}}$ , whose elements are respectively called max-nodes, min-nodes, ran-nodes (for random) and sinks. We require that every node $x\in V$ has outdegree at least one, while sink nodes have outdegree exactly $1$ consisting of a single loop on themselves. We also specify for every sink $x\in V_{\text{\sc sink}}$ a value $\mbox{Val}(x)$ which is a rational number, and for every random node $x\in V_{\text{\sc ran}}$ a rational probability distribution $p(x)$ on the outneighbours of $x$ .

In the original version of Condon [11], all nodes except sinks have outdegree exactly two, the probability distribution on every ran-node is $(\frac{1}{2},\frac{1}{2})$ , and there are only two sinks, one with value $0$ and another with value $1$ . Here, we allow more than two sinks, with general rational values, and also allow more than outdegree two for all non-sink nodes, with an arbitrary probability distribution for ran-nodes. However, for Ludwig’s Algorithm (see Algorithms 1, 2 and 3 in section 3) we shall suppose that all max-nodes have outdegree $2$ and call such games max-binary.

Strategies and values

We now define strategies, by which we mean stationary and pure strategies. This is enough for our purpose and it turns out to be sufficient for optimality, see [11]. Such strategies specify the choice of a neighbour for every node of a given player.

Definition 2.2 (Strategy)

A strategy for player max is a map $\sigma$ from $V_{\text{\sc max}}$ to $V$ such that $\forall x\in V_{\text{\sc max}},$ $(x,\sigma(x))\in A.$

Strategies for player min are defined analogously on min-nodes and are usually denoted by $\tau$ .

Definition 2.3 (play)

A play is a sequence of nodes $x_{0},x_{1},x_{2},\dots$ such that for all $t\geq 0$ , $(x_{t},x_{t+1})\in A.$ Such a play is consistent with strategies $\sigma$ and $\tau$ , respectively for player max and player min, if for all $t\geq 0$ , $x_{t}\in V_{\text{\sc max}}\Rightarrow x_{t+1}=\sigma(x_{t})$ and $x_{t}\in V_{\text{\sc min}}\Rightarrow x_{t+1}=\tau(x_{t}).$

A couple of strategies $\sigma,\tau$ and an initial node $x_{0}\in V$ define recursively a random play consistent with $\sigma,\tau$ by setting $(i)\;x_{t+1}=\sigma(x_{t})$ if $x_{t}\in V_{\text{\sc max}}$ , $(ii)\;x_{t+1}=\tau(x_{t})$ if $x_{t}\in V_{\text{\sc min}}$ , $(iii)\;x_{t+1}=x_{t}$ if $x_{t}\in V_{\text{\sc sink}}$ , and finally $(iv)\;x_{t+1}$ is one of the outneighbours of $x_{t}$ , randomly chosen independently of everything else according to probability distribution $p(x_{t})$ , if $x_{t}\in V_{\text{\sc ran}}$ .

Hence, this defines a probability measure ${\mathbb{P}}_{\sigma,\tau}^{x_{0}}$ on plays consistent with $\sigma,\tau$ . Note that if a play contains a sink node $x_{s}$ , then at every subsequent time the play stays in $x_{s}$ . Such a play is said to reach sink $x_{s}$ . To every play $x_{0},x_{1},\dots$ we associate a value which is the value of the sink reached by the play if any, and $0$ otherwise. If we denote by $X$ this value, then $X$ is a random variable once two strategies and an initial node $x$ are fixed. We are interested in the expected value of this quantity, which we call the value of a node $x\in V$ under strategies $\sigma,\tau$ : $\mbox{Val}_{\sigma,\tau}(x)={\mathbb{E}}_{\sigma,\tau}^{x}\left(X\right)$ where ${\mathbb{E}}_{\sigma,\tau}^{x}$ is the expected value under probability ${\mathbb{P}}_{\sigma,\tau}^{x}$ .

The goal of player max is to maximize this (expected) value, and the best she can ensure against a strategy $\tau$ is $\mbox{Val}_{*,\tau}(x):=\max_{\sigma}\mbox{Val}_{\sigma,\tau}(x)$ where the maximum is taken over all max-strategies (of which there are finitely many). Similarly, against $\sigma$ player min can ensure that the expected value is at most $\mbox{Val}_{\sigma,*}(x):=\min_{\tau}\mbox{Val}_{\sigma,\tau}(x).$

Finally, the value of a node $x$ is $\mbox{Val}_{*,*}(x):=\max_{\sigma}\mbox{Val}_{\sigma,*}(x)=\min_{\tau}\mbox{Val}_{*,\tau}(x).$ The fact that these two quantities are equal is nontrivial, and it can be found for instance in [11]. A pair of strategies $\sigma^{*},\tau^{*}$ such that, for all nodes $x$ , $\mbox{Val}_{\sigma^{*},\tau^{*}}(x)=\mbox{Val}_{*,*}(x)$ always exists and these strategies are said to be optimal strategies. It is polynomial-time equivalent to compute optimal strategies or to compute the values of all nodes in the game.
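As a minimal sketch of these definitions, the following Python code approximates $\mbox{Val}_{\sigma,\tau}$ by repeatedly applying the one-step expectation, then brute-forces both sides of the max-min equality. The toy game, its node names and the helper names (`val`, `strategies`, `max_min`, `min_max`) are hypothetical and chosen here for illustration; they do not come from the paper.

```python
from itertools import product

# Hypothetical toy SSG: two sinks, one max-node "a", one min-node "m", two ran-nodes.
SINK = {"s0": 0.0, "s1": 1.0}
RAN = {"r1": [("s1", 0.5), ("m", 0.5)], "r2": [("s0", 0.5), ("a", 0.5)]}
MAX_ARCS = {"a": ["r1", "r2"]}
MIN_ARCS = {"m": ["r2", "s1"]}
NODES = list(SINK) + list(RAN) + list(MAX_ARCS) + list(MIN_ARCS)

def val(sigma, tau, sweeps=300):
    """Approximate Val_{sigma,tau}; starting from 0 makes plays that never stop count as 0."""
    v = {x: 0.0 for x in NODES}
    v.update(SINK)
    for _ in range(sweeps):
        for r, dist in RAN.items():
            v[r] = sum(p * v[y] for y, p in dist)
        for x in MAX_ARCS:
            v[x] = v[sigma[x]]
        for x in MIN_ARCS:
            v[x] = v[tau[x]]
    return v

def strategies(arcs):
    """All stationary pure strategies: one outneighbour per controlled node."""
    keys = list(arcs)
    return [dict(zip(keys, choice)) for choice in product(*(arcs[k] for k in keys))]

SIGMAS, TAUS = strategies(MAX_ARCS), strategies(MIN_ARCS)
table = {(i, j): val(s, t) for i, s in enumerate(SIGMAS) for j, t in enumerate(TAUS)}

def max_min(x):
    """max over sigma of Val_{sigma,*}(x)."""
    return max(min(table[i, j][x] for j in range(len(TAUS))) for i in range(len(SIGMAS)))

def min_max(x):
    """min over tau of Val_{*,tau}(x)."""
    return min(max(table[i, j][x] for i in range(len(SIGMAS))) for j in range(len(TAUS)))
```

Enumerating all strategy pairs is exponential and only meant to make the definitions concrete on a tiny instance.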

Definition 2.4 (Stopping SSG)

An SSG is said to be stopping if for every couple of strategies almost all plays eventually reach a sink node.

Usually, this condition is required in order to ensure simple optimality conditions (Thm. 2.5 below). Condon [11] proved that every SSG $G$ can be reduced in polynomial time to a stopping SSG $G^{\prime}$ whose size is quadratic in the size of $G$ , and whose values almost remain the same. The values of the new game are close enough to recover the values of the original game. A problem for us is that squaring the size of the game does not behave well relatively to precise complexity bounds.

However, in our case we need a milder condition. We call a max-strategy $\sigma$ stopping if, for any min-strategy $\tau$ , the random play consistent with $(\sigma,\tau)$ reaches a sink with probability one.

Theorem 2.5 (Optimality conditions, [11])

Let $G$ be an SSG, $\sigma$ a stopping max-strategy and $\tau$ a min-strategy. Then $(\sigma,\tau)$ are optimal strategies if and only if

• for every $x\in V_{\text{\sc max}}$ , $\displaystyle\mbox{Val}_{\sigma,\tau}(x)=\max_{(x,y)\in A}\mbox{Val}_{\sigma,\tau}(y)$ ;

• for every $x\in V_{\text{\sc min}}$ , $\displaystyle\mbox{Val}_{\sigma,\tau}(x)=\min_{(x,y)\in A}\mbox{Val}_{\sigma,\tau}(y)$ .

Switches and strategy improvement

Consider the usual partial order on real vectors indexed by $V$ , i.e. for $w_{1},w_{2}\in{\mathbb{R}}^{V}$ , denote $w_{1}\leq w_{2}$ if $w_{1}(x)\leq w_{2}(x)$ for all $x\in V$ , and denote $w_{1}<w_{2}$ if $w_{1}\leq w_{2}$ and at least one inequality is strict. For two max-strategies $\sigma,\sigma^{\prime}$ , simply denote $\sigma\leq\sigma^{\prime}$ (resp. $\sigma<\sigma^{\prime}$ ) if $\mbox{Val}_{\sigma,*}\leq\mbox{Val}_{\sigma^{\prime},*}$ (resp. $\mbox{Val}_{\sigma,*}<\mbox{Val}_{\sigma^{\prime},*}$ ). Define a similar order on min-strategies.

Given a strategy, a switch consists in changing this strategy at a node (or a set of nodes) in order to obtain a new one.

Definition 2.6

Let $\sigma,\sigma^{\prime}$ be max-strategies. We say that $\sigma^{\prime}$ is a profitable switch of $\sigma$ if for all $x\in V_{\text{\sc max}}$ , one has $\mbox{Val}_{\sigma,*}(\sigma^{\prime}(x))\geq\mbox{Val}_{\sigma,*}(\sigma(x))$ with this condition strict for at least one max-node (such a node is said to be switchable).

Indeed, the following result states that such a switch actually improves values:

Theorem 2.7 ([10], [24])

If $\sigma^{\prime}$ is a profitable switch of $\sigma$ , then $\sigma^{\prime}>\sigma$ .

Before ending this section, note that Th. 2.5 can be restated in terms of the nonexistence of switchable nodes. Hence, we have the following result:

Theorem 2.8

A stopping max-strategy is optimal if and only if it has no switchable nodes.

For the last section, we require another form of switch.

Theorem 2.9 ([10], [24])

Let $\sigma,\sigma^{\prime}$ be stopping max-strategies and $\tau,\tau^{\prime}$ be min-strategies such that for all $x\in V_{\text{\sc max}}$ , $\mbox{Val}_{\sigma,\tau}(\sigma^{\prime}(x))\geq\mbox{Val}_{\sigma,\tau}(\sigma(x))$ and for all $x\in V_{\text{\sc min}}$ , $\mbox{Val}_{\sigma,\tau}(\tau^{\prime}(x))\geq\mbox{Val}_{\sigma,\tau}(\tau(x))$ with one of these conditions strict for at least one node. Then $\mbox{Val}_{\sigma^{\prime},\tau^{\prime}}>\mbox{Val}_{\sigma,\tau}.$

Orders

For $k\geq 1$ consider the set of integers $[1,k]=\{1,2,\cdots,k\}$ and let ${\cal T}(k)$ denote the set of total orders on $[1,k]$ . For the sake of clarity we view these orders as sets of couples $(i,j)\in[1,k]^{2}$ satisfying reflexivity, transitivity and antisymmetry.

If $t\in{\cal T}(k)$ , it can also be described in ascending ordering such as $[x_{1},x_{2},\dots,x_{k}]$ where $(x_{i},x_{j})\in t$ if and only if $i\leq j$ . An interval in $t$ is a sequence of consecutive elements in ascending ordering. The rank of an element $x\in[1,k]$ is the number of elements that are lower than or equal to $x$ in $t$ , i.e. it is $i$ if $x=x_{i}$ with the notation above.

For lack of a better word, we define a pretotal order as an antisymmetric and reflexive relation and denote by ${\cal P}(k)$ the set of pretotal orders on $[1,k]$ . If $p\in{\cal P}(k)$ and $(i,j)\not\in p$ is such that $p\cup\{(i,j)\}$ is still antisymmetric, we denote simply by $p+(i,j)$ this new pretotal order.

If $t\in{\cal T}(k)$ and $v_{1},v_{2},\cdots,v_{k}$ are real numbers, we say that the $v_{i}$ ’s are nondecreasing along $t$ if $(i,j)\in t\Rightarrow v_{i}\leq v_{j}$ . Likewise, we say that $t$ is a nondecreasing order for $v_{1},v_{2},\dots,v_{k}$ .
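To make the two views of a total order concrete, here is a small Python sketch, assuming an order is stored as its ascending list $[x_{1},\dots,x_{k}]$. The function names `pairs`, `rank` and `nondecreasing_along` are illustrative choices, not notation from the paper.

```python
def pairs(ascending):
    """The order t as a set of couples (i, j), meaning i <= j in t (reflexive pairs included)."""
    n = len(ascending)
    return {(ascending[a], ascending[b]) for a in range(n) for b in range(a, n)}

def rank(ascending, x):
    """Number of elements that are lower than or equal to x in t."""
    return ascending.index(x) + 1

def nondecreasing_along(ascending, v):
    """True if (i, j) in t implies v[i] <= v[j], for values v given as a dict."""
    return all(v[i] <= v[j] for i, j in pairs(ascending))

t = [3, 1, 2]                      # 3 is lowest, then 1, then 2
v = {1: 0.5, 2: 0.7, 3: 0.5}       # hypothetical values v_1, v_2, v_3
```

With these values, `t` is a nondecreasing order for `v`, whereas the ascending list `[1, 2, 3]` is not, because $v_{2}>v_{3}$.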

3 Iterative formulation of Ludwig’s algorithm

In this part, we suppose that $G$ is max-binary. Hence, if a node $x$ is switchable there is a single possibility for changing the strategy’s choice at this node. Let $switch(\sigma,x)$ denote the profitable switch obtained from $\sigma$ by switching $\sigma$ at node $x$ .

3.1 Bland’s rule version

In [21], Ludwig mentions that his algorithm is a version of Bland’s rule; however, he does not make it explicit and gives a recursive definition. We formulate his algorithm iteratively (see Algorithm 1), and show that instead of randomly choosing a node at every step, we can choose a total order on nodes prior to the execution of the algorithm. This version uses far fewer random bits: $O(n\log n)$ bits instead of $2^{O(\sqrt{n})}$ on average in Ludwig’s version.

By Theorems 2.7 and 2.8, if we keep switching the strategy $\sigma$ until there are no more switchable nodes, we reach an optimal strategy in a finite number of steps. The number of steps is at most the number of max-strategies, i.e. $2^{|V_{\text{\sc max}}|}$ . However, we have the following:

Theorem 3.10

The expected number of strategies considered by Alg. 1 is at most $e^{2\sqrt{|V_{\text{\sc max}}|}}$ .
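Algorithm 1 itself is not reproduced in this text. The Python sketch below is therefore only an assumption-laden illustration of the idea described in this section: draw the order $\Theta$ once, then repeatedly switch the first switchable node according to $\Theta$. The choice of the first node in the order is inferred from the discussion in Sec. 3.2, and the toy max-binary game (no min-nodes, so $\mbox{Val}_{\sigma,*}$ is simply the value under $\sigma$), the tolerance `eps` and all function names are hypothetical.

```python
import random

# Hypothetical max-binary toy game without min-nodes.
SINK = {"s0": 0.0, "s1": 1.0}
RAN = {"r1": [("s1", 0.5), ("s0", 0.5)],
       "r2": [("b", 0.5), ("s0", 0.5)],
       "r3": [("a", 0.5), ("s1", 0.5)]}
MAX_ARCS = {"a": ("r1", "r2"), "b": ("r1", "r3"), "c": ("r2", "r3")}

def values(sigma, sweeps=200):
    """Approximate node values under sigma by iterating the one-step expectation."""
    val = {x: 0.0 for x in list(RAN) + list(MAX_ARCS)}
    val.update(SINK)
    for _ in range(sweeps):
        for r, dist in RAN.items():
            val[r] = sum(p * val[y] for y, p in dist)
        for x in MAX_ARCS:
            val[x] = val[sigma[x]]
    return val

def other(sigma, x):
    """The unique alternative choice at a max-node of outdegree 2."""
    u, v = MAX_ARCS[x]
    return v if sigma[x] == u else u

def switchable(sigma, eps=1e-9):
    """Max-nodes whose alternative outneighbour has a strictly larger value."""
    val = values(sigma)
    return [x for x in MAX_ARCS if val[other(sigma, x)] > val[sigma[x]] + eps]

def bland(sigma, theta):
    """Switch the first switchable node in order theta until none remains."""
    sigma, switches = dict(sigma), 0
    while True:
        candidates = set(switchable(sigma))
        if not candidates:
            return sigma, switches
        x = next(y for y in theta if y in candidates)
        sigma[x] = other(sigma, x)
        switches += 1

theta = list(MAX_ARCS)
random.shuffle(theta)  # the only random choice, made before the run starts
final, n_switches = bland({"a": "r2", "b": "r1", "c": "r2"}, theta)
```

Once `theta` is drawn, the run is deterministic, which is the point of fixing the order in advance rather than sampling a node at every step.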

3.2 Analysis of Algorithm 1

Our strategy to prove Theorem 3.10 is to reformulate Alg. 1 as a recursive algorithm (see Alg. 3), which is close to Ludwig’s algorithm in [21]. The proofs are quite similar to Ludwig’s, with a bit of caution on the moments where random choices are made. In particular, we detail our strategy in this part since it will be helpful to understand our results in section 4 where the context is more involved.

Stated as above, it is perhaps unclear how Alg. 1 has a recursive structure. To see this, consider an execution of Alg. 1, and let $x_{1}$ be the last max-node in the order $\Theta$ . In the beginning, the current strategy $\sigma$ makes an initial choice $\sigma(x_{1})$ on $x_{1}$ , which does not change until the first time when $x_{1}$ becomes switchable (if this happens). If $x_{1}$ is switched, then $\sigma(x_{1})$ will then remain unchanged until the end of this algorithm. Hence, once $\Theta$ is fixed, we can think of this execution as two parts, where $\sigma(x_{1})$ is fixed in each part. These can then be decomposed as subparts where $\sigma(x_{1})$ and $\sigma(x_{2})$ are fixed (where $x_{2}$ is the second-to-last max-node in order $\Theta$ ), and so on.

Generalization to partially fixed strategies

To formalize the discussion above, we give a generalization which can be applied to the case where $\sigma(x)$ is fixed for some vertices in a given set $F$ (see Alg. 2).

In the following, if $F$ is a set of max-nodes and $\sigma$ is a max-strategy, a $(\sigma,F)$ -compatible strategy is any max-strategy $\sigma^{\prime}$ such that $\forall x\in F$ , $\sigma^{\prime}(x)=\sigma(x).$ For $F$ and $\sigma$ fixed, there is always a $(\sigma,F)$ -compatible strategy that is better than all others. It can be obtained by solving the game where any $x\in F$ is replaced by a random node that goes to $\sigma(x)$ with probability $1$ . We call such a $(\sigma,F)$ -compatible strategy optimal and we denote it by $\mathrm{opt}(\sigma,F)$ . In particular, an optimal $(\sigma,\emptyset)$ -compatible strategy is an optimal strategy for $G$ , whereas $\sigma$ is the only $(\sigma,V_{\text{\sc max}})$ -compatible strategy.

Recursive reformulation

Finally, we give a recursive version of Alg. 2 (see Alg. 3) which we use to derive the bound. The equivalence between these two algorithms should be clear by the previous explanations.

Evaluating the number of switches

Let $f^{\Theta}(\sigma,F)$ be the total number of switches performed by Algorithm 3 on input $\sigma,\Theta,F$ . We consider for the following lemma an execution of this algorithm.

Lemma 3.11

Let $\sigma_{0}$ be the initial strategy and $v_{0}$ be the last node which is not in $F$ , according to order $\Theta$ . Define $B\subset V_{\text{\sc max}}\setminus F$ to be the set of nodes $v$ such that $\mathrm{opt}(\sigma_{0},F\cup\{v\})\not>\mathrm{opt}(\sigma_{0},F\cup\{v_{0}\}).$ Then $f^{\Theta}(\sigma_{0},F)\leq f^{\Theta}(\sigma_{0},F\cup\{v_{0}\})+1+f^{\Theta}(\sigma_{2},F\cup B),$ where $\sigma_{2}$ is $\sigma_{1}=\mathrm{opt}(\sigma_{0},F\cup\{v_{0}\})$ switched at $v_{0}$ .

Proof 3.12

By design of the algorithm, nodes of $F$ are never switched. If $v_{0}$ is never switched, then we have

$f^{\Theta}(\sigma_{0},F)=f^{\Theta}(\sigma_{0},F\cup\{v_{0}\})$

hence the result is true.

Suppose from now on that $v_{0}$ is switched during the execution. Then, it is switched only once and we can divide the computation of $\mathrm{opt}(\sigma_{0},F)$ in two parts:

• in a first part the algorithm computes $\mathrm{opt}(\sigma_{0},F\cup\{v_{0}\})$ , and this last strategy is switched at $v_{0}$ , hence obtaining $\sigma_{2}$ ;

• then in a second part the algorithm computes $\mathrm{opt}(\sigma_{2},F)$ .

Hence, we have in the case that $v_{0}$ is switched,

$f^{\Theta}(\sigma_{0},F)=f^{\Theta}(\sigma_{0},F\cup\{v_{0}\})+1+f^{\Theta}(\sigma_{2},F).$

It remains to see that in the second part, nodes from $B$ will never be switched, so that $f^{\Theta}(\sigma_{2},F)=f^{\Theta}(\sigma_{2},F\cup B)$ , hence the result.

Let $\sigma^{\prime}$ be a strategy obtained during the second part of the algorithm (after $v_{0}$ is switched) such that $\sigma^{\prime}(v)=\sigma_{0}(v)$ for a $v\in V_{\text{\sc max}}\setminus F$ . On the one hand, by definition of $\mathrm{opt}$ we have

$\sigma^{\prime}\leq\mathrm{opt}(\sigma^{\prime},F\cup\{v\})=\mathrm{opt}(\sigma_{0},F\cup\{v\}).$

On the other hand, since $\sigma^{\prime}$ is obtained after $\sigma_{2}$ , then

$\mathrm{opt}(\sigma_{0},F\cup\{v_{0}\})<\sigma_{2}\leq\sigma^{\prime}.$

By transitivity we see that

$\mathrm{opt}(\sigma_{0},F\cup\{v_{0}\})<\mathrm{opt}(\sigma_{0},F\cup\{v\}),$

hence $v\not\in B$ .

Therefore, in the second part of the algorithm, all strategies $\sigma^{\prime}$ satisfy $\sigma^{\prime}(v)\neq\sigma_{0}(v)$ for all $v\in B$ , hence nodes in $B$ have been switched in the first part and never will be in the second.

Now, let us denote $\Phi(n)=\sup_{G,\sigma}{\mathbb{E}}^{\Theta}\left[f^{\Theta}_{G}(\sigma,\emptyset)\right]$ where the supremum is considered over all SSG $G$ with $n$ max-nodes and all max-strategies $\sigma$ in $G$ . The average is considered over all possible prior choices of order $\Theta$ , the rest of the algorithm being deterministic.

Lemma 3.13

For all $n\geq 1$ , $\Phi(n)\leq\Phi(n-1)+1+\frac{1}{n}\sum_{i=0}^{n-1}\Phi(i).$

We shall first need the following result.

Lemma 3.14

Let $(X,\leq)$ be a partially ordered set and define for any $x\in X$ $a(x)=|\{y\in X:y\not>x\}|$ , i.e. the number of elements that are not greater than $x$ . Then for all $0\leq i\leq|X|$ , we have

$|\{x:a(x)\leq i\}|\leq i.$

Proof 3.15

Write $n=|X|$ , fix $0\leq i\leq n$ and consider the set $P\subset X$ of $x\in X$ with $a(x)\leq i$ . If $P$ is empty there is nothing to prove; otherwise let $x_{0}$ be maximal among elements of $P$ . Since there are at least $n-i$ elements in $X$ that are strictly greater than $x_{0}$ , and since these elements are not in $P$ by maximality of $x_{0}$ , we have $|P|+n-i\leq n$ i.e. $|P|\leq i$ .
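The statement of Lemma 3.14 can be checked by brute force on small examples. The sketch below is only a sanity check, not part of the proof; the divisibility order on $\{1,\dots,12\}$ and the helper names are illustrative assumptions.

```python
def a(x, elems, less):
    """Number of elements y of the poset that are not strictly greater than x."""
    return sum(1 for y in elems if not less(x, y))

def lemma_holds(elems, less):
    """Check |{x : a(x) <= i}| <= i for every 0 <= i <= |X|."""
    counts = [a(x, elems, less) for x in elems]
    return all(sum(1 for c in counts if c <= i) <= i for i in range(len(elems) + 1))

# Hypothetical example: x < y when x strictly divides y.
elems = list(range(1, 13))

def strictly_divides(x, y):
    return x != y and y % x == 0
```

On a chain the bound is tight (exactly $i$ elements satisfy $a(x)\leq i$), while on an antichain every element has $a(x)=|X|$.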

Proof 3.16 (of Lemma 3.13)

First, for $n\geq 1$ and $0\leq k\leq n$ , define

$\Phi(n,k)=\sup_{G,\sigma,H}{\mathbb{E}}^{\Theta}\left[f^{\Theta}_{G}(\sigma,H)\right]$

where the supremum is considered over all SSG $G$ with $n$ max-nodes, subsets $H\subset V_{\text{\sc max}}$ of size $k$ and all max-strategies $\sigma$ .

Consider $G,H$ and $\sigma$ fixed, with $|V_{\text{\sc max}}|=n$ and $|H|=k$ . Using notation of Lemma 3.11, we have

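${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H)\right]\leq{\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H\cup\{v_{0}^{t}\})\right]+1+{\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma^{\prime},H\cup B^{t})\right].$

Here, we denote $v_{0}$ and $B$ by $v_{0}^{t}$ and $B^{t}$ to stress the fact that these are random variables depending on $\Theta$ , whereas everything else (i.e. $G,H,\sigma$ ) is fixed.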

First, since for all $v\not\in H$ ,

${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H\cup\{v\})\right]\leq\Phi(n,k+1)$

we have

${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma,H\cup\{v_{0}^{t}\})\right]\leq\Phi(n,k+1).$

Now,

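${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma^{\prime},H\cup B^{t})\right]\leq\sum_{i=1}^{n-k}{\mathbb{P}}^{t}\left(|B^{t}|=i\right)\cdot\Phi(n,k+i).$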

The sum on the right can easily be rewritten as

$\sum_{i=1}^{n-k}{\mathbb{P}}^{t}\left(|B^{t}|\leq i\right)\cdot\big(\Phi(n,k+i)-\Phi(n,k+i+1)\big)$

where we defined for convenience $\Phi(n,n+1)=0$ , and used the fact that $|B^{t}|\geq 1$ .
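This rewriting is a summation by parts. As a sketch, introduce the shorthand $m=n-k$ , $p_{i}={\mathbb{P}}^{t}(|B^{t}|=i)$ , $P_{i}={\mathbb{P}}^{t}(|B^{t}|\leq i)$ and $\varphi_{i}=\Phi(n,k+i)$ (these symbols are used here only for illustration). Then $P_{0}=0$ because $|B^{t}|\geq 1$ , and $\varphi_{m+1}=0$ by the convention $\Phi(n,n+1)=0$ :

```latex
\begin{align*}
\sum_{i=1}^{m} p_{i}\,\varphi_{i}
  &= \sum_{i=1}^{m} (P_{i}-P_{i-1})\,\varphi_{i}
   = \sum_{i=1}^{m} P_{i}\,\varphi_{i} - \sum_{i=0}^{m-1} P_{i}\,\varphi_{i+1} \\
  &= \sum_{i=1}^{m} P_{i}\,\varphi_{i} - \sum_{i=1}^{m} P_{i}\,\varphi_{i+1}
     && \text{(drop } P_{0}\varphi_{1}=0 \text{, add } P_{m}\varphi_{m+1}=0\text{)} \\
  &= \sum_{i=1}^{m} P_{i}\,(\varphi_{i}-\varphi_{i+1}).
\end{align*}
```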

Using now Lemma 3.14 on the set of strategies $\mathrm{opt}(\sigma,H\cup\{v\})$ for $v\not\in H$ , we see that

${\mathbb{P}}^{t}\left(|B^{t}|\leq i\right)\leq\frac{i}{n-k}$

since $v_{0}$ is uniformly chosen in $V_{\text{\sc max}}\setminus H$ .

So, since $\Phi(n,k+i)\geq\Phi(n,k+i+1)$ , we deduce that

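${\mathbb{E}}^{\Theta}\left[f^{\Theta}(\sigma^{\prime},H\cup B^{t})\right]\leq\sum_{i=1}^{n-k}\frac{i}{n-k}\cdot\big(\Phi(n,k+i)-\Phi(n,k+i+1)\big)=\sum_{i=1}^{n-k}\frac{1}{n-k}\cdot\Phi(n,k+i).$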

In order to conclude and prove Theorem 3.10, we now just have to infer the bound for sequences satisfying the conclusion of Lemma 3.13.

Lemma 3.17 (Lemma $9$ of [21])

Let $\Phi(n)$ be such that $\Phi(0)=0$ and for all $n\geq 1$ , $\Phi(n)\leq\Phi(n-1)+1+\frac{1}{n}\sum_{i=0}^{n-1}\Phi(i).$ Then for all $n\geq 0$ , $\Phi(n)\leq e^{2\sqrt{n}}.$
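For a numerical feel of this bound, the sketch below computes the largest sequence allowed by the recurrence (taking equality at every step, which dominates any other solution by induction) and compares it with $e^{2\sqrt{n}}$ . The function name `phi_upper` and the range tested are choices made here; this is a sanity check, not a proof.

```python
import math

def phi_upper(n_max):
    """Sequence with Phi(0) = 0 and equality in the recurrence for 1 <= n <= n_max."""
    phi = [0.0]
    for n in range(1, n_max + 1):
        # sum(phi) is evaluated before the append, so it equals Phi(0) + ... + Phi(n-1).
        phi.append(phi[n - 1] + 1 + sum(phi) / n)
    return phi

phi = phi_upper(300)
bound = [math.exp(2 * math.sqrt(n)) for n in range(301)]
```

The first terms are $\Phi(1)=1$ and $\Phi(2)=2.5$ , already well below $e^{2}$ and $e^{2\sqrt{2}}$ .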

4 Simple stochastic games with few random nodes

The idea that in an SSG, the optimal strategies depend only on the ordering of the values of ran-nodes, and not on their actual values, has been introduced by Gimbert and Horn in [15]. Their main idea is that, if one gives an ordering $r_{1}r_{2}\cdots r_{k}$ of ran-nodes such that $\mbox{Val}_{*,*}(r_{i})$ is nondecreasing with $i$ , then max will try to reach a node $r_{i}$ with $i$ as high as possible, whereas min will try to minimize this index; this idea is hereafter formalized by the notion of forcing sets and forcing strategies (sec. 4.1). Gimbert and Horn use this fact to derive an algorithm that enumerates all possible orders on ran-nodes and identifies one with the property mentioned above, yielding the optimal strategies and values for $G.$

The algorithm that we describe and analyse in the rest of this paper (Alg. 4) uses the same principle, but iterates through orders in a special way, similarly to the iteration through strategies made by Ludwig’s algorithms (see sec. 3). We will derive a similar bound for the average number of iterations of this randomized algorithm. Hence, our main algorithm is still a variation on Bland’s rule for pivot selection. The difficulty here does not lie in the proof of the bound, but in the description of the technique used to iterate on orders.

In [15], the game remains the same during the execution of the algorithm, but we proceed differently:

• in section 4.1, we describe how to associate to every total order $t\in{\cal T}(k)$ a new SSG $G[t]$ , and we show that this game can be solved in polynomial time.

• in section 4.2, we prove that there is an optimal order $t^{*}\in{\cal T}(k)$ such that the optimal values of $G[t^{*}]$ give directly the optimal values of $G$ ; it is also the order that maximises values of $G[t]$ among all total orders $t$ . If an order $t$ is not optimal, we describe a pivot operation yielding from $t$ a new order $t^{\prime}$ such that the optimal values of $G[t^{\prime}]$ improve those of $G[t]$ .

• the proof of the bound will be derived in section 5.

4.1 Modified game and forcing strategies

We need to assume that the games we consider enjoy some basic properties in order to describe our algorithm without considering too many special cases.

Definition 4.18

An SSG is in canonical form (CF) if max has a stopping strategy and only ran-nodes can have an outgoing arc to a sink.

To ensure these conditions, one can first find, in linear time, all nodes from which the min player can force the game to never reach a sink or a ran-node (see e.g. [1, 11]). These nodes have value $0$ and can be removed from the game. Then, all probabilities on ran-nodes are modified by giving them a very small probability to go to a sink. One can prove as in [11] that values remain almost the same. The second condition ensures that all max and min nodes have to reach a ran-node in order to reach a sink. It can be enforced by adding a dummy random node before every sink.

In all that follows we suppose that $G$ is an SSG in CF with random nodes $r_{1},r_{2},\dots,r_{k}$ . Let $t\in{\cal T}(k)$ be a total order on $[1,k]$ . We define a game $G[t]$ as follows (the same construction is presented in [12]). Start with a copy of $G$ . For every $1\leq i\leq k$ , add a min-node denoted $i$ to $G[t]$ , which we call control node; add an arc $(i,r_{i})$ ; for every arc $(x,r_{i})\in A$ , remove this arc and add an arc $(x,i)$ ; finally, for every $(i,j)\in t$ , $i\neq j$ , add the arc $(i,j)$ to $G[t]$ .

So basically, every control node $i\in[1,k]$ intercepts all arcs entering $r_{i}$ (see Fig. 1), and has an arc to every other control node $j\in[1,k]$ which is greater than $i$ in $t$. In the game $G[t]$, the sets of sinks, max-nodes and ran-nodes remain the same as in $G$, whereas the set of min-nodes will be denoted $V_{\text{\sc min}}\cup[1,k]$, where $V_{\text{\sc min}}$ is the set of min-nodes in $G$. This allows us to directly identify max-strategies in $G[t]$ and in $G$, and to identify projections onto $V_{\text{\sc min}}$ of min-strategies in $G[t]$ with min-strategies in $G$.

Now, suppose we remove first all sinks and random nodes of $G[t]$ , and then turn every control node $i$ into a sink with a value equal to its rank in $t$ . This transformation clearly turns $G[t]$ into a game $G^{\prime}$ without random nodes.

Definition 4.19 (Forcing strategy)

By identifying strategies in $G[t]$ and $G^{\prime}$, we say that any optimal strategy for max or min in $G^{\prime}$ is a $t$-forcing strategy of $G[t]$.

In $t$ -forcing strategies, the players try to ensure the reaching of a control node as high as possible for max, and as low as possible for min, in the order $t$ . We refer to [1] and [15] for more details about how one can compute these optimal strategies in linear time, using the so-called deterministic attractors.

Definition 4.20 (Forcing set)

For any control node $i\in[1,k]$, define the forcing set for $i$, denoted $\text{\sc For}[t](i)$, as the set of max and min-nodes that reach $i$ if the game is played with a couple $(\sigma_{t},\tau_{t})$ of $t$-forcing strategies (forcing sets are independent of the choice of the strategies as long as they are $t$-forcing).
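Forcing sets can be computed with attractors, as the following sketch illustrates. It is not the linear-time procedure of [1] and [15]: it recomputes one max-attractor per threshold of $t$, which costs $O(k)$ attractor computations, and it assumes that an infinite play in $G^{\prime}$ is worth less to max than reaching any control node. The names `max_attractor` and `forcing_sets`, and the dictionaries `owner` (player of each max or min node) and `succ` (outgoing arcs of each max or min node in $G^{\prime}$), are hypothetical.

```python
from collections import deque


def max_attractor(targets, owner, succ):
    """Nodes from which max can force the play to reach `targets`."""
    pred = {}
    for u, outs in succ.items():
        for v in outs:
            pred.setdefault(v, []).append(u)
    # For a min-node: number of outgoing arcs not yet leading into the attractor.
    remaining = {u: len(outs) for u, outs in succ.items()}
    attr = set(targets)
    queue = deque(targets)
    while queue:
        v = queue.popleft()
        for u in pred.get(v, []):
            if u in attr:
                continue
            remaining[u] -= 1
            # A max-node needs one arc into the attractor, a min-node all of them.
            if owner[u] == "max" or remaining[u] == 0:
                attr.add(u)
                queue.append(u)
    return attr


def forcing_sets(t, owner, succ):
    """Map each control node i to For[t](i); t lists control nodes in ascending order."""
    reached = {}
    # Scan thresholds from the highest rank down: the first threshold at which
    # a node enters the attractor gives the control node it reaches.
    for pos in range(len(t) - 1, -1, -1):
        for v in max_attractor(t[pos:], owner, succ):
            reached.setdefault(v, t[pos])
    sets = {c: set() for c in t}
    for v, c in reached.items():
        if v in owner:  # keep only max and min nodes
            sets[c].add(v)
    return sets
```

On a toy game with `t = [2, 1]`, a max-node `a` with arcs to control nodes 1 and 2, a min-node `b` with arcs to `a` and to control node 2, and a max-node `c` with a single arc to `b`, the sketch returns `{2: {"b", "c"}, 1: {"a"}}`: max sends `a` to the highest control node, while min sends `b` to the lowest one.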

An example of an SSG turned into a modified SSG and of computation of forcing strategies is presented in Fig. 2.

Here are basic properties on $G[t]$ which should explain why we consider this game.

Lemma 4.21

(i) if $G$ has a stopping max-strategy, so does $G[t]$;

(ii) optimal values $\mbox{Val}_{*,*}(i)$ of control nodes $i\in[1,k]$ in $G[t]$ are nondecreasing along $t$;

(iii) optimal strategies in $G[t]$ coincide with forcing strategies for order $t$ on $V_{\text{\sc max}}\cup V_{\text{\sc min}}$;

(iv) the game $G[t]$ can be solved in polynomial time.

Proof 4.22

To see why $(i)$ is true, note that since $t$ is an antisymmetric relation, the arcs added between control nodes do not create new cycles among min-nodes.

Suppose now that $(i,j)\in t$ . By optimality for the min player, and since $G[t]$ is stopping, $\mbox{Val}_{*,*}(i)$ is the minimum value of $\mbox{Val}_{*,*}(x)$ for all outneighbours $x$ of $i$ (see Th. 2.5). Since $j$ is an outneighbour of $i$ in $G[t]$ , we have $\mbox{Val}_{*,*}(i)\leq\mbox{Val}_{*,*}(j)$ . Hence $(ii)$ is true.

Now, consider replacing in $G[t]$ every control node $i\in[1,k]$ by a new sink $s_{i}$ with value $\mbox{Val}_{*,*}(i)$ . Clearly the values of this new game remain the same. But, by construction of $G[t]$ , random nodes have no incoming arcs and they could be as well removed without changing the optimal values on $V_{\text{\sc max}}\cup V_{\text{\sc min}}$ . By reducing the game in this way, we get a deterministic game whose optimal values on $V_{\text{\sc max}}\cup V_{\text{\sc min}}$ are the same as those of $G[t]$ . By definition, optimal strategies of this game are $t$ -forcing strategies, hence $(iii)$ is true.

Finally, to solve $G[t]$ we can choose a couple $(\sigma_{t},\tau_{t})$ of $t$ -forcing strategies and search for optimal strategies in $G[t]$ that match with $(\sigma_{t},\tau_{t})$ on $V_{\text{\sc max}}\cup V_{\text{\sc min}}$ . Hence, the strategy of all max-nodes is fixed, and only min-strategies on control nodes are computed by solving a one player SSG. It can be done in polynomial time by linear programming (see [11]).

As explained in the proof above, to solve $G[t]$ , it is enough to compute $t$ -forcing strategies on $V_{\text{\sc max}}\cup V_{\text{\sc min}}$ , which can be done in linear time, and then to solve a one player SSG with only $O(k)$ nodes.

4.2 Value intervals and pivot

In what follows, we write $\mbox{Val}[t]$ for the vector of optimal values of $G[t]$ .

Definition 4.23 (Constrained control node)

We say that a control node $i\in[1,k]$ is constrained in $G[t]$ if $\mbox{Val}[t](i)<\mbox{Val}[t](r_{i})$ .

Constrained control nodes are similar to switchable nodes in SSGs. In fact, we can characterize optimality of an order by the absence of constrained nodes, as follows.

Lemma 4.24 (Optimal order)

Let $t\in{\cal T}(k)$. The game $G[t]$ does not have any constrained control nodes if and only if the forcing strategies $(\sigma_{t},\tau_{t})$ are optimal strategies for $G$. In this case we say that $t$ is an optimal order for $G$.

Proof 4.25

First note that since $G$ is in CF, $\sigma_{t}$ is always stopping.

If $G[t]$ does not have any constrained control nodes, then optimal strategies are the forcing strategies $(\sigma_{t},\tau_{t})$ on $V_{\text{\sc max}}\cup V_{\text{\sc min}}$, together with the choice $(i,r_{i})$ for each control node $i\in[1,k]$. Then, by merging the control nodes with their associated random node while removing the unused arcs between the control nodes (hence recovering the initial game $G$), the values on the remaining nodes are kept, and so are the optimality conditions of Th. 2.5.

If $(\sigma_{t},\tau_{t})$ are optimal strategies for $G$, then the values $v_{1},v_{2},\dots,v_{k}$ of the ran-nodes are nondecreasing along order $t$. Hence, by turning $G$ into $G[t]$ and extending strategies $(\sigma_{t},\tau_{t})$ with the choice $(i,r_{i})$ for each control node $i\in[1,k]$, we will obtain values that satisfy optimality conditions and such that $\mbox{Val}[t](i)=\mbox{Val}[t](r_{i})$, showing that $i$ is not constrained.

We define the value interval of a control node $i\in[1,k]$ as the set of $j\in[1,k]$ that share the same optimal value in $G[t]$ , i.e. $\mbox{Val}[t](i)=\mbox{Val}[t](j)$ . This set is indeed an interval in order $t$ by $(ii)$ of Lemma 4.21, i.e. its elements are consecutive in order $t$ .

Definition 4.26

The pivot operation on a control node $i\in[1,k]$ for the order $t$ is the transformation of $t$ into a new order $t^{\prime}\in{\cal T}(k)$ , obtained by moving $i$ just after the end of its value interval in $t$ .

Note that if $i$ is the last node of its value interval, then the pivot operation does nothing. Also note that if $i$ is constrained, it cannot be the last node of its value interval (we shall only pivot on constrained control nodes).

Example. Let $k=7$ and let $t$ be in ascending order $[7,2,4,1,3,6,5]$ . Suppose that the values of control nodes are, in this order $[0.2,0.2,0.3,0.3,0.3,0.4,0.4]$ . The value intervals are $[7,2]$ , $[4,1,3]$ and $[6,5]$ . The pivot operation on $4$ places $4$ after $3$ , so that the obtained order would be $[7,2,1,3,4,6,5]$ .
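The example can be replayed with a short Python sketch. The function names `value_intervals` and `pivot`, and the encoding of $\mbox{Val}[t]$ on control nodes as a dictionary `values`, are assumptions made for illustration; values are compared with `==`, so an exact representation (e.g. `fractions.Fraction`) would be preferable to floats in a real computation.

```python
def value_intervals(t, values):
    """Split the order t (ascending list) into maximal runs of equal value."""
    intervals = []
    for i in t:
        if intervals and values[intervals[-1][-1]] == values[i]:
            intervals[-1].append(i)
        else:
            intervals.append([i])
    return intervals


def pivot(t, values, i):
    """Move control node i just after the end of its value interval in t."""
    pos = t.index(i)
    end = pos
    while end + 1 < len(t) and values[t[end + 1]] == values[i]:
        end += 1
    # If i is already last in its interval (end == pos), t is returned unchanged.
    return t[:pos] + t[pos + 1:end + 1] + [i] + t[end + 1:]


t = [7, 2, 4, 1, 3, 6, 5]
values = {7: 0.2, 2: 0.2, 4: 0.3, 1: 0.3, 3: 0.3, 6: 0.4, 5: 0.4}
print(value_intervals(t, values))  # [[7, 2], [4, 1, 3], [6, 5]]
print(pivot(t, values, 4))         # [7, 2, 1, 3, 4, 6, 5]
```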

The following theorem shows that the pivot operation increases the value vector, which will enable us to design a strategy improvement algorithm on the forcing strategies (where the improvement is on $\mbox{Val}[t]$ rather than on values in the original game $G$ ). A similar theorem is proved in [12] to build a different strategy improvement algorithm.

Theorem 4.27

Let $t\in{\cal T}(k)$ and let $i\in[1,k]$ be a constrained control node. If $t^{\prime}$ is obtained from $t$ by pivoting on $i$, then $\mbox{Val}[t^{\prime}]>\mbox{Val}[t]$.

Proof 4.28

Consider a new game $G[t+t^{\prime}]$ which is obtained from $G$ like $G[t]$ and $G[t^{\prime}]$ but with arcs $(i,j)$ for $i\neq j$ between control nodes for all $(i,j)\in t\cup t^{\prime}$. Let $(\sigma,\tau)$ and $(\sigma^{\prime},\tau^{\prime})$ be respective optimal strategies in $G[t]$ and $G[t^{\prime}]$. We can interpret these strategies as strategies in $G[t+t^{\prime}]$. Since the only difference between $G[t]$, $G[t^{\prime}]$ and $G[t+t^{\prime}]$ are the arcs between control nodes, the strategies $(\sigma,\tau)$ give exactly the same values in $G[t]$ and in $G[t+t^{\prime}]$, and a similar observation can be made for $G[t^{\prime}]$. Hence, to prove the result, it is enough to show that $\mbox{Val}_{\sigma^{\prime},\tau^{\prime}}>\mbox{Val}_{\sigma,\tau}$ in $G[t+t^{\prime}]$. Note that whereas $\sigma$ and $\sigma^{\prime}$ are respective stopping max-strategies of $G[t]$ and $G[t^{\prime}]$, they might not be stopping in $G[t+t^{\prime}]$. However it is not difficult to see that the conclusion of Th. 2.9 would still apply. Hence it is sufficient to show in $G[t+t^{\prime}]$ that changing $(\sigma,\tau)$ into $(\sigma^{\prime},\tau^{\prime})$ makes a nondecreasing switch on every node, and is increasing in at least one node.

In the order $t$, let $I=[i,i_{1},i_{2},\cdots,i_{\ell}]$ be the increasing sequence of consecutive nodes sharing the same value as $i$ for $(\sigma,\tau)$ (i.e. the value interval of $i$, starting from $i$). Since $i$ is constrained, it is not the last node of its value interval, so $\ell\geq 1$.

The pivot operation transforms this part of $t$ into $[i_{1},i_{2},\cdots,i_{\ell},i]$ , hence the only differences between $G[t]$ and $G[t^{\prime}]$ are the $\ell$ arcs $(i,i_{c})$ for $1\leq c\leq\ell$ that are inverted into $(i_{c},i)$ . Hence, if $j\not\in I$ , it keeps the same position relatively to all other control nodes when we change the order $t$ into $t^{\prime}$ , hence $\text{\sc For}[t](j)=\text{\sc For}[t^{\prime}](j)$ . Hence, when we change $(\sigma,\tau)$ to $(\sigma^{\prime},\tau^{\prime})$ , either there is no switch in $\text{\sc For}[t](j)$ , or it is between nodes of the same values.

For nodes $x\in\text{\sc For}[t](j)$ with $j\in I$ , clearly $\sigma^{\prime}(x)$ (resp. $\tau^{\prime}(x)$ ) is in some $\text{\sc For}[t](j^{\prime})$ with $j^{\prime}\in I$ . Since all these nodes also share the same value (by definition of the value interval), these switches are also between nodes of same values.

Suppose that there is a decreasing switch on a control node, i.e. for some $j\in[1,k]$ we have $\mbox{Val}[t](\tau^{\prime}(j))<\mbox{Val}[t](j)$. In this case $\tau^{\prime}(j)$ must be strictly before $j$ in $t$, since optimal values are nondecreasing along $t$. So we cannot have $(j,\tau^{\prime}(j))\in t$ but must have $(j,\tau^{\prime}(j))\in t^{\prime}$. The only possibility is $\tau^{\prime}(j)=i$ and $j\in I$. Since these nodes are in the same value interval, this switch is again between nodes of the same value, a contradiction.

We showed that no switch from $(\sigma,\tau)$ to $(\sigma^{\prime},\tau^{\prime})$ is decreasing. Now consider the case of $i$ during the pivot operation. Since $i$ is constrained, $\mbox{Val}[t](i)<\mbox{Val}[t](r_{i})$. Since $\tau^{\prime}(i)$ is either equal to $r_{i}$ or to some $j$ which is strictly after the value interval of $i$ in order $t$, and hence has a greater value, the switch at $i$ must be increasing.

4.3 Main algorithm

Algorithm 4 consists in iterating on orders $t\in{\cal T}(k)$: at each step it selects a constrained control node $i$ of $t$ according to the randomized rule below and updates $t$ by a pivot on $i$, until an optimal order is reached.

Here is the pivot selection rule. First, prior to the execution of the algorithm, we choose randomly and uniformly an order $\Theta$ on the set of all $\frac{k(k-1)}{2}$ unordered pairs of control nodes $\{i,j\}$ , with $i,j\in[1,k]$ . Then, at each step of the algorithm, consider the game $G[t]$ , and remove one by one the arcs between control nodes, following order $\Theta$ . During this process, choose as pivot the first constrained control node, if any, which is disconnected from the following nodes of its value interval. In more detail, for a given order $t$ , compute $\mbox{Val}[t]$ and then partition the control nodes into value intervals. Each constrained control node $i$ has $d(i)$ arcs leading to other control nodes from the same value interval, where $d(i)$ is its distance in $t$ to the last element of this interval. Enumerating $\Theta$ in ascending order, the pivot is the first constrained node $i$ whose $d(i)$ arcs are encountered.

Example. Continued from the previous example with $k=7$ and value intervals $[7,2]$ , $[4,1,3]$ and $[6,5]$ . Suppose that the order $\Theta$ starts $\{2,5\},\{7,6\},\{1,4\},\{2,7\}\dots.$ The first element that is disconnected from its value interval is $7$ which is the one we choose as a pivot leading to order $[2,7,4,1,3,6,5]$ .
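The pivot selection rule can be sketched as follows. The function names `draw_theta` and `select_pivot` are hypothetical, and the set `constrained` must be supplied by the caller after solving $G[t]$. The example above does not state which nodes are constrained, so the usage below assumes that every node that is not last in its value interval is constrained.

```python
import itertools
import random


def draw_theta(k, rng):
    """Uniformly random order on the k(k-1)/2 unordered pairs of control nodes."""
    pairs = list(itertools.combinations(range(1, k + 1), 2))
    rng.shuffle(pairs)
    return pairs


def select_pivot(t, values, constrained, theta):
    """First constrained node whose d(i) arcs inside its value interval are met in theta."""
    pos = {c: p for p, c in enumerate(t)}
    remaining = {}
    for i in constrained:
        # d(i): distance in t from i to the last element of its value interval.
        p = pos[i] + 1
        while p < len(t) and values[t[p]] == values[i]:
            p += 1
        remaining[i] = p - pos[i] - 1
    for a, b in theta:
        first, second = (a, b) if pos[a] < pos[b] else (b, a)
        # The arc (first, second) only matters inside a value interval.
        if first in remaining and values[first] == values[second]:
            remaining[first] -= 1
            if remaining[first] == 0:
                return first
    return None  # no constrained node: t is an optimal order


t = [7, 2, 4, 1, 3, 6, 5]
values = {7: 0.2, 2: 0.2, 4: 0.3, 1: 0.3, 3: 0.3, 6: 0.4, 5: 0.4}
theta_prefix = [(2, 5), (7, 6), (1, 4), (2, 7)]
print(select_pivot(t, values, {7, 4, 1, 6}, theta_prefix))  # 7
```

Only the prefix of $\Theta$ given in the example is needed here: the pair $\{1,4\}$ removes one of the two arcs of node $4$, and the pair $\{2,7\}$ then removes the only arc of node $7$, which is therefore selected.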

By Th. 4.27, no order $t\in{\cal T}(k)$ is repeated during the execution of Algorithm 4; since ${\cal T}(k)$ is finite, the algorithm reaches in a finite number of steps an order $t^{*}\in{\cal T}(k)$ which has no constrained node, i.e. which is optimal by Lemma 4.24. Hence, Algorithm 4 computes optimal strategies for $G$ in at most $k!$ steps. However, we claim the following result, which will be proved in the next section.

Theorem 4.29

Alg. 4 computes optimal strategies for $G$ in at most $e^{\sqrt{2}\cdot k}$ expected steps.

Note that for $k$ large enough we have $e^{\sqrt{2}\cdot k}<k!$, since $k!$ grows roughly as $2^{k\log k}$. Moreover, the algorithm uses $O(k^{2}\log k)$ random bits to choose the order $\Theta$ on pairs.
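The comparison between the two bounds can be checked numerically. The sketch below works with logarithms to avoid overflow; the function names are hypothetical, and `random_bits` computes the information-theoretic quantity $\lceil\log_{2}(m!)\rceil$ with $m=\frac{k(k-1)}{2}$, which is one way to read the $O(k^{2}\log k)$ estimate.

```python
import math


def log_expected_bound(k):
    """Natural logarithm of e^{sqrt(2) k}."""
    return math.sqrt(2) * k


def log_factorial(k):
    """Natural logarithm of k!, computed as lgamma(k + 1)."""
    return math.lgamma(k + 1)


def crossover():
    """Smallest k with e^{sqrt(2) k} < k!."""
    k = 1
    while log_expected_bound(k) >= log_factorial(k):
        k += 1
    return k


def random_bits(k):
    """ceil(log2(m!)) for m = k(k-1)/2 pairs of control nodes."""
    m = k * (k - 1) // 2
    return math.ceil(math.lgamma(m + 1) / math.log(2))


print(crossover())  # 9
```

The inequality then holds for every $k\geq 9$: going from $k$ to $k+1$ adds $\sqrt{2}$ to the logarithm of the first bound but $\ln(k+1)\geq\ln 10>\sqrt{2}$ to the logarithm of $k!$.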

5 Analysis of Algorithm 4

In this section we prove Theorem 4.29. To do this we shall reformulate Algorithm 4 as a recursive algorithm, but we need additional notions for this. The recursive formulation also reveals the nature of the algorithm: it computes an optimal order on control nodes by finding the right order between each pair of these nodes using dichotomy. This allows the same analysis as for Ludwig’s Algorithm and its variants.

5.1 Modified game $G[p]$ for a pretotal order $p$

If $p\in{\cal P}(k)$ is a pretotal order, we define $G[p]$ exactly as was defined $G[t]$ for a total order $t\in{\cal T}(k)$ in section 4.1. The only difference is that, since $p$ is not total, a control node $i\in[1,k]$ only has arcs to those $j\neq i\in[1,k]$ such that $(i,j)\in p$ .

To simplify notation, for any node $x$ in $G[p]$ , define $\mbox{Val}_{*}[p](x):=\mbox{Val}_{*,*}^{G[p]}(x)$ as the optimal value of $x$ in $G[p]$ . We can now directly extend some of the observations of Lemma 4.21 to pretotal orders.

Lemma 5.30

If $p\in{\cal P}(k)$ , then optimal values of control nodes $i\in[1,k]$ in $G[p]$ are nondecreasing in order $p$ , i.e. if $(i,j)\in p$ then $\mbox{Val}_{*}[p](i)\leq\mbox{Val}_{*}[p](j)$ .

In order to solve $G[p]$ , the algorithm will recursively compute an optimal total ordering of control nodes $i\in[1,k]$ extending $p$ . Thus, for all total orders $t\in{\cal T}(k)$ extending $p\in{\cal P}(k)$ , we need to assign a value in $G[p]$ , which we denote $\mbox{Val}[p](t)$ . Here is how we define it.

Definition 5.31

Let $t\in{\cal T}(k)$ extend $p\in{\cal P}(k)$. The values $\mbox{Val}[p](t)$ associated to $t$ in $G[p]$ are the values $\mbox{Val}_{\sigma_{t},\tau_{t}}$ where $\sigma_{t}$ and $\tau_{t}$ satisfy:

(i) $\sigma_{t}$ and $\tau_{t}$ are forcing strategies for $G[t]$;

(ii) $\tau_{t}$ satisfies the min-optimality conditions (Th. 2.5) on every control node $i\in[1,k]$.

As a summary, $\mbox{Val}_{*}[p]$ is the vector of optimal values of game $G[p]$ while $\mbox{Val}[p](t)$ is the vector of optimal values of $G[p]$ when the strategies in $V_{\text{\sc min}}$ and $V_{\text{\sc max}}$ are forcing strategies. It follows that $\mbox{Val}[p](t)\leq\mbox{Val}_{*}[p]$ . Recall that $\mbox{Val}[t]$ is the vector of optimal values of $G[t]$ . Then we have $\mbox{Val}[t]=\mbox{Val}_{*}[t]=\mbox{Val}[t](t)$ .

Definition 5.32 (Optimal order)

Let $p\in{\cal P}(k)$ and $t\in{\cal T}(k)$ extending $p$ ( $p\subset t$ ). We say that $t$ is an optimal total order for $p$ if $\mbox{Val}_{*}[p]=\mbox{Val}[p](t)$ .

The next lemma proves the existence and gives a characterization of optimal orders.

Lemma 5.33

Suppose $G$ is in CF and let $p\in{\cal P}(k),t\in{\cal T}(k),p\subset t$ . Then the following conditions are equivalent:

(i) $t$ is an optimal order for $p$;

(ii) $t$ is a nondecreasing ordering of the values $\mbox{Val}_{*}[p](i)$ for $i\in[1,k]$;

(iii) $\mbox{Val}_{*}[p]=\mbox{Val}[t]$.

Proof 5.34 (Proof of Lemma 5.33)

First, note that, by definition, optimality conditions are satisfied at control nodes in the definition of $\mbox{Val}[p](t)$ , so it is always true that values $\mbox{Val}[p](t)(i)$ are nondecreasing along $p$ .

Suppose now that $t$ is optimal for $p$ , i.e. $\mbox{Val}_{*}[p]=\mbox{Val}[p](t)$ , and suppose that $t$ is not a nondecreasing ordering of the values $\mbox{Val}_{*}[p](i)$ for $i\in[1,k]$ . Then there must be two consecutive $i,j$ in order $t$ such that $\mbox{Val}_{*}[p](i)>\mbox{Val}_{*}[p](j)$ , and we must have $(i,j)\in t\setminus p$ . Consider the order $t^{\prime}$ that we obtain from $t$ by inverting $j$ and $i$ . Clearly, this order also extends $p$ since the only inversion between $t$ and $t^{\prime}$ is $(i,j)$ . By an argument similar to the proof of Theorem 4.27, it is easy to obtain that $\mbox{Val}[p](t^{\prime})>\mbox{Val}[p](t)$ , which contradicts optimality. Hence we proved $(i)\Rightarrow(ii)$ .

Now suppose $(ii)$ . Let $(\sigma^{*},\tau^{*})$ be an optimal strategy of $G[p]$ ; we show that optimality conditions are met in $G[t]$ . First note that $(\sigma^{*},\tau^{*})$ has the same values in $G[p]$ and $G[t]$ . For the nodes in $V_{\text{\sc max}}\cup V_{\text{\sc min}}$ , no new arcs are added so the optimality conditions are still satisfied. Now, consider control nodes. An arc $(i,j)\in t\backslash p$ cannot lead to a lower value for $i$ by assumption $(ii)$ . Hence the optimality conditions are still satisfied on control nodes, strategies $(\sigma^{*},\tau^{*})$ are optimal for $G[t]$ , so finally $\mbox{Val}_{*}[p]=\mbox{Val}[t]$ and we proved $(ii)\Rightarrow(iii)$ .

Since $t$ involves more arcs than $p$ between the control min-nodes, we have $\mbox{Val}[t]\leq\mbox{Val}[p](t)\leq\mbox{Val}_{*}[p]$. Assume $(iii)$, then $\mbox{Val}_{*}[p]=\mbox{Val}[p](t)$. Hence we proved $(iii)\Rightarrow(i)$.

5.2 Recursive formulation

We now give Algorithm 5, a recursive formulation of Algorithm 4. We will prove that these two algorithms compute exactly the same sequence of total orders, and use the recursive formulation to derive a bound.

Definition 5.35

Let $p_{0}\in{\cal P}(k)$ and $\{i,j\}\notin p_{0}$ such that $p_{1}=p_{0}+(i,j)$ is still a pretotal order. We say that the addition of $(i,j)$ to $p_{0}$ is constraining, or that $(i,j)$ is constrained, if $\mbox{Val}_{*}[p_{1}]<\mbox{Val}_{*}[p_{0}]$ .

When an arc is constrained, it is essential to the min-optimal strategy in $G[p_{1}]$ ; in other words removing this arc would increase optimal values.

Lemma 5.36

Suppose $G$ is in CF and let $p\in{\cal P}(k)$ , $t\in{\cal T}(k)$ , $p\subset t$ . Then the following conditions are equivalent:

(i) $t$ is an optimal ordering for $p$;

(ii) the addition of every arc $(i,j)\in t\setminus p$ to $p$ is not constraining.

Proof 5.37 (Proof of Lemma 5.36)

Let $(i,j)\in t\setminus p$ and $p_{1}=p+(i,j)$. Since $p\subset p_{1}\subset t$, we have $\mbox{Val}[t]\leq\mbox{Val}_{*}[p_{1}]\leq\mbox{Val}_{*}[p]$.

Assume that $t$ is an optimal order for $p$ . If the addition of $(i,j)$ to $p$ is constraining, then $\mbox{Val}_{*}[t]\leq\mbox{Val}_{*}[p_{1}]<\mbox{Val}_{*}[p]$ which contradicts $\mbox{Val}[t]=\mbox{Val}_{*}[p]$ by Lemma 5.33. Hence we proved $(i)\Rightarrow(ii)$ .

Assume that no arc in $t\backslash p$ is constraining. Then, add sequentially arcs in $t\backslash p$ to $G[p]$ until we get $G[t]$ , hence forming a sequence of games $G[p]=G[p_{0}],G[p_{1}],\dots,G[t]$ . If none of these arcs is used in $G[t]$ then $\mbox{Val}[t]=\mbox{Val}_{*}[p]$ and $(i)$ is proved. Otherwise consider the first arc $(i,j)$ such that $\mbox{Val}_{*}[p_{\ell}+(i,j)]<\mbox{Val}_{*}[p_{\ell}]$ . This implies that $\mbox{Val}_{*}[p_{\ell}](j)<\mbox{Val}_{*}[p_{\ell}](i)$ . But $\mbox{Val}_{*}[p]=\mbox{Val}_{*}[p_{\ell}]$ since no constraining arc has been added until step $\ell$ . Hence $\mbox{Val}_{*}[p](j)<\mbox{Val}_{*}[p](i)$ , and finally $(i,j)$ is constraining for $p$ , a contradiction.

Consider a run of the recursive algorithm and let $t$ be a total order at any step of the run. Let us inspect the first time where $t$ is modified. Order $t$ will be optimal for a sequence of pretotal orders that are obtained from $t$ by removing one by one pairs in order $\Theta$ as long as they are not constrained. This in fact amounts to ascending the recursive call tree. Let $(i,j)$ be the first constrained pair and $p_{0}$ the pretotal order obtained once $(i,j)$ is removed. Then $t$ is turned into $t^{\prime}$ by pivoting on node $i$ as we did in iterative Alg. 4. We now show that control node $i$ is the same as the pivot selected by the pivot selection rule.

Note first that, during the process of removing the pairs one by one in order $\Theta$ , the value intervals of $G[t]$ are kept unchanged until the pretotal order $p_{0}+(i,j)$ is reached. Since removing $(i,j)$ implies an increase of the optimal values, it means that $i$ and $j$ were in the same value interval and that $i$ had no other neighbour in that interval. Note here that, as a consequence, $p_{0}+(j,i)$ is then guaranteed to be a pretotal order. Clearly, node $i$ is the first control node in that situation. So the choice of node $i$ exactly obeys the pivot selection rule.

Finally, the following lemma enables us to analyze the complexity of Algorithm 5.

Lemma 5.38

Let $p_{0}\in{\cal P}(k)$ and $\{i,j\}\notin p_{0}$ such that $p_{1}=p_{0}+(i,j)$ and $p_{2}=p_{0}+(j,i)$ are pretotal orders and where the addition of $(i,j)$ to $p_{0}$ is constraining. Let $t_{1}$ be an optimal total order for $p_{1}$ , $t_{2}$ obtained from $t_{1}$ by pivoting in $i$ , and let $t^{*}_{2}$ be an optimal total order for $p_{2}$ .

Let $(i_{1},j_{1})$ be such that $\mbox{Val}_{*}[p_{0}+(i,j)]\not\leq\mbox{Val}_{*}[p_{0}+(i_{1},j_{1})]$. Then for any total order $t$ obtained by Algorithm 5 between $t_{1}$ and $t_{2}^{*}$ (including those), one has $(j_{1},i_{1})\in t$.

Proof 5.39

Suppose that $(i_{1},j_{1})\in t$ . Then $t\supset p_{0}+(i_{1},j_{1})$ hence $\mbox{Val}[t]\leq\mbox{Val}_{*}[p_{0}+(i_{1},j_{1})]$ .

On the other hand, since the pivot operation is increasing values, we have $\mbox{Val}_{*}[p_{0}+(i,j)]=\mbox{Val}[t_{1}]<\mbox{Val}[t_{2}]\leq\mbox{Val}[t]$ , so $\mbox{Val}_{*}[p_{0}+(i,j)]\leq\mbox{Val}_{*}[p_{0}+(i_{1},j_{1})]$ , a contradiction.

Using this result, the proof for the complexity bound is the same as the proof of Theorem 3.10 using the recursive formulation. Let $f^{\Theta}(G,p_{0},t_{0})$ be the total number of pivots performed by Algorithm 5 on input $G,p_{0},t_{0}$ for an order $\Theta$ on pairs.

Now define $\Phi(m)=\sup_{G,p_{0},t_{0}}{\mathbb{E}}^{\Theta}\left[f^{\Theta}(G,p_{0},t_{0})\right]$ where the supremum is taken over all games $G$, pretotal orders $p_{0}$ and total orders $t_{0}$ extending $p_{0}$ such that $t_{0}\setminus p_{0}$ is of size at most $m$. The expectation is taken over all possible uniform choices for $\Theta$.

Then by Lemma 5.38, $\Phi(m)$ will satisfy Lemma 3.13, hence the claimed bound of Th. 4.29 by Lemma 3.17 since the depth of the recursive tree is at most $\frac{k(k-1)}{2}$ .

Bibliography

[1] Daniel Andersson, Kristoffer Arnsfelt Hansen, Peter Bro Miltersen, and Troels Bjerre Sørensen. Deterministic graphical games revisited. In Conference on Computability in Europe, pages 1–10. Springer, 2008.
[2] Daniel Andersson and Peter Bro Miltersen. The complexity of solving stochastic games on graphs. In International Symposium on Algorithms and Computation, pages 112–121. Springer, 2009.
[3] David Auger, Pierre Coucheney, and Yann Strozecki. Finding optimal strategies of almost acyclic simple stochastic games. In International Conference on Theory and Applications of Models of Computation, pages 67–85. Springer, 2014.
[4] Robert G Bland. New finite pivoting rules for the simplex method. Mathematics of Operations Research, 2(2):103–107, 1977.
[5] Cristian S Calude, Sanjay Jain, Bakhadyr Khoussainov, Wei Li, and Frank Stephan. Deciding parity games in quasipolynomial time. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 252–263. ACM, 2017.
[6] Krishnendu Chatterjee, Luca de Alfaro, and Thomas A Henzinger. Termination criteria for solving concurrent safety and reachability games. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 197–206. SIAM, 2009.
[7] Krishnendu Chatterjee and Nathanaël Fijalkow. A reduction from parity games to simple stochastic games. In GandALF, pages 74–86, 2011.
[8] Taolue Chen, Vojtěch Forejt, Marta Kwiatkowska, David Parker, and Aistis Simaitis. Automatic verification of competitive stochastic systems. Formal Methods in System Design, 43(1):61–92, 2013.
