Parity Objectives in Countable MDPs
Stefan Kiefer1,
Richard Mayr2,
Mahsa Shirmohammadi1,
Dominik Wojtczak3
1University of Oxford, UK
2University of Edinburgh, UK
3University of Liverpool, UK
Abstract
We study countably infinite MDPs with parity objectives,
and special cases with a bounded number of colors in the Mostowski hierarchy
(including reachability, safety, Büchi and co-Büchi).
In finite MDPs there always exist optimal
memoryless deterministic (MD) strategies
for parity objectives,
but this does not generally hold for countably infinite MDPs.
In particular, optimal strategies need not exist.
For countable infinite MDPs, we provide a
complete picture of the memory requirements of
optimal (resp., ϵ-optimal) strategies for all objectives
in the Mostowski hierarchy.
In particular, there is a strong dichotomy between two different types of
objectives. For the first type, optimal strategies, if they exist, can be
chosen MD, while for the second type optimal strategies require infinite
memory.
(I.e., for all objectives in the Mostowski hierarchy, if finite-memory randomized strategies suffice then also MD-strategies suffice.)
Similarly, some objectives admit ϵ-optimal MD-strategies,
while for others ϵ-optimal strategies require infinite memory.
Such a dichotomy also holds for the subclass of countably infinite MDPs that are
finitely branching, though more objectives admit MD-strategies here.
Index Terms:
countable MDPs, parity objectives, strategies, memory requirement
I Introduction
Markov decision processes (MDPs) are a standard model for dynamic systems that
exhibit both stochastic and controlled behavior [23].
The system starts in the initial state and makes a sequence of transitions between states. Depending on the type of the current state, either the controller gets to
choose an enabled transition (or a distribution over transitions), or the next
transition is chosen randomly according to a defined distribution.
By fixing a strategy for the controller, one obtains a probability space
of plays of the MDP.
The goal of the controller is to optimize the expected value of
some objective function on the plays of the MDP.
The fundamental questions are “what is the optimal value that the controller
can achieve?”, “does there exist an optimal strategy, or only
ϵ-optimal approximations?”, and “which types of strategies are
optimal or ϵ-optimal?”.
Such questions have been studied extensively for finite MDPs (see
e.g. [10] for a survey) and also for certain types of
countably infinite MDPs [23, 21].
However, the literature on countable MDPs is mainly focused on objective functions defined w.r.t. numeric
costs (or rewards) that are assigned to transitions, e.g. (discounted) expected total reward
or limit-average reward. In contrast, we study qualitative objectives that are
expressed by Parity conditions and which are motived by formal verification questions.
There are works that studied particular classes
of countably infinite, but finitely branching,
MDPs that arise from models in automata theory
[14, 2, 8, 6, 1].
In each of these papers, a crucial part of the analysis is establishing the existence of optimal strategies of particular structure and memory requirements, but none of them looked at proving
such properties for general countable MDPs.
Countable MDPs also naturally occur in the analysis of queueing systems [17], gambling [4], and branching processes [22], which have multiple applications.
They also show up in the analysis of finite-state models, e.g. in two-player stochastic games [24, 13] when reasoning about an optimal strategy against a fixed (randomised and memory-full) strategy of the opponent.
Finite MDPs vs. Infinite MDPs:
It should be noted that many standard properties (and proof techniques)
of finite MDPs do not
carry over to infinite MDPs.
E.g., given some objective, consider the set of all states in an MDP
that have nonzero value.
If the MDP is finite then this set is finite and thus there exists some
minimal nonzero value.
This property does not carry over to infinite MDPs.
Here the set of states is infinite and the infimum over the nonzero values can be zero.
As a consequence, even for a reachability objective, it is possible that all states have
value >0, but still the value of some states is <1. Such phenomena appear
already in infinite-state Markov chains like the classic Gambler’s ruin problem with unfair
coin tosses in the player’s favor (0.6 win, 0.4 lose). The value,
i.e., the probability of ruin, is always >0, but still <1 in every state
except the ruin state itself; cf. [15] (Chapt. 14).
Another difference is that optimal strategies need not exist, even for
qualitative objectives like reachability or parity. Even if some state
has value 1, there might not be any single strategy that attains the value 1, but
only an infinite family of ϵ-optimal strategies for every ϵ>0.
Parity objectives:
We study general countably infinite MDPs with parity objectives.
Parity conditions are widely used in temporal
logic and formal verification, e.g., they can express ω-regular
languages and modal μ-calculus [16].
Every state has a color, out of a finite set of colors encoded as natural
numbers. An infinite play is winning iff the highest color that is seen
infinitely often in the play is even. The controller wants to maximize
the probability of winning plays.
Subclasses of parity objectives are defined by restricting the set of used
colors; these are classified in the Mostowski hierarchy [20]
which includes, e.g., Büchi and co-Büchi objectives.
Such prefix-independent infinitary objectives cannot generally be encoded by numeric
transition rewards as in [23], though both types subsume
the simpler reachability and safety objectives.
There are different types of strategies, depending on whether one can take
the whole history of the play into account (history-dependent; (H)),
or whether one is limited to a finite amount of memory (finite memory; (F))
or whether decisions are based only on the current state (memoryless; (M)).
Moreover, the strategy type depends on whether the controller can
randomize (R) or is limited to deterministic choices (D).
The simplest type MD refers to memoryless deterministic strategies.
The type of strategy needed for an optimal (resp. ϵ-optimal)
strategy for some objective is also called the strategy complexity of the objective.
For finite MDPs, MD-strategies are sufficient for all types of qualitative and
quantitative parity objectives [9, 11], but the picture is
more complex for countably infinite MDPs.
Since optimal strategies need not exist in general, we consider both the
strategy complexity of ϵ-optimal strategies, and
the strategy complexity of optimal strategies under the assumption that they
exist.
E.g., if an optimal strategy exists, can it be chosen MD?
We provide a complete picture of the memory requirements for objectives in the
Mostowski hierarchy, which is summarized in Figure 1.
In particular, our results show that there is a strong dichotomy between two different classes of
objectives. For objectives of the first class, optimal strategies,
where they exist, can be chosen MD.
For objectives of the second class, optimal strategies require infinite memory
in general, in the sense that all FR-strategies achieve the objective
only with probability zero.
A similar dichotomy applies to ϵ-optimal strategies.
For certain objectives, ϵ-optimal MD-strategies exist, while
for all others even ϵ-optimal strategies require infinite memory
in general.
This is a strong dichotomy because there are no objectives in the Mostowski
hierarchy for which other types of strategies (MR, FD, or FR) are both
necessary and sufficient.
Put differently, for all objectives in the Mostowski hierarchy, if FR-strategies suffice then MD-strategies suffice as well.
We also consider the subclass of countable MDPs that are finitely branching.
(Note that these generally still have an infinite number of states.)
The above mentioned dichotomies apply here as well, though the classes of
objectives where optimal (resp. ϵ-optimal) strategies can be chosen MD
are larger than for general countable MDPs.
Outline of the results:
In Section II we define countably infinite MDPs, strategies and parity
objectives.
In Section III we show examples that demonstrate that certain objectives
require infinite memory. For some of these we refer to previous work.
The main new result in this section is Theorem 1 that shows
that even almost-sure {1,2,3}-Parity on finitely branching MDPs requires
infinite memory.
These negative results highlight the questions which other objectives still
allow MD-strategies. Apart from the case of reachability objectives,
these questions were open. We provide complete answers in several steps.
First, in Section IV, we prove a general result
(Theorem 5)
that relates the strategy complexity of almost-sure winning strategies and
optimal strategies. The complexity of the proof is due to the fact that
we consider infinite MDPs (which do not satisfy basic properties of
finite MDPs in general; see above). We then use this theorem to establish
MD-strategies for Büchi, co-Büchi and {0,1,2}-Parity objectives in
the following sections.
In Section V we show that optimal strategies for
Büchi objectives, where they exist, can be chosen MD,
even for infinitely branching MDPs.
In Section VI we consider finitely branching MDPs.
We show that optimal strategies for {0,1,2}-Parity, where they exist,
can be chosen MD (Theorem 16). This is a very general result. E.g., this question
had been open (and is non-trivial) even for almost-sure co-Büchi objectives.
Moreover, we show that ϵ-optimal strategies for co-Büchi objectives
can be chosen MD (Theorem 19).
We conclude the paper with a discussion of how some results change when
one considers uncountable MDPs.
II Preliminaries
A probability distribution over a countable (not necessarily finite) set S is a function
f:S→[0,1] s.t. ∑s∈Sf(s)=1.
We use supp(f)={s∈S∣f(s)>0} to denote the support of f.
Let D(S) be the set of all probability distributions over S.
We consider countably infinite Markov decision processes (MDPs) M=(S,S□,S◯,⟶,P) where
the countable set S of states is partitioned into
the set S□ of states of the player and
random states S◯.
The relation ⟶⊆S×S is the transition relation.
We write s⟶s′ if (s,s′)∈⟶, and we assume that each state s has a successor state s′ with s⟶s′.
The probability function P:S◯→D(S) assigns to each random
state s∈S◯ a probability distribution over its
successor states. A set T⊆S is a sink in M if for all s∈T all successors of s are in T.
The MDP M is called finitely branching if each state has only finitely many successors;
otherwise, it is infinitely branching.
A Markov chain is an MDP where S□=∅, i.e., all states are random states.
We describe the behavior of an MDP as a one-player stochastic game played
for infinitely many rounds. The game
starts in a given initial state s0. In each round, if the game is in state s∈S□
then the player (or controller) chooses a successor state s′ with s⟶s′;
otherwise the game is in a random state s∈S◯ and proceeds randomly to s′ with probability P(s)(s′).
Strategies.
A play w is an infinite sequence
s0s1⋯ of states
such that si⟶si+1 for all i≥0;
let w(i)=si denote the i-th state along w.
A partial play is a finite prefix of a play.
We say that (partial) play w visits s if
s=w(i) for some i, and that w starts in s if s=w(0).
A strategy is a function σ:S∗S□→D(S) that assigns to partial plays
ws∈S∗S□
a distribution over the successors {s′∈S∣s⟶s′}.
The set of all strategies in
M is denoted by ΣM (we omit the subscript and write Σ if M is clear).
A (partial) play s0s1⋯ is induced by strategy σ
if si+1∈supp(σ(s0s1⋯si)) for all i with si∈S□, and
si+1∈supp(P(si)) for all i with si∈S◯.
Since this paper focuses on the memory requirements of strategies,
we present an equivalent formulation of strategies, emphasizing
the amount of memory required to implement a strategy.
Strategies can
be implemented by probabilistic transducers T=(M,m0,πu,πs)
where M is a countable set (the memory of the strategy),
m0∈M is the initial memory mode
and S is the input and output alphabet.
The probabilistic transition function πu:M×S→D(M) updates the
memory mode of transducer.
The probabilistic successor function πs:M×S□→D(S)
outputs the next successor, where s′∈supp(πs(m,s)) implies
s⟶s′.
We extend πu to D(M)×S→D(M) and
πs to D(M)×S□→D(S), in the natural way.
Moreover, we extend πu
to paths by
πu(m,ε)=m and
πu(m,s0⋯sn)=πu(πu(s0⋯sn−1,m),sn).
The strategy σT:S∗S□→D(S)
induced by the transducer T is given by
σT(s0⋯sn):=πs(sn,πu(s0⋯sn−1,m0)).
Note that such strategies allow for randomized memory updates and probabilistic successor functions.
Strategies are in general history dependent (H) and randomized (R).
An H-strategy σ is finite memory (F) if there exists some transducer T with memory M
such that σT=σ and ∣M∣<∞; otherwise we say σ requires infinite memory.
An F-strategy is memoryless (M) (also called positional)
if ∣M∣=1.
We may view
M-strategies as functions σ:S□→D(S).
An R-strategy σ is deterministic (D)
if πu and πs map to Dirac distributions;
it implies that σ(w) is a Dirac distribution for all partial plays w.
All combinations of the properties in {M,F,H}×{D,R} are possible, e.g., MD stands for
memoryless deterministic. HR strategies are the most general type.
Probability Measures.
An MDP M=(S,S□,S◯,⟶,P), an initial state s0, and a strategy σ induce a standard probability measure on sets of infinite plays.
We write PM,s0,σ(R) for the probability of a measurable set R⊆s0Sω of plays starting from s0.
It is defined, as usual, by first defining it on the cylinders s0s1…snSω, where s1,…,sn∈S:
if s0s1…sn is not a partial play induced by σ then set PM,s0,σ(s0s1…snSω)=0;
otherwise set PM,s0,σ(s0s1…snSω)=∏i=0n−1σˉ(s0s1…si)(si+1), where σˉ is the map that extends σ by σˉ(ws)=P(s) for any ws∈S∗S◯.
Using Carathéodory’s extension theorem [5], this defines a unique probability measure PM,s0,σ on measurable subsets of s0Sω.
Objectives.
Let M=(S,S□,S◯,⟶,P) be an MDP.
The objective of the player is determined by a predicate on
infinite plays.
We assume familiarity with the syntax and semantics of the temporal
logic LTL [12].
Formulas are interpreted on the structure (S,⟶).
We use
[[φ]]s⊆sSω to denote the set of plays starting from
s that satisfy the LTL formula φ.
This set is measurable [25],
and we just write
PM,s,σ(φ)
instead of
PM,s,σ([[φ]]s).
We also write [[φ]] for ⋃s∈S[[φ]]s.
Given a target set T⊆S, the reachability objective is defined by Reach(T)=[[FT]], i.e., s0s1⋯∈Reach(T)⇔∃i.si∈T.
The safety objective is defined by Safety(T)=[[G¬T]], i.e., s0s1⋯∈Safety(T)⇔∀i.si∈T.
Given a reachability or a safety objective, we can assume without loss of generality that T is a sink in M.
Let C⊆N be a finite set of colors.
A color function Col:S→C assigns to each state s its color Col(s).
For n∈N, ⊳∈{<,≤,=,≥,>} and Q⊆S, let [Q]Col⊳n:={s∈Q∣Col(s)⊳n}
be the set of states in Q with color ⊳n.
The parity objective is defined by
[TABLE]
i.e., Parity(Col) is the set of infinite plays such that the largest color that occurs infinitely often along the play is even.
The Mostowski hierarchy [20] classifies parity objectives
by restricting the range of the function Col to a set of colors C⊆N.
We write C-Parity for such restricted parity objectives.
In particular, Büchi objectives correspond to
{1,2}-Parity, and co-Büchi objectives correspond to
{0,1}-Parity.
The objectives {0,1,2}-Parity and {1,2,3}-Parity are incomparable,
but they both subsume (modulo renaming of colors) Büchi and co-Büchi objectives.
Moreover, both {0,1}-Parity and {1,2}-Parity subsume the reachability objective Reach(T) (for MDPs with a sink T), by defining the color function so that Col(s)=1⇔s∈T.
Similarly, both {0,1}-Parity and {1,2}-Parity subsume Safety(T), by defining Col(s)=1⇔s∈T.
Optimal and ϵ-Optimal Strategies.
Given an objective φ,
the value of state s in an MDP M,
denoted by valM(s), is
the supremum probability of achieving φ, i.e.,
valM(s):=supσ∈ΣPM,s,σ(φ).
For ϵ≥0 and s∈S, we say that
a strategy σ is ϵ-optimal iff
PM,s,σ(φ)≥valM(s)−ϵ.
A [math]-optimal strategy is called optimal.
An optimal strategy is almost-surely winning if valM(s)=1.
Unlike in finite-state MDPs, optimal strategies need not exist
in countable MDPs, not even for reachability objectives in finitely branching MDPs.
However, by the definition of the value, for all
ϵ>0, an ϵ-optimal strategy exists.
For an objective φ and ⊳∈{≥,>} and c∈[0,1],
we define \big{[}{\varphi}\big{]}^{{\rhd c}} as the set of states s for which there exists a strategy σ with PM,s,σ(φ)⊳c.
We call a state s almost-surely winning if s\in\big{[}{\varphi}\big{]}^{{\geq 1}},
and we call s limit-surely winning if s\in\big{[}{\varphi}\big{]}^{{\geq c}} for every constant c<1 (which is iff valM(s)=1).
On infinite arenas, limit-surely winning states are not necessarily almost-surely winning.
III Objectives that require infinite memory
In this section we consider those objectives in the Mostowski hierarchy
where optimal (resp., ϵ-optimal) strategies require infinite memory.
In each such case we construct an MDP that witnesses this requirement.
In these MDPs, all FR-strategies achieve the objective only with
probability [math], while some HD-strategy achieves the objective
almost-surely (resp., with arbitrarily high probability).
Theorem 1**.**
Let φ={1,2,3}-Parity. There exists a finitely branching MDP M with initial state s0 such that
for all FR-strategies σ, we have PM,s0,σ(φ)=0,
there exists an HD-strategy σ such that PM,s0,σ(φ)=1.
Hence, optimal (and even almost-surely winning) and ϵ-optimal strategies require infinite memory for {1,2,3}-Parity,
even in finitely branching MDPs.
The MDP in Theorem 1 is depicted
in Figure 2 (left),
where Col(si)=1 and Col(ri)=2 for all i∈N, and Col(t)=3.
For every FR-strategy there is a uniform lower bound on the probability of
visiting t between consecutive visits to s0.
Hence, unless the strategy with positive probability eventually
always stays in states si (and
thus also loses the almost-sure parity objective),
in the long-run, the probability of visiting t (with color three) tends to 1,
and the parity condition is satisfied with probability [math].
Although the player cannot win by any FR-strategy,
we construct an HD-strategy σ such that PM,s0,σ(φ)=1.
This strategy is such that
upon the ith visit to s0,
the ladder s0s1⋯si is traversed and the
transition si⟶ri is chosen.
Moving further along the ladder s0s1s2⋯
decreases the probability of visiting t between the previous and successive visits to s0.
Hence, the probability of visiting
color three infinitely often is [math].
Remark 1**.**
A strict subclass of finitely branching MDPs are 1-counter MDPs,
where a finite-state MDP is augmented with an integer counter
[6].
The MDP in Theorem 1 (plus some auxiliary states)
is implementable by a 1-counter MDP.
Remark 2**.**
The classical Rabin and Streett conditions can encode {1,2,3}-Parity.
Thus, optimal and ϵ-optimal strategies for Rabin/Streett require infinite memory,
even in finitely branching countable MDPs.
On finite MDPs, optimal strategies can be chosen MD for parity and Rabin
objectives, but not for Streett objectives.
Optimal strategies for Streett objectives can be chosen MR or FD [9].
Proof.
For an infinite play π∞, let Inf(π∞) be the set of states that π∞ visits infinitely often.
Let us recall the Rabin and Streett conditions.
Given a Rabin condition {(E1,F1),(E2,F2),⋯,(En,Fn)} with n pairs (or n disjunctions),
an infinite play π∞ satisfies the Rabin condition if there exists a pair (Ei,Fi) such that
Inf(π∞)∩Ei=∅ and Inf(π∞)∩Fi=∅.
The Rabin condition
[TABLE]
encodes {1,2,3}-Parity, since
all satisfying runs must visit states with color 2 infinitely often
and states with color 3 only finitely often.
Note that {1,2,3}-Parity is encoded in a Rabin condition with only one disjunction.
Given a Streett condition {(E1,F1),(E2,F2),⋯,(En,Fn)} with n pairs (or n conjunctions),
an infinite play π∞ satisfies the Streett condition if
Inf(π∞)∩Ei=∅ implies Inf(π∞)∩Fi=∅ for all pairs (Ei,Fi).
The Streett condition
[TABLE]
encodes {1,2,3}-Parity, since
all satisfying runs must visit states with color 2 infinitely often
and states with color 3 only finitely often.
Note that a conjunction of two Streett pairs are needed to encode
{1,2,3}-Parity. A single Streett pair {(X,Y)} means
“infinitely often X or only finitely often Y”, which can be encoded
as a {0,1,2}-Parity condition by assigning color 2 to X and color
1 to Y. Unlike for {1,2,3}-Parity, optimal strategies for
{0,1,2}-Parity (and thus also for a single Streett pair)
can be chosen MD in finitely branching MDPs
(Theorem 16).
∎
It was known that quantitative Büchi objectives require infinite memory [18, 2].
For the sake of completeness, we present an example MDP for Proposition 2
in Figure 2 (right).
Proposition 2** ([18]).**
Let φ={1,2}-Parity be the Büchi objective.
There exists a finitely branching MDP M with initial state s0 such that
for all FR-strategies σ, we have PM,s0,σ(φ)=0,
for every c∈[0,1), there exists an HD-strategy σ such that PM,s0,σ(φ)≥c.
*Hence, ϵ-optimal strategies for Büchi objectives require infinite memory.
*
Theorem 3**.**
Let φ=Safety(T).
There exists an infinitely branching MDP M with initial state s such that
for all FR-strategies σ, we have PM,s,σ(φ)=0,
for every c∈[0,1), there exists an HD-strategy σ such that PM,s,σ(φ)≥c.
Hence, ϵ-optimal strategies for safety require infinite memory.
The MDP in Theorem 3, depicted in
Figure 3 (left),
was first introduced in [19].
Since our notion of finite-memory strategies allows for randomized memory updates (in contrast to [19]),
our proof is somewhat more general.
The target is T={t}.
For every FR-strategy there is a uniform lower bound on the probability of
reaching t between consecutive visits to s0. Since t is absorbing, it
will be reached with probability 1. Thus every FR-strategy satisfies the
safety objective with probability [math].
However, for all n∈N,
we construct
an HD-strategy σn such
that PM,s,σn(Safety({t}))≥1−2n1.
This strategy is such that upon the ith visit to s, the
transition s⟶ri+n is chosen.
Hence, the probability of visiting t between two successive visits to s decreases.
A more detailed analysis shows that the probability of ever visiting t
is bounded by 2n1.
Theorem 4**.**
Let φ={0,1}-Parity be the co-Büchi objective.
There exists an infinitely branching MDP M with initial state s such that
for all FR-strategies σ, we have PM,s,σ(φ)=0,
there exists an HD-strategy σ such that PM,s,σ(φ)=1.
Hence, optimal (and even almost-surely winning) strategies and ϵ-optimal strategies for co-Büchi require infinite memory.
The MDP in Theorem 4 is depicted in
Figure 3 (right).
By a similar argument as in Theorem 3,
every FR-strategy achieves co-Büchi with probability [math].
However, the HD-strategy σ that chooses the
transition s⟶ri upon the ith visit to s
is such that PM,s,σ(φ)=1.
IV From almost-sure winning to optimal strategies
In this section we prove Theorem 5.
It says that, for certain objectives, if
almost-surely winning strategies (where they exist) can be chosen MD,
then optimal strategies (where they exist) can also be chosen MD.
We call a class C of MDPs downward-closed if every MDP whose transition relation is a subset of the transition relation of some MDP in C is also in C.
The class of finitely branching MDPs is downward-closed, and so is the class of MDPs with a fixed sink T.
We call an objective φ prefix-independent in C (where C is a class of MDPs) if for all w1,w2∈S∗ and all w∈Sω such that w1w and w2w are infinite plays in an MDP in C we have w1w∈[[φ]]⟺w2w∈[[φ]].
Parity objectives are prefix-independent in the class of all MDPs.
Both objectives Reach(T) and Safety(T) are prefix-independent in the class of MDPs with sink T.
The following theorem provides, under certain conditions, an optimal MD-strategy for all states that have an optimal strategy.
In fact, a single MD-strategy is optimal for all states that have an optimal strategy:
Theorem 5**.**
Let φ be an objective that is prefix-independent in a downward-closed class C of MDPs.
Suppose that for any M=(S,S□,S◯,⟶,P)∈C and any s∈S and any strategy σ with PM,s,σ(φ)=1 there exists an MD-strategy σ′ with PM,s,σ′(φ)=1.
Under this condition, for each M∈C
there is an MD-strategy σ′ such that for all s∈S:
[TABLE]
The remainder of the section is devoted to the proof of Theorem 5.
For prefix-independent winning conditions, whenever an optimal strategy visits some state, it achieves the value of this state; see Lemma 20 in the appendix.
We use this to show that the MDP constructed in the following lemma is well-defined.
This MDP, M∗, will be crucial for the proof of Theorem 5.
Loosely speaking, M∗ is the MDP M conditioned under φ.
Lemma 6**.**
Let φ be an objective that is prefix-independent in a class C of MDPs.
Let M=(S,S□,S◯,⟶,P)∈C. Construct an MDP M∗=(S∗,S∗□,S∗◯,⟶∗,P∗) by setting
[TABLE]
and S∗□=S∗∩S□ and S∗◯=S∗∩S◯ and
[TABLE]
and P∗:S∗◯→D(S∗) so that
[TABLE]
for all s∈S∗◯ and t∈S∗ with s⟶∗t. Then:
-
For all σ∈ΣM∗ and all n≥0 and all s0,…,sn∈S∗ with s0⟶∗s1⟶∗⋯⟶∗sn:
[TABLE]
2. 2.
For all s0∈S∗ and all σ∈ΣM with PM,s0,σ(φ)=valM(s0)>0 and all measurable R⊆s0Sω we have
PM∗,s0,σ(R)=PM,s0,σ(R∣[[φ]]s0).
The following lemma provides, under certain conditions, a uniform almost-surely winning MD-strategy, i.e., one that works for all initial states at the same time:
Lemma 7**.**
Let M=(S,S□,S◯,⟶,P) be an MDP.
Let φ be an objective that is prefix-independent in {M}.
Suppose that for any s∈S and any strategy σ with PM,s,σ(φ)=1 there exists an MD-strategy σ′ with PM,s,σ′(φ)=1.
Then there is an MD-strategy σ′ such that for all s∈S:
[TABLE]
Proof.
We can assume that all states are almost-surely winning, since in order to achieve an almost-sure winning objective, the player must forever remain in almost-surely winning states.
So we need to define an MD-strategy σ′ so that for all s∈S we have PM,s,σ′(φ)=1.
Fix an arbitrary state s1∈S.
By assumption there is an MD-strategy σ1 with PM,s1,σ1(φ)=1.
Let U1⊆S be the set of states that occur in plays that both start from s1 and are induced by σ1.
We have PM,s1,σ1([[φ]]s1∩U1ω)=1.
In fact, for any s∈U1 and any strategy σ that agrees with σ1 on U1 we have PM,s,σ([[φ]]s∩U1ω)=1.
If U1=S we are done.
Otherwise, consider the MDP M1 obtained from M by fixing σ1 on U1 (i.e., in M1 we can view the states in U1 as random states).
We argue that, in M1, for any state s there is an MD-strategy σ1′ with PM1,s,σ1′(φ)=1.
Indeed, let s∈S be any state.
Recall that there is an MD-strategy σ with PM,s,σ(φ)=1.
Let σ1′ be the MD-strategy obtained by restricting σ to the non-U1 states (recall that the U1 states are random states in M1).
This strategy σ1′ almost surely generates a play that either satisfies φ without ever entering U1 or at some point enters U1.
In the latter case, φ is satisfied almost surely: this follows from prefix-independence and the fact that σ1′ agrees with σ1 on U1.
We conclude that PM1,s,σ1′(φ)=1.
Let s2∈S∖U1.
We repeat the argument from above, with s2 instead of s1, and with M1 instead of M.
This yields an MD-strategy σ2 and a set U2∋s2 with PM1,s2,σ2([[φ]]s2∩U2ω)=1.
In fact, for any s∈U2 and any strategy σ that agrees with σ2 on U2 and with σ1 on U1 we have PM,s,σ([[φ]]s∩U2ω)=1.
If U1∪U2=S we are done.
Otherwise we continue in the same manner, and so forth.
Since S is countable, we can pick s1,s2,… to have ⋃i≥1Ui=S.
Define an MD-strategy σ′ such that for any s∈S□ we have σ′(s)=σi(s) for the smallest i with s∈Ui.
Thus, if s∈Ui, we have PM,s,σ′(φ)≥PM,s,σ′([[φ]]s∩Uiω)=1.
∎
The following measure-theoretic lemma will be used to connect probability measures induced by the MDPs M and M∗ from Lemma 6.
Lemma 8**.**
Let S be countable and s∈S.
Call a set of the form swSω for w∈S∗ a cylinder.
Let P,P′ be probability measures on sSω defined in the standard way, i.e., first on cylinders and then extended to all measurable sets R⊆sSω.
Suppose there is x≥0 such that x⋅P(C)≤P′(C) for all cylinders C.
Then x⋅P(R)≤P′(R) holds for all measurable R⊆sSω.
We are ready to prove Theorem 5.
Proof of Theorem 5.
As in the statement of the theorem, suppose that φ is an objective that is prefix-independent in a downward-closed class C of MDPs so that for any M=(S,S□,S◯,⟶,P)∈C and any s∈S and any strategy σ with PM,s,σ(φ)=1 there exists an MD-strategy σ′ with PM,s,σ′(φ)=1.
Let M=(S,S□,S◯,⟶,P)∈C. Let M∗=(S∗,S∗□,S∗◯,⟶∗,P∗) be the MDP defined in Lemma 6.
Since C is downward-closed, we have M∗∈C.
In particular, φ is prefix-independent in {M∗}.
First we show that for any s∈S∗ there exists an MD-strategy σ′ with PM∗,s,σ′(φ)=1.
Indeed, let s∈S∗.
By the definition of S∗, there is a strategy σ with PM,s,σ(φ)=valM(s)>0.
By Lemma 6.2, we have PM∗,s,σ(φ)=1.
By our assumption on C there exists an MD-strategy σ′ with PM∗,s,σ′(φ)=1.
By Lemma 7, it follows that there is an MD-strategy σ′ with PM∗,s,σ′(φ)=1 for all s∈S∗.
We show that this strategy σ′ satisfies the property claimed in the statement of the theorem.
To this end, let n≥0 and s0,s1,…,sn∈S.
If s0s1⋯sn is a partial play in M∗ then, by Lemma 6.1,
[TABLE]
If s0s1⋯sn is not a partial play in M∗ then PM∗,s0,σ′(s0s1⋯snSω)=0 and the previous inequality holds as well.
Therefore, by Lemma 8, we get for all measurable sets R⊆s0Sω:
[TABLE]
In particular, since PM∗,s0,σ′(φ)=1, we obtain valM(s0)≤PM,s0,σ′(φ).
The converse inequality PM,s0,σ′(φ)≤valM(s0) holds by the definition of valM(s0), hence we conclude PM,s0,σ′(φ)=valM(s0).
∎
V When MD-strategies suffice in general countable MDPs
Ornstein [21] shows that ϵ-optimal and optimal strategies for reachability can be chosen MD:
Theorem 9** (from Theorem B in [21]).**
For every countable MDP M there exist uniform ϵ-optimal MD-strategies for
reachability objectives φ=Reach(T), i.e.,
for every ϵ>0 there is an
MD-strategy σϵ such that for all s∈S we have
PM,s,σϵ(φ)≥valM(s)−ϵ.
Theorem 10** (follows from Proposition B in [21]).**
Let M=(S,S□,S◯,⟶,P) be an MDP, and φ=Reach(T). Let s0∈S and σ be a strategy with PM,s0,σ(φ)=1.
Then there is an MD-strategy σ^ with PM,s0,σ^(φ)=1.
Both theorems are due to [21]; we give an alternative proof of Theorem 10 in the appendix.
We generalize Theorem 10 to Büchi objectives, using the principle that Büchi is repeated reachability:
Proposition 11**.**
Let M=(S,S□,S◯,⟶,P) be an MDP, and s0∈S, and σ a strategy, and Col:S→{1,2}, and φ=Parity(Col).
Suppose PM,s0,σ(φ)=1.
Then there is an MD-strategy σ′ with PM,s0,σ′(φ)=1.
By appealing to Theorem 5 it follows:
Theorem 12**.**
Let M be an MDP, Col:S→{1,2}, and
φ=Parity(Col) be a Büchi-objective
(subsuming reachability and safety).
Then there exists an MD-strategy σ′ that is optimal for all states
that have an optimal strategy:
[TABLE]
VI When MD-strategies suffice in finitely branching MDPs
In this section we prove that optimal strategies for {0,1,2}-Parity, where they exist,
can be chosen MD (Theorem 16) and that ϵ-optimal strategies for co-Büchi objectives
can be chosen MD (Theorem 19).
To prepare the ground for these results, we first consider safety objectives.
VI-A Optimal MD-strategies for Safety
The following proposition asserts in particular that for safety in finitely branching MDPs, there is no need for merely ϵ-optimal strategies, as there always exists an optimal MD-strategy.
Proposition 13** (from Theorem 7.3.6(a) in [23]).**
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP, and T⊆S, and φ=Safety(T).
Define an MD-strategy σopt\mathchar45av (for “optimal avoiding”) that, in each state s, picks a successor state with the largest value valM(s)=supσ∈ΣPM,s,σ(φ).
Then for all states s∈S we have
PM,s,σopt\mathchar45av(φ)=valM(s), i.e.,
σopt\mathchar45av is uniformly optimal.
Note that, for infinitely branching MDPs, this definition of σopt\mathchar45av would be
unsound, as “the largest value” might not exist.
Definition 1**.**
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP,
Col:S→N a color function,
φ=Safety([S]Col=0),
σopt\mathchar45av the strategy from Proposition 13
and τ∈[0,1]. We define
[TABLE]
i.e., SafeM(τ) is the set of states from which the player can remain within color-0 states forever with probability ≥τ.
We drop the subscript M when the MDP M is understood.
Loosely speaking, the following lemma gives a lower bound on the probability that, starting from a “safe” state, “unsafe” states are forever avoided by σopt\mathchar45av:
Lemma 14**.**
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP, Col:S→N
a color function and
σopt\mathchar45av the strategy from Proposition 13.
Let 0<τ1≤τ2≤1, and s∈Safe(τ2).
Then PM,s,σopt\mathchar45av(GSafe(τ1))≥1−τ1τ2−τ1.
Proof.
We compute probabilities conditioned under the event GSafe(τ1).
Since Safe(τ1)⊆[S]Col=0, we have PM,s,σopt\mathchar45av(G[S]Col=0∣GSafe(τ1))=1.
From the definition of Safe(τ1) and the Markov property we get PM,s,σopt\mathchar45av(G[S]Col=0∣¬GSafe(τ1))≤τ1.
Applying the law of total probability and writing x for PM,s,σopt\mathchar45av(GSafe(τ1)) we obtain:
[TABLE]
It follows x≥1−τ1τ2−τ1.
∎
The following lemma states for all τ<1 that eventually remaining in color-0 states but outside Safe(τ) has probability zero.
Lemma 15**.**
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP, and Col:S→N a color function.
Let s be a state, and σ a strategy, and τ<1.
Then PM,s,σ(FG¬Safe(τ)∧FG[S]Col=0)=0.
VI-B Optimal MD-strategies for {0,1,2}-Parity
Theorem 16**.**
Let M be a finitely branching MDP, Col:S→{0,1,2}, and
φ=Parity(Col).
Then there exists an MD-strategy σ′ that is optimal for all states
that have an optimal strategy:
[TABLE]
By appealing to Theorem 5 it suffices to show:
Proposition 17**.**
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP, and s0∈S, and σ a strategy, and Col:S→{0,1,2}, and φ=Parity(Col).
Suppose PM,s0,σ(φ)=1.
Then there is an MD-strategy σ′ with PM,s0,σ′(φ)=1.
The following simple lemma provides a scheme for proving almost-sure properties.
Lemma 18**.**
Let P be a probability measure over the sample space Ω.
Let (Ri)i∈I be a countable partition of Ω in measurable events.
Let E⊆Ω be a measurable event.
Suppose P(Ri∩E)=P(Ri) holds for all i∈I.
Then P(E)=1.
We are ready to prove Proposition 17.
Proof of Proposition 17.
To achieve an almost-sure winning objective, the player must forever remain in states from which the objective can be achieved almost surely.
So we can assume without loss of generality that all states are almost-sure winning, i.e., for all s∈S we have PM,s,σ(φ)=1 for some σ.
We will define an MD-strategy σ′ with PM,s,σ′(φ)=1 for all s∈S.
We first define the MD-strategy σ′ partially for the states in SafeM(31) and then extend the definition of σ′ to all states.
For the states in SafeM(31) define σ′:=σopt\mathchar45av as in Proposition 13, see Figure 4.
Let M′ be the MDP obtained from M by restricting the transition relation as prescribed by the partial MD-strategy σ′.
For any τ∈[0,1], we have SafeM(τ)=SafeM′(τ).
Indeed, since M′ restricts the options of the player, we have SafeM(τ)⊇SafeM′(τ).
Conversely, let s∈SafeM(τ).
The strategy σopt\mathchar45av from Proposition 13 achieves PM,s,σopt\mathchar45av(G[S]Col=0)≥τ.
Since σopt\mathchar45av can be applied in M′, and results in the same Markov chain
as applying it in M, we conclude s∈SafeM′(τ).
This justifies to write Safe(τ) for SafeM(τ)=SafeM′(τ) in the remainder of the proof.
Next we show that, also in M′, for all states s∈S there exists a strategy σ1 with PM′,s,σ1(φ)=1.
This strategy σ1 is defined as follows.
First play according to a strategy σ from the statement of the theorem.
If and when the play visits Safe(31), switch to the MD-strategy σopt\mathchar45av from Proposition 13.
If and when the play then visits [S]Col=0, switch back to a strategy σ from the statement of the theorem, and so forth.
We show that σ1 achieves PM′,s,σ1(φ)=1.
To this end we will use Lemma 18.
We partition the runs of sSω in three events R0,R1,R2 as follows:
R0 contains the runs where σ1 switches between σopt\mathchar45av and σ infinitely often.
R1 contains the runs where σ1 eventually only plays according to σopt\mathchar45av.
R2 contains the runs where σ1 eventually only plays according to σ.
Each time σ1 switches to σopt\mathchar45av, there is, by Proposition 13,
a probability of at least 31 of never visiting a color-{1,2} state again and thus of never again switching to σ.
It follows that PM′,s,σ1(R0)=0.
By the definition of σopt\mathchar45av we have R1⊆[[FG[S]Col=0]]⊆[[φ]], and hence PM′,s,σ1(R1∩[[φ]])=PM′,s,σ1(R1).
Since PM,s,σ(φ)=1 and φ is prefix-independent, we have
PM′,s,σ1(R2∩[[φ]])=PM′,s,σ1(R2).
Using Lemma 18, we obtain PM′,s,σ1(φ)=1.
Next we show that for all s∈S the strategy σ1 defined above achieves PM′,s,σ1(FSafe(32)∨F[S]Col=2)=1.
To this end we will use Lemma 18 again.
We partition the runs of sSω into three events R1′,R2′,R0′ as follows:
R1′=[[FG[S]Col=0]]s
R2′=[[GF[S]Col=2]]s
R0′=sSω∖[[φ]]s
We have previously shown that PM′,s,σ1(φ)=1, hence PM′,s,σ1(R0′)=0.
By Lemma 15, almost all runs in R1′ satisfy GFSafe(32).
Since [[GFSafe(32)]]⊆[[FSafe(32)]], we have PM′,s,σ1(R1′∩[[FSafe(32)∨F[S]Col=2]])=PM′,s,σ1(R1′).
Since R2′⊆[[F[S]Col=2]], we also have PM′,s,σ1(R2′∩[[FSafe(32)∨F[S]Col=2]])=PM′,s,σ1(R2′).
Using Lemma 18 we obtain PM′,s,σ1(FSafe(32)∨F[S]Col=2)=1.
Writing T=Safe(32)∪[S]Col=2 we have just shown that for all s∈S there is a strategy σ1 with PM′,s,σ1(FT)=1.
By Lemma 7 there is an MD-strategy σ^ for M′ with PM′,s,σ^(FT)=1 for all s∈S.
We extend the (so far partially defined) strategy σ′ by σ^.
Thus we obtain a (fully defined) strategy σ′ for M such that for all s∈S we have PM,s,σ′(FT)=1.
It remains to show that for all s∈S we have PM,s,σ′(φ)=1.
To this end we will use Lemma 18 again.
We partition the runs of sSω in two events R1′′,R2′′:
R1′′=[[GFSafe(32)]]s, i.e., R1′′ contains the runs that visit Safe(32) infinitely often.
R2′′=[[FG¬Safe(32)]]s, i.e., R2′′ contains the runs that from some point on never visit Safe(32).
Every time a run enters Safe(32), by Lemma 14, the probability is at least 21 that the run remains in Safe(31) forever.
It follows that almost all runs in R1′′ eventually remain in Safe(31) forever, i.e., PM,s,σ′(R1′′∩[[FGSafe(31)]])=PM,s,σ′(R1′′).
Since Safe(31)⊆[S]Col=0, we have [[FGSafe(31)]]⊆[[FG[S]Col=0]]⊆[[φ]].
Hence also PM,s,σ′(R1′′∩[[φ]])=PM,s,σ′(R1′′).
We have previously shown that PM,s,σ′(FT)=1 holds for all s∈S.
Hence also PM,s,σ′(GFT)=1 holds for all s∈S.
In particular, almost all runs in R2′′ satisfy GFT.
By comparing the definitions of R2′′ and T we see that almost all runs in R2′′ even satisfy GF[S]Col=2.
Since [[GF[S]Col=2]]⊆[[φ]], we obtain PM,s,σ′(R2′′∩[[φ]])=PM,s,σ′(R2′′).
A final application of Lemma 18 yields
PM,s,σ′(φ)=1 for all s∈S.
∎
VI-C ϵ-Optimal MD-strategies for Co-Büchi
Theorem 19**.**
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP,
Col:S→{0,1}, and φ=Parity(Col) be the
co-Büchi objective. Then there exist uniform ϵ-optimal
MD-strategies.
I.e., for every ϵ>0 there is an MD-strategy σϵ
with PM,s0,σϵ(φ)≥valM(s0)−ϵ
for every s0∈S.
Proof.
Let ϵ1>0 be a suitably small number (to be determined later),
τ1:=1−ϵ1 and
SafeM(τ1) defined as in Definition 1.
Let σopt\mathchar45av be the MD-strategy
from Proposition 13.
From M we obtain a modified MDP M′ by fixing all player choices from states in
SafeM(τ1) according to σopt\mathchar45av.
We show that valM′(s0)≥valM(s0)−ϵ1.
By definition of the value valM(s0),
for every δ>0 there exists a strategy σδ in M
s.t. PM,s0,σδ(φ)≥valM(s0)−δ.
We define a strategy σδ′ in M′ from state s0 as
follows.
First play like σδ. If and when a state in SafeM(τ1) is
reached, play like σopt\mathchar45av. This is possible, since no moves
from states outside SafeM(τ1) have been fixed in M′,
and all moves from states inside SafeM(τ1) have been fixed
according to σopt\mathchar45av.
Then we have:
[TABLE]
Since this holds for every δ>0 we obtain
valM′(s0)≥valM(s0)−ϵ1.
Now let τ2:=1−ϵ1/k for a suitably large k≥1 (to be
determined later) and
SafeM′(τ2)
be defined as in Definition 1.
In particular, SafeM′(τ2)=SafeM(τ2)
(by the same argument as in the proof of Proposition 17).
By definition of the value, for every ϵ2>0 there exists a
strategy σϵ2 in M′ with
PM′,s0,σϵ2(φ)≥valM′(s0)−ϵ2.
Moreover, by Lemma 15 and τ2<1,
PM′,s0,σ(FSafeM′(τ2))≥PM′,s0,σ(φ) for every strategy σ and
thus in particular for σϵ2.
Therefore, PM′,s0,σϵ2(FSafeM′(τ2))≥valM′(s0)−ϵ2.
By Theorem 9, for every ϵ3>0
there exists an MD-strategy σ′
in M′ with
PM′,s0,σ′(FSafeM′(τ2))≥valM′(s0)−ϵ2−ϵ3.
In particular, σ′ must coincide with σopt\mathchar45av
at all states in SafeM(τ1), since in M′ these choices are
already fixed.
We obtain the MD-strategy σϵ in M by
combining the MD-strategies σ′ and σopt\mathchar45av.
The strategy σϵ plays like
σ′ at all states outside SafeM(τ1) and like
σopt\mathchar45av at all states inside SafeM(τ1).
In order to show that σϵ has the required property
PM,s0,σϵ(φ)≥valM(s0)−ϵ, we
first estimate the probability that a play according to σϵ will never
leave the set SafeM(τ1) after having visited a state in
SafeM′(τ2).
Let s∈SafeM′(τ2).
Then, by Lemma 14,
[TABLE]
In particular we also have
PM,s,σϵ(GSafe(τ1))≥1−k1,
since σϵ coincides with σopt\mathchar45av inside the set
SafeM(τ1).
Finally we obtain:
[TABLE]
This holds for every ϵ1,ϵ2,ϵ3>0 and
every k≥1, and moreover valM(s0)≤1.
Thus we can set ϵ1=ϵ2=ϵ3:=ϵ/6 and
k:=ϵ2
and obtain PM,s0,σϵ(φ)≥valM(s0)−ϵ for every s0∈S as required.
∎
VII Discussion
Our results on the memory requirements of (ϵ)-optimal strategies
(Figure 1) directly imply how much memory is needed to win
quantitative objectives of type \big{[}{\varphi}\big{]}^{{\rhd c}}
(considered, e.g., in [7]).
For c<1 the assumed winning strategy might have to be an ϵ-optimal
one, since optimal strategies do not always exist. Thus MD-strategies are only
sufficient for reachability objectives in countable MDPs (resp., for {0,1}-Parity, safety
and reachability objectives in finitely branching MDPs).
In the special case of \big{[}{\varphi}\big{]}^{{\geq 1}} objectives (i.e., winning
almost-surely), the winning strategy (assuming it exists) must be optimal.
Thus MD-strategies are only sufficient for safety, reachability and
{1,2}-Parity in countable MDPs
(resp., for all objectives subsumed by {0,1,2}-Parity in
finitely branching MDPs).
In this paper we have studied countable MDPs. Not all our results carry over
to uncountable MDPs. The first issue is measurability. The probabilities are
only well-defined if the strategies are measurable functions, which might not
exist without further conditions on the MDP; cf. Section 2.3
in [23]. Another issue is that strategies cannot
generally be chosen uniform, i.e., independent of the initial state.
E.g., in countable MDPs ϵ-optimal strategies for reachability
can be chosen uniform MD (Theorem 9), but this
does not carry over to uncountable MDPs (Thm. A
in [21]). However, optimal strategies for reachability,
if they exist, can be chosen uniform MD (Proposition B
in [21]).
Acknowledgements.
This work was partially supported by the EPSRC
through grants EP/M027287/1, EP/M027651/1, EP/P020909/1 and EP/M003795/1
and by St. John’s College, Oxford.
-A Proofs of Section III
We recall two results that are used throughout the proofs in this section:
Strong fairness of probabilistic choices in Markov chains.
Given a Markov chain, let Pq(⋅) denotes
the probability of events starting in state q of the chain.
Let p,q be two states and π a finite path
starting in p with strictly positive probability.
Strong fairness of probabilistic choices states that Pq(GFπ)=Pq(GFp).
Intuitively, it means that under the condition that state p is visited infinitely often,
any finite path starting in p will be taken infinitely often, almost surely [3, Theorem 10.25].
The Borel-Cantelli lemma.
Suppose that (En)n∈N is a sequence of events in a probability space.
Denote by E∞ the event
[TABLE]
that
intuitively is the event “En occurs for infinitely many n”.
The Borel-Cantelli lemma asserts that if
∑n=1∞P(En)<∞ then P(E∞)=0.
Informally speaking, if the sum of probabilities of the events En is bounded then
the probability that infinitely many of them occur is zero [5].
Theorem 1.
*
Let φ={1,2,3}-Parity. There exists a finitely branching MDP M with initial state s0 such that*
for all FR-strategies σ, we have PM,s0,σ(φ)=0,
there exists an HD-strategy σ such that PM,s0,σ(φ)=1.
*Hence, optimal (and even almost-surely winning) and ϵ-optimal strategies require infinite memory for {1,2,3}-Parity,
even in finitely branching MDPs.
*
Proof.
Consider the MDP M shown (on the left side) in Figure 2 where
S◯={ri}i∈N and
S□={t}∪{si}i∈N.
For all i∈N, there are transitions
si⟶ri and si⟶si+1 in si states, whereas
P(ri,t)=2−i and P(ri,s0)=1−2−i in random states.
Let σ be an arbitrary FR-strategy. We prove that PM,s0,σ(φ)=0.
By definition there is a transducer T with
finite memory M and initial mode m0 such that σT=σ.
Let MT be the Markov chain obtained by the product of the MDP M and the transducer T.
The set of states in MT is M×S.
We define a coloring function for MT such that it ignores the memory mode and assigns to (m,s)
the same color as state s in M.
By a slight abuse of notation, we use the same notation Col for the coloring functions of both M and MT.
We also denote by PMT,q(R) the probability of a measurable set R of infinite paths (i.e., infinite plays),
starting in the state q of MT.
We prove that PM,s0,σ(GF[S]Col=2∧FG[S]Col=3)=0.
Equivalently, we show that PMT,(m0,s0)(φT)=0
where φT=GF[M×S]Col=2∧FG[M×S]Col=3.
We proceed in three steps:
we will show, using strong fairness of probabilistic choices
in Markov chains, that for all modes m∈M,
[TABLE]
Moreover,
[TABLE]
Since the memory of strategy σ is finite (∣M∣<∞), we will show that
[TABLE]
Using (1)-(3) we complete the proof as follows:
[TABLE]
As a result PM,s0,σ(φ)=0. Below, we prove (1), (2) and (3).
We first highlight two properties of the Markov chain MT. Consider the MDP M:
(i) the only states with color 2 are random states ri∈S◯, wherein the only successors are s0 and t.
Hence, from all states (m,ri)∈M×S◯ in MT,
there are successors q1,q2 such that q1∈M×{s0}
and q2∈M×{t}; moreover, all successors are in M×{s0,t}.
(ii) The state t has the unique successor s0.
Hence, in MT, from all states (m,t)∈M×S◯
all successors q are such that q∈M×{s0}.
To establish (1), let (m,s0)∈M×{s0} be some state in the Markov chain MT.
For the case PMT,(m0,s0)(GF(m,s0))=0, we trivially have (1).
Therefore, we assume that PMT,(m0,s0)(GF(m,s0))>0.
So,
there exists an infinite path satisfying GF(m,s0).
Hence, there exists a finite path π from (m,s0) to itself.
By the structure of the chain, π
has a prefix (ms0,s0)(ms1,s1)⋯(msi,si)∈(M×S□)∗ traversing player’s states,
with ms0=m, and next visiting some state (mri,ri).
By property (i), we know that (mri,ri) has a successor (mt,t)∈M×{t}.
It implies that
[TABLE]
is a finite path in MT, starting in (m,s0) with positive probability.
By strong fairness of probabilistic choices,
PMT,(m0,s0)(GFπt)=PMT,(m0,s0)(GF(m,s0)).
In other words, under the condition that state (m,s0) is visited infinitely often,
the finite path πt will be taken infinitely often, almost-surely.
As an immediate result of this and the fact that (mt,t) has color 3, we conclude (1).
To establish (2),
we need to show that, for all infinite plays π∞ of MT,
if π∞∈[[GF[M×S]Col=2]]
then π∞∈[[GF(M×{s0})]].
By the properties (i) and (ii) of MT, whenever π∞ visits some state from M×S◯,
which are the only states with color 2,
then π∞ must visit some state from M×{s0} within two steps. This results in (2).
To establish (3), we observe that
the inclusion ⊇ is trivial. To show ⊆, let π∞∈[[GF(M×{s0})]]
be an infinite path in the chain. As π∞ visits infinitely many elements from the finite set M×{s0},
there must exist some element (m,s0) that is visited infinitely often. Hence,
π∞∈[[⋁m∈MGF(m,s0)]], which gives the inclusion and thus (3).
Now, we construct an HD-strategy σh such that PM,s0,σh(φ)=1.
The strategy σh is defined, for all partial plays ρ, as follows:
[TABLE]
Intuitively, σh is such that
upon the k-th visit to state s0,
the path s0s1⋯sk is traversed and then the
transition sk⟶rk is chosen.
Observe that PM,s0,σh(GF[S]Col=2)=1.
Below, we argue that PM,s0,σh(GF[S]Col=3)=0,
which proves that PM,s0,σh(φ)=1.
We define the sequence of events Ek
of visiting t between the kth and k+1st visits of s0.
For k≥1, let
[TABLE]
Observe that
[TABLE]
We use the
Borel-Cantelli lemma to prove that infinitely many of Ek’s occur with zero probability,
that is the probability of GF[S]Col=3.
In fact, by construction of σh, observe that PM,s0,σh(Ek)=2−k.
Consequently, we have
[TABLE]
By the Borel-Cantelli lemma, we then have PM,s0,σh(GF[S]Col=3)=0, and thus
PM,s0,σh(φ)=1.
The proof is complete.
∎
Remark 1.
*
A strict subclass of finitely branching MDPs are 1-counter MDPs,
where a finite-state MDP is augmented with an integer counter
[6].
The MDP in Theorem 1 (plus some auxiliary states)
is implementable by a 1-counter MDP.
*
Proof.
Figure 5 shows a 1-counter
MDP with control-states {s,r,r′,t} that
is functionally equivalent to the one used
in Theorem 1 (Figure 2 (left)).
It just uses some auxiliary states that have no influence on the parity objective.
Starting in s, the player can choose whether to increase the counter by 1
or to go to r.
In the random state r the behavior depends on the counter value.
If the counter is non-zero then the successors r,r′ are chosen with equal
probability and the counter is decreased by 1.
If the counter is zero then t is the unique successor.
In state r′ the counter is deterministically decreased until it becomes
zero, and then one goes to state s.
The color function is
Col(s)=1, Col(r)=Col(r′)=2 and Col(t)=3.
If one is in state r with counter value n, then the probability
of seeing state t before returning to state s is 2−n.
∎
Theorem 3.
*
Let φ=Safety(T).
There exists an infinitely branching MDP M with initial state s such that*
for all FR-strategies σ, we have PM,s,σ(φ)=0,
for every c∈[0,1), there exists an HD-strategy σ such that PM,s,σ(φ)≥c.
*Hence, ϵ-optimal strategies for safety require infinite memory.
*
Proof.
Consider the MDP M shown (on the left side) in Figure 3 where
S□={s,t} and S◯={ri}i≥1.
The state s is infinitely branching: s⟶ri for all i≥1.
For all random states ri, there are two successors
P(ri,t)=2−i and P(ri,s0)=1−2−i. The state t is a sink state.
Let σ be an arbitrary FR-strategy.
By definition there is a transducer T with
finite memory M and initial mode m0 such that σT=σ.
Let MT be the Markov chain obtained by the product of the MDP M and the transducer T.
The set of states in MT is M×S.
We denote by PMT,q(R) the probability of a measurable set R of infinite paths (i.e., infinite plays),
starting in the state q of MT.
Below, we prove that PM,s,σ(Safety({t}))=0.
Equivalently, we show that PMT,(m0,s)(Safety(M×{t}))=0.
We proceed in the following three steps: We will show, using strong fairness of probabilistic choices
in Markov chains, that for all modes m∈M,
[TABLE]
Since the memory of strategy σ is finite (∣M∣<∞), we will show that
[TABLE]
Moreover, we will prove that
[TABLE]
Using (4)-(6), we complete the proof as follows:
[TABLE]
As a result, PM,s0,σ(Safety({t}))=0. Below we prove (4), (5) and (6).
To establish (4), let (m,s)∈M×{s} be some state in the Markov chain MT.
For the case PMT,(m0,s)(GF(m,s))=0, we trivially have (4).
Therefore, we assume that PMT,(m0,s)(GF(m,s))>0.
So,
there exists an infinite path satisfying GF(m,s).
Hence, there exists a finite path π from (m,s) to itself.
By the structure of the chain, π visits some state (mri,ri).
In M, for all random states ri∈S◯, the only successors are s and t.
Hence, from all states (m,ri)∈M×S◯ in MT,
there is some successor q∈M×{t}.
It implies that πt=(m,s)(mri,ri)(mt,t)
is a finite path in MT, that starts in (m,s) with positive probability.
By strong fairness of probabilistic choices,
PMT,(m0,s)(Fπt)=PMT,(m0,s)(GF(m,s)).
As a result, we conclude (4).
To establish (5), we observe that
the inclusion ⊇ is trivial. To show ⊆, let π∞∈[[GF(M×{s0})]]
be an infinite path in the chain. As π∞ visits infinitely many elements from the finite set M×{s0},
there must exist some element (m,s0) that is visited infinitely often. Hence,
π∞∈[[⋁m∈MGF(m,s0)]], which gives the inclusion and thus (5).
To establish (6),
note that in M, all successors of s are from random states, from which the only successors are s and t.
Hence, for all infinite plays π∞∈[[F(M×{t})]] in MT,
π∞ alternates between some states from M×{s} and next some state from M×S◯.
It implies that π∈[[GF(M×{s})]].
Now, let c∈[0,1). Let n be such that c≤1−2n1. We construct an HD-strategy σn such that
[TABLE]
For all partial plays ρ, we define
σn(ρ)=rn+k where k is the number of times ρ has visited s.
Intuitively, upon the k-th visit to s, the strategy σn chooses
the transition s⟶rn+k.
For all k≥1, let Ek be the event of visiting t after the k-th visit of s, defined as follows
[TABLE]
Observe that [[F{t}]] is the disjoint union of all Ek events.
Hence, we have
[TABLE]
This proves that PM,s,σn(Safety({t}))=1−PM,s,σn(F{t})≥1−2n1≥c.
∎
Theorem 4.
*
Let φ={0,1}-Parity be the co-Büchi objective.
There exists an infinitely branching MDP M with initial state s such that*
for all FR-strategies σ, we have PM,s,σ(φ)=0,
there exists an HD-strategy σ such that PM,s,σ(φ)=1.
*Hence, optimal (and even almost-surely winning) strategies and ϵ-optimal strategies for co-Büchi require infinite memory.
*
Proof.
Consider the MDP M shown (on the right side) in Figure 3 where
S□={s,t} and S◯={ri}i≥1.
The state s is infinitely branching: s⟶ri for all i≥1.
For all random states ri, there are two successors
P(ri,t)=2−i and P(ri,s)=1−2−i. The state t has the unique successor s.
Let σ be an arbitrary FR-strategy.
By definition there is a transducer T with
finite memory M and initial mode m0 such that σT=σ.
Let MT be the Markov chain obtained by the product of the MDP M and the transducer T.
The set of states in MT is M×S.
We define a coloring function for MT such that it ignores the memory mode and assigns to (m,s)
the same color as state s in M. In particular, all states q∈M×{t} have color 1.
We use the same notation Col for the coloring functions of both M and MT.
We denote by PMT,q(R) the probability of a measurable set R of infinite paths (i.e., infinite plays),
starting in the state q of MT.
We prove that PM,s,σ(FG[S]Col=1)=0.
Equivalently, we show that PMT,(m0,s)(FG[M×S]Col=1)=0.
We proceed in three steps:
we will show, using strong fairness of probabilistic choices
in Markov chains, that for all modes m∈M,
[TABLE]
Since the memory of strategy σ is finite (∣M∣<∞), we will show that
[TABLE]
Moreover, we show that
[TABLE]
Using (7), (8) and (9), we complete the proof as follows:
[TABLE]
As a result, PM,s,σ(φ)=0. Below we prove (7), (8) and (9).
To establish (7), let (m,s)∈M×{s} be some state in the Markov chain MT.
For the case PMT,(m0,s)(GF(m,s))=0, we trivially have (7).
Therefore, we assume that PMT,(m0,s)(GF(m,s))>0.
So,
there exists an infinite path satisfying GF(m,s).
Hence, there exists a finite path π from (m,s) to itself.
By the structure of the chain, π visits some state (mri,ri).
Consider the MDP M, for all random states ri∈S◯, the only successors are s and t.
Hence, from all states (m,ri)∈M×S◯ in MT,
there is some successor q such that q∈M×{t}.
It implies that πt=(m,s)(mri,ri)(mt,t)
is a finite path in MT, that starts in (m,s) with positive probability.
By strong fairness of probabilistic choices,
PMT,(m0,s)(GFπt)=PMT,(m0,s)(GF(m,s)).
As a result, we conclude (7).
To establish (8), we observe that
the inclusion ⊇ is trivial. To show ⊆, let π∞∈[[GF(M×{s})]]
be an infinite path in the chain. As π∞ visits infinitely many elements from the finite set M×{s},
there must exist some element (m,s) that is visited infinitely often. Hence,
π∞∈[[⋁m∈MGF(m,s)]], which gives the inclusion and thus (8).
To establish (9),
note that in M, all infinite plays must visit s infinitely often.
Thus all runs in MT must visit M×{s}
infinitely often.
In particular, this holds for those infinite
runs π∞∈[[FG[M×S]Col=1]] in MT.
Thus π∞∈[[GF(M×{s})]].
Now we construct an HD-strategy σh such that PM,s,σh(φ)=1.
For all partial plays ρ, we define
σh(ρ)=rk where k is the number of times ρ has visited s.
Intuitively, upon the k-th visit to s, the strategy σh chooses the
transition s⟶rk.
Below we argue that PM,s,σh(GF[S]Col=1)=0,
which proves that PM,s,σh(φ)=1.
We define the sequence of events Ek
of visiting t between the k-th and k+1st visits of s.
For k≥1, let
[TABLE]
Observe that
[TABLE]
We use the
Borel-Cantelli lemma to prove that infinitely many of Ek’s occur with zero probability,
that is the probability of GF[S]Col=1.
In fact, by construction of σh, observe that PM,s,σh(Ek)=2−k.
Consequently, we have
[TABLE]
By the Borel-Cantelli lemma, we then have PM,s,σh(GF[S]Col=1)=0, and thus
PM,s,σh(φ)=1.
The proof is complete.
∎
-B Proofs of Section IV
Lemma 20**.**
Let φ be an objective that is prefix-independent in a class C of MDPs.
Let M=(S,S□,S◯,⟶,P)∈C, and s0∈S, and σ be a strategy with PM,s0,σ(φ)=valM(s0).
Suppose that s0s1⋯sn for some n≥0 is a partial play starting in s0 and induced by σ.
Then:
-
valM(sn)=PM,s0,σ([[φ]]s0∣s0s1⋯snSω).
2. 2.
If sn∈S◯ then valM(sn)=∑sn+1∈SP(sn)(sn+1)⋅valM(sn+1).
3. 3.
If sn∈S□ then valM(sn)=valM(sn+1) for all sn+1∈supp(σ(s0s1⋯sn)).
Proof.
First we show PM,s0,σ([[φ]]s0∣s0s1⋯snSω)≤valM(sn).
Define a strategy σ′:S∗S□→D(S) by σ′(w)=σ(s0s1⋯sn−1w) for all w∈S∗S□.
Then we have PM,s0,σ([[φ]]s0∣s0s1⋯snSω)=PM,sn,σ′([[φ]]sn)≤valM(sn).
Next we show valM(sn)≤PM,s0,σ([[φ]]s0∣s0s1⋯snSω).
Towards a contradiction, suppose that valM(sn)>PM,s0,σ([[φ]]s0∣s0s1⋯snSω).
Then, by the definition of valM(sn), there is a strategy σ′ with PM,sn,σ′([[φ]]sn)>PM,s0,σ([[φ]]s0∣s0s1⋯snSω).
Define a strategy σ′′ that plays according to σ;
if and when partial play s0s1⋯sn is played, then σ′′ acts like σ′ henceforth;
otherwise σ′′ continues with σ forever.
Using prefix-independence we get:
[TABLE]
This contradicts the definition of valM(s0).
Hence we have shown item 1.
Towards items 2 and 3, we extend σ:S∗S□→D(S) to σ:S∗S→D(S) by defining σ(ws)=P(s) for w∈S∗ and s∈S◯.
Then we have for all sn+1∈S:
[TABLE]
Further we have:
[TABLE]
Thus we have shown item 2.
Towards item 3, suppose sn∈S□.
Then prefix-independence implies valM(sn)≥valM(sn+1) for all sn+1 with sn⟶sn+1.
Since σ(s0s1⋯sn) is a probability distribution, the equality chain above shows that valM(sn)=valM(sn+1) for all sn+1∈supp(σ(s0s1⋯sn)).
Thus we have shown item 3.
∎
Lemma 6.
*
Let φ be an objective that is prefix-independent in a class C of MDPs.
Let M=(S,S□,S◯,⟶,P)∈C. Construct an MDP M∗=(S∗,S∗□,S∗◯,⟶∗,P∗) by setting*
[TABLE]
and S∗□=S∗∩S□ and S∗◯=S∗∩S◯ and
[TABLE]
and P∗:S∗◯→D(S∗) so that
[TABLE]
for all s∈S∗◯ and t∈S∗ with s⟶∗t. Then:
-
For all σ∈ΣM∗ and all n≥0 and all s0,…,sn∈S∗ with s0⟶∗s1⟶∗⋯⟶∗sn:
[TABLE]
2. 2.
For all s0∈S∗ and all σ∈ΣM with PM,s0,σ(φ)=valM(s0)>0 and all measurable R⊆s0Sω we have
PM∗,s0,σ(R)=PM,s0,σ(R∣[[φ]]s0).
Proof.
Note that by Lemma 20.2 we have that P∗(s) is a probability distribution for all s∈S∗◯; hence the MDP M∗ is well-defined.
We prove item 1 by induction on n.
For n=0 it is trivial.
For the step, suppose that the equality in item 1 holds for some n.
If sn∈S∗◯ then we have:
[TABLE]
Let now sn∈S∗□.
If σ(s0s1…sn)(sn+1)=0 then the inductive step is trivial.
Otherwise we have:
[TABLE]
This completes the inductive step, and we have proved item 1.
Towards item 2, let s0∈S∗ and σ∈ΣM such that PM,s0,σ(φ)=valM(s0)>0.
Observe that σ can be applied also in the MDP M∗.
Indeed, for any s∈S∗□, if t is a possible successor state of s under σ, then valM(s)=valM(t) by Lemma 20.3 and thus t∈S∗.
Let again n≥0 and s0,s1,…,sn∈S.
Suppose s0s1⋯sn is a partial play in M∗ induced by σ.
Then we have:
[TABLE]
Suppose s0s1⋯sn is not a partial play in M∗ induced by σ.
Hence PM∗,s0,σ(s0s1⋯snSω)=0.
If s0s1⋯sn is not a partial play in M induced by σ then PM,s0,σ(s0s1⋯snSω)=0.
Otherwise, since σ is optimal, there is i≤n with valM(si)=0, hence PM,s0,σ([[φ]]s0∩s0s1⋯snSω).
In either case we have
PM∗,s0,σ(s0s1⋯snSω)⋅PM,s0,σ(φ)=0=PM,s0,σ([[φ]]s0∩s0s1⋯snSω).
In either case we have the equality PM∗,s0,σ(R)=PM,s0,σ(R∣[[φ]]s0) for cylinders R=s0s1⋯snSω.
Since probability measures extend uniquely from cylinders [5], the equality holds for all measurable R⊆s0Sω.
Thus we have shown item 2.
∎
Lemma 8.
*
Let S be countable and s∈S.
Call a set of the form swSω for w∈S∗ a cylinder.
Let P,P′ be probability measures on sSω defined in the standard way, i.e., first on cylinders and then extended to all measurable sets R⊆sSω.
Suppose there is x≥0 such that x⋅P(C)≤P′(C) for all cylinders C.
Then x⋅P(R)≤P′(R) holds for all measurable R⊆sSω.
*
Proof.
Let C={C⊆sSω∣C cylinder} denote the class of cylinders.
This class generates an algebra C∗⊇C, which is the closure of C under finite union and complement.
The classes C and C∗ generate the same σ-algebra σ(C).
The class C∗ is the set of finite disjoint unions of cylinders [5, Section 2].
Hence x⋅P(R)≤P′(R) for all R∈C∗.
Define
[TABLE]
We have C⊆C∗⊆Q⊆σ(C).
We show that Q is a monotone class, i.e., if R1,R2,…∈Q, then R1⊆R2⊆⋯ implies ⋃iRi∈Q, and R1⊇R2⊇⋯ implies ⋂iRi∈Q.
Suppose R1,R2,…∈Q and R1⊆R2⊆⋯.
Then:
[TABLE]
So ⋃iRi∈Q.
Using the fact that measures are continuous from above, one can similarly show that if R1,R2,…∈Q and R1⊇R2⊇⋯ then ⋂iRi∈Q.
Hence Q is a monotone class.
Now the monotone class theorem (see, e.g., [5, Theorem 3.4]) implies that σ(C)⊆Q, thus Q=σ(C).
Hence x⋅P(R)≤P′(R) for all R∈σ(C).
∎
-C Proofs of Section V
Theorem 10.
*
Let M=(S,S□,S◯,⟶,P) be an MDP, and φ=Reach(T). Let s0∈S and σ be a strategy with PM,s0,σ(φ)=1.
Then there is an MD-strategy σ^ with PM,s0,σ^(φ)=1.
*
Proof.
We can assume that T={t} for some t∈S.
We can also assume that all states are almost-surely winning, since in order to achieve an almost-sure winning objective, the player must forever remain in almost-surely winning states.
Let ϵ1:=1/2.
By Theorem 9 there exists an MD-strategy σ1 such that PM,s0,σ1(φ)≥1−ϵ1.
In fact, by the proof of Theorem 9 there exists a finite subset V1⊆S such that PM,s0,σ1([[φ]]s0∩V1ω)≥1−ϵ1.
Let U1 denote the states that occur in those plays that are both contained in [[φ]]s0∩V1ω and induced by σ1.
Then PM,s0,σ1([[φ]]s0∩U1ω)=PM,s0,σ1([[φ]]s0∩V1ω)≥1−ϵ1.
By the definition of U1, for all s∈U1 the MD-strategy σ1 induces a play from s0 to t via s.
Hence we have PM,s,σ1([[φ]]s∩U1ω)>0.
Since U1⊆V1 is finite, there are c>0 and ℓ∈N such that for all s∈U1 we have PM,s,σ1(sU1≤ℓ−1{t}ω)≥c, i.e., from any state in U1 the probability that t is reached in ≤ℓ steps is at least c.
Consider the MDP M1 obtained from M by fixing σ1 on U1 (i.e., in M1 the states in U1 are random states).
We argue that all states are almost-surely winning in M1.
Indeed, let s∈S be any state.
Recall that s is almost-surely winning in M.
Define an HR-strategy σ in M1 as follows:
first play a strategy that is almost-surely winning for s in M;
if and when U1 is entered and then left again (entering some state s′∈S∖U1) then forget the history and play again a strategy that is almost-surely winning for s′ in M; and so forth.
This strategy σ reaches {t} with probability 1 whenever the play
stays outside of U1.
I.e., almost all plays that eventually always avoid U1 reach {t}.
Moreover, whenever the play enters U1, the probability that t is reached
in ≤ℓ steps is at least c, i.e., there is a uniform bound.
Thus almost all plays that enter U1 infinitely often reach {t}.
It follows that we have PM1,s,σ(φ)=1.
Now we repeat the argument, but with M1 instead of M and with ϵ2=1/4 instead of ϵ1.
This yields a set U2⊇U1 and an MD-strategy σ2 that agrees with σ1 on U1 so that PM,s0,σ2([[φ]]s0∩U2ω)=PM1,s0,σ2([[φ]]s0∩U2ω)=1−ϵ2.
Similarly as before, obtain an MDP M2 from M1 by fixing σ2 on U2.
Then repeat again, and so forth, with ϵi=1/2i for i=1,2,…
Define U:=⋃i≥1Ui.
Observe that on all s∈U almost all (i.e., all except finitely many) strategies σi agree.
Let σ^ be an MD-strategy that on all states in U agrees with almost all MD-strategies σi.
By our construction we have PM,s0,σ^([[φ]]s0∩Uω)≥1−ϵ for all ϵ>0.
Hence PM,s0,σ^([[φ]]s0∩Uω)=1.
∎
Proposition 11.*
Let M=(S,S□,S◯,⟶,P) be an MDP, and s0∈S, and σ a strategy, and Col:S→{1,2}, and φ=Parity(Col).
Suppose PM,s0,σ(φ)=1.
Then there is an MD-strategy σ′ with PM,s0,σ′(φ)=1.
*
Proof.
We can assume that all states are almost-surely winning, since in order to achieve an almost-sure winning objective, the player must forever remain in almost-surely winning states.
We provide an MD-strategy σ^ such that for all states s∈S we have PM,s,σ^(φ)=1.
Set φ′=Reach([S]Col=2).
Note that [[φ]]⊆[[φ′]].
Since all states are almost-surely winning for φ, all states are almost-surely winning for φ′.
By Theorem 10 and Lemma 7 there is an MD-strategy σ^ such that for all states s∈S we have PM,s,σ^(φ′)=1.
That is, σ^ reaches the set [S]Col=2 with probability 1, regardless of the start state.
It follows that it reaches, with probability 1, the set [S]Col=2 infinitely often.
Hence PM,s,σ^(φ)=1 holds for all s∈S.
∎
-D Proofs of Subsection VI-A
Proposition 13.*
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP, and T⊆S, and φ=Safety(T).
Define an MD-strategy σopt\mathchar45av (for “optimal avoiding”) that, in each state s, picks a successor state with the largest value valM(s)=supσ∈ΣPM,s,σ(φ).
Then for all states s∈S we have
PM,s,σopt\mathchar45av(φ)=valM(s), i.e.,
σopt\mathchar45av is uniformly optimal.
*
Proof.
We can assume that T is a sink.
Fix a state s0.
Write s0s1s2⋯∈sSω for a random run, i.e., s1,s2,… denote random states.
For any n∈N let [Xn¬T]:s0Sω→{0,1} be the random variable that indicates if sn∈T.
Note that [Xn¬T]≥valM(sn).
Writing EM,s0,σopt\mathchar45av for the expectation with respect to PM,s0,σopt\mathchar45av, we have:
[TABLE]
A straightforward induction on n, using the definition of σopt\mathchar45av, shows that EM,s0,σopt\mathchar45av(valM(sn))=valM(s0) holds for all n∈N.
Hence PM,s0,σopt\mathchar45av(φ)≥valM(s0).
The converse inequality holds by the definition of the value.
Hence PM,s0,σopt\mathchar45av(φ)=valM(s0).
∎
Lemma 15.
*
Let M=(S,S□,S◯,⟶,P) be a finitely branching MDP, and Col:S→N a color function.
Let s be a state, and σ a strategy, and τ<1.
Then PM,s,σ(FG¬Safe(τ)∧FG[S]Col=0)=0.
*
Proof.
For any n∈N define Zn=([S]Col=0)n.
That is, ZnSω is the event that the first n visited states have color [math].
For any state s∈Safe(τ), let n(s)∈N be the smallest number such that PM,s,σ(Zn(s)Sω)≤(1+τ)/2.
This is well-defined.
Let L⊆S∗ be the set of finite sequences s0s1⋯sn−1 such that s0∈Safe(τ) and n=n(s0) and ∀i<n.si∈[S]Col=0∖Safe(τ).
We show for all s∈S∖Safe(τ) and all k∈N that PM,s,σ(LkSω)≤(21+τ)k.
We proceed by induction on k.
The case k=0 is trivial.
For the induction step, let k≥0.
We have:
[TABLE]
This completes the induction proof.
Write φ:=G¬Safe(τ)∧G[S]Col=0.
We have for all s∈S:
[TABLE]
It follows:
[TABLE]
Thus we have:
[TABLE]
∎
-E Proofs of Subsection VI-B
Lemma 18.
*
Let P be a probability measure over the sample space Ω.
Let (Ri)i∈I be a countable partition of Ω in measurable events.
Let E⊆Ω be a measurable event.
Suppose P(Ri∩E)=P(Ri) holds for all i∈I.
Then P(E)=1.
*
Proof.
We have:
[TABLE]
∎