Positivity-hardness results on Markov decision processes
Jakob Piribauer1
Christel Baier1
(1Technische Universität Dresden)
Abstract
This paper investigates a series of optimization problems for one-counter Markov decision processes (MDPs) and integer-weighted MDPs with finite state space. Specifically, it considers problems addressing termination probabilities and expected termination times for one-counter MDPs, as well as satisfaction probabilities of energy objectives, conditional and partial expectations, satisfaction probabilities of constraints on the total accumulated weight, the computation of quantiles for the accumulated weight, and the conditional value-at-risk for accumulated weights for integer-weighted MDPs. Although algorithmic results are available for some special instances, the decidability status of the decision versions of these problems is unknown in general.
The paper demonstrates that these optimization problems are inherently mathematically difficult by providing polynomial-time reductions from the Positivity problem for linear recurrence sequences. This problem is a well-known number-theoretic problem whose decidability status has been open for decades and it is known that decidability of the Positivity problem would have far-reaching consequences in analytic number theory. So, the reductions presented in the paper show that an algorithmic solution to any of the investigated problems is not possible without a major breakthrough in analytic number theory.
The reductions rely on the construction of MDP-gadgets that encode the initial values and linear recurrence relations of linear recurrence sequences.
These gadgets can flexibly be adjusted to prove the various Positivity-hardness results.
Interestingly, the reductions can also be extended to demonstrate the Positivity-hardness of two problems that address the long-run behavior of a system, namely the model-checking problem of frequency-LTL and the optimization of the long-run average probability to satisfy a path property (long-run probability). Both of these problems have been studied before on special instances, but are open in general.
1 Introduction
When modelling and analyzing computer systems and their interactions with their environment, two qualitatively different kinds of uncertainty about the evolution of the system execution play a central role: non-determinism and probabilism.
If a system is, for example, employed in an unknown environment or depends on user inputs or concurrent processes, modelling the system as non-deterministic accounts for all possible external influences, sequences of user inputs, or possible orders in which concurrent events take place. If transition probabilities between the states of a system, such as the failure probability of components or the probabilities in a probabilistic choice employed in a randomized algorithm, are known or can be estimated, it is appropriate to model this behavior as probabilistic. A pure worst- or best-case analysis is not very informative in such cases and the additional probabilistic information available should be put to use.
Markov decision processes (MDPs) are a standard operational model combining non-deterministic and probabilistic behavior and are widely used in operations research, artificial intelligence, and verification among others.
In each state of an MDP, there is a non-deterministic choice from a set of actions. Each action specifies a probability distribution over the possible successor states according to which a transition is chosen randomly.
Typical optimization problems on MDPs require to resolve the non-deterministic choices by specifying a scheduler such that a quantitative objective function is optimized.
For example, the standard model-checking problem asks for the minimal or maximal probability that an execution satisfies a given linear-time property. Here, minimum and maximum range over all resolutions of the non-deterministic choices, i.e., over all schedulers.
This model-checking problem is known to be 2EXPTIME-complete if the property is given in linear temporal logic (LTL) [CY95] and solvable in polynomial time if the property is given by a deterministic automaton [dA99, BK08].
Many quantitative aspects of a system can be modeled by equipping an MDP with weights that are collected in each step. These weights might represent time, energy consumption, utilities, or generally speaking any sort of costs or rewards incurred. Classical optimization problems in this context that are known to be solvable in polynomial time include the optimization of the expected value of
the total accumulated weight before a target state is reached, the so-called stochastic shortest path problem (SSPP) [BT91, dA99, BBD*+*18], the expected value of the reward earned in average per step, the so-called expected mean payoff or long-run average, or the expected discounted accumulated weight where after each step a discount factor is applied to all future weights (for the latter two, see, e.g., [HK79, Put94]).
Of course, there is a vast landscape of further optimization problems on finite-state MDPs that have been analyzed.
We are, nevertheless, not aware of natural decision problems for standard (finite-state) MDPs with a single weight function and single objective that are known to be undecidable.
Undecidability results have been established for more expressive models. This applies, e.g., to recursive MDPs [EY05], MDPs with two or more weight functions [BKKW14, RRS17], or partially observable MDPs [MHC99, BGB12].
In this paper, we will investigate a series of optimization problems that have been studied in the literature, but are open in general.
We will show that these problems possess an inherent mathematical difficulty that makes algorithmic solutions without a major breakthrough in analytic number theory impossible.
Formally, this result is obtained by reductions from the Positivity problem for linear recurrence sequences, a number theoretic problem whose decidability status has been open for many decades.
1.1 Positivity problem
Definition 1.1** (Positivity problem).**
The Positivity problem for linear recurrence sequences asks whether such a sequence stays non-negative. More formally,
given a natural number k≥2, and rationals αi and βj with 1≤i≤k and 0≤j≤k−1, let (un)n≥0 be defined by the initial values u0=β0, …, uk−1=βk−1 and the linear recurrence relation
[TABLE]
for all n≥0. The Positivity problem asks to decide whether un≥0 for all n.111We do not distinguish between the Positivity problem and its complement in the sequel. So, we also refer to the problem whether there is an n such that un<0 as the Positivity problem. The number k is called the order of the linear recurrence sequence.
The Positivity problem is closely related to the famous Skolem problem. The Skolem problem asks whether there is an n such that un=0 for a given linear recurrence sequence (un)n≥0. It is well-known that the
Skolem problem is polynomial-time reducible to the Positivity problem (see, e.g., [EvdPSW03]).
The Positivity problem and the Skolem problem are outstanding problems in the fields of number theory and theoretical computer science (see, e.g., [HHHK05, OW12, OW15]) and their decidability has been open for many decades.
Deep results establish decidability for both problems for linear recurrence sequences of low order or for restricted classes of sequences
[STM84, Ver85, OW14a, OW14b, OW14c]. A proof of decidability or undecidability of the Positivity problem for arbitrary sequences, however, withstands all known number-theoretic techniques. In [OW14b], it is shown that decidability of the Positivity problem (already for linear recurrence sequences of order 6) would entail a major breakthrough
in the field of Diophantine approximation of transcendental numbers, an area of analytic number theory.
We call a problem to which the Positivity problem is reducible Positivity-hard.
From a complexity theoretic point of view, the Positivity problem is known to be at least as hard as the decision problem for the universal theory of the reals [OW14c], a problem known to be coNP-hard and to lie in PSPACE [Can88]. As most of the problems we will address are PSPACE-hard, the reductions in this paper do not provide new lower bounds on the computational complexity.
The hardness results in this paper hence refer to the far-reaching consequences on major open problems that a decidability result would imply.
Furthermore, of course the undecidability of the Positivity problem would entail the undecidability of any Positivity-hard problem.
1.2 Problems under investigation and related work on these problems
In the sequel, we briefly describe the problems studied in this paper and describe related work on these problems. In general, the decidability status of all of these problems is open and we will prove them to be Positivity-hard.
Energy objectives, one-counter MDPs, and quantiles.
If weights model a resource like energy that can be consumed and gained during a system execution, a natural problem is to determine the worst- or best-case probability that the system never runs out of the resource. This is known as the energy objective.
There has been work on combinations of the energy objective with further objectives such as parity objectives [CD11, MSTW17] and expected mean payoffs [BKN16]. Previous work on this objective focused on the possibility to satisfy the objective (or the combination of objectives) almost surely.
The quantitative problem whether it is possible to satisfy an energy objective with probability greater than some threshold p is open.
The complement of the energy objective can be found in the context of one-counter MDPs (see [BBE*+*10, BBEK11, BKNW12]): Equipping an MDP with a counter that can be increased and decreased can be used to model a simple form of recursion and a can be seen as a special case of pushdown MDPs. The process is said to terminate as soon as the counter value drops below [math] and the standard task is to compute maximal or minimal termination probabilities. In one-counter MDPs that terminate almost surely, one furthermore can ask for the extremal expected termination times, i.e. expected number of steps until termination.
While it is decidable for one-counter MDPs whether the maximal termination probability is 1 in polynomial time and in exponential time if termination is required to occur inside a specified set of states [BBE*+*10],
the computation of the optimal value and the quantitative decision problem whether the optimal value exceeds a threshold p are left open in the literature. Also the problem to compute the minimal or maximal expected termination time of a one-counter MDP that terminates almost surely under any scheduler is open.
There are, however, approximation algorithms for the optimal termination probability [BBEK11]
and for the expected termination time of almost surely terminating one-counter MDPs [BKNW12].
One-counter MDPs can be seen as a special case of
recursive MDPs [EY15]. For general recursive MDPs, the qualitative decision problem whether the maximal termination probability is 1 is undecidable while for restricted forms, so-called 1-exit recursive MDPs, the qualitative and also the quantitative problem is decidable in polynomial space [EY15]. One-counter MDPs can be seen as a special case of 1-box recursive MDPs in the terminology of [EY15], a restriction orthogonal to 1-exit recursive MDPs.
The termination probability of one-counter MDPs and the satisfaction probability of the energy objective are closely related to the computation of quantiles (see [UB13, BDD*+*14, RRS17]).
Given a probability value p, here the task is to compute the best bound b such that the maximal or minimal probability that the accumulated weight exceeds the bound is at most or at least p. The decision version whether the maximal or minimal probability that the accumulated weight before reaching a target state exceeds b is at least or at most p is also known as the cost problem (see [HK15, HKL17, BBD*+*18]).
The computation of quantiles and the cost problem have been addressed for MDPs with non-negative weights and are solvable in exponential time in this setting
[UB13, HK15]. The decision version of the cost problem with non-negative weights is furthermore PSPACE-hard for a single inequality on the accumulated weight and EXPTIME-complete if a Boolean combination of inequality constraints on the accumulated weight is considered [HK15].
For the setting with arbitrary weights,
[BBD*+*18] provides solutions to the qualitative question whether a constraint on the accumulated weight is satisfied with probability 1 (or >0). Further,
it is known that the quantitative problem is undecidable if multiple objectives with multiple weight functions have to be satisfied simultaneously [RRS17].
Non-classical stochastic shortest path problems (SSPPs).
The classical SSPP described above requires that a goal state is reached almost surely. In many situations, however, there might be no schedulers reaching the target with probability 1 or schedulers that miss the target with positive probability are of interest, too.
Two non-classical variants that drop this requirement are the conditional SSPP (see [BKKW17, PB19]) and the partial SSPP (see [CFK*+*13a, CFK*+*13b]). In the conditional SSPP, the goal is to optimize the conditional expected accumulated weight before reaching the target under the condition that the target is reached. In other words, the average weight of all paths reaching the target has to be optimized.
In the partial SSPP, paths not reaching the target are not ignored, but assigned weight [math].
Possible applications for these non-classical SSPPs include the analysis of probabilistic programs where no guarantees on almost sure termination can be given (see, e.g., [GKM14, KGJ*+*15, BEFH16, CFG16, OGJ*+*18]), the analysis of fault-tolerant systems where error scenarios might occur with small, but positive probability, or the trade-off analysis with conjunctions of utility and cost constraints that are achievable with positive probability, but not almost surely (see, e.g., [BDK*+*14]).
In [CFK*+*13a] and [BKKW17], partial and conditional expectations, respectively, have been addressed in the setting of non-negative weights. In both-cases, the optimal value can be computed in exponential time
[CFK*+*13a, BKKW17] while the threshold problem is PSPACE-hard [PB19, BKKW17]. In MDPs with positive and negative weights, it is known that the optimal values might be irrational and that optimal schedulers might require infinite memory [PB19].
Conditional expectations also play an important role for some risk measures. The conditional value-at-risk (CVaR) is an established risk measure (see, e.g., [Ury00, AT02]) defined as the conditional expected outcome under the condition that the outcome belongs to the p worst outcomes for a given probability value p.
In the context of optimization problems on weighted MDPs, the CVaR has been studied for mean-payoffs and weighted reachability where only one terminal weight is collected per run (see [KM18]), and for the accumulated weight before reaching a target state in MDPs with non-negative weights (see [ADBA21]).
The CVaR for accumulated weights can be optimized in MDPs with non-negative weights in exponential time [PB20, Meg22].
Long-run properties and frequency-LTL over MDPs.
Besides encoding quantitative features of a model into a weight-structure,
a further branch of research addresses ways to quantify the degree to which a specification is satisfied by a model (see [Hen13] for an overview of the field).
One line of research in this direction attempts to measure the degree to which a
specification is satisfied when evolving over time. This includes the work on frequency-LTL [BDL12, FK15, FKK15].
In frequency-LTL, temporal operators are relaxed by frequency constraints. Under a relaxed ‘globally’-operator with a rational lower frequency bound q, a formula does not have to hold on all suffixes, but the frequency of suffixes that satisfy a formula
has to be at least q.
Similarly, long-run probabilities (see [BBPS19]) capture the average probability that a system will satisfy a property when we start to observe it after many steps.
Frequency-LTL and long-run probabilities can be useful for the analysis of the properties of systems in the long-run equilibrium after some initialization phase. This is helpful, e.g., to quantify the availability of system components. For a case study in this direction employing probabilistic model checking to analyze system availability, see
[LPM*+*15].
We address the model-checking problem of frequency-LTL in MDPs and the optimization of long-run probabilities of simple co-safety properties in this paper.
In [BBPS19], it is shown that for several types of properties including Street and Rabin conditions, the long-run probability in MDPs can be optimized in polynomial time. For constrained reachability (aUb), however,
the threshold problem is shown to be NP-hard while the optimal value can be computed in exponential time.
The model-checking problem for the full logic frequency-LTL on MDPs is open, but
fragments for which the model-checking problem on MDPs is decidable have been identified [FK15, FKK15]. In particular, the model-checking problem is decidable if the until operator is not allowed in the scope of (frequency-)globally-operators [FKK15].
1.3 Contribution
We develop a technique to provide reductions from the Positivity problem to threshold problems on MDPs, asking whether the optimal value of a quantity exceeds a given rational threshold. The resulting reductions are based on the construction of MDP-gadgets that allow to encode the linear recurrence relation of a linear recurrence sequence and the initial values, respectively.
The approach turns out to be quite flexible. By adjusting the gadgets encoding initial values, we can provide reductions of the same overall structure for several of the optimization problems we discussed. Through further chains of reductions depicted in Figure 1, we establish Positivity-hardness for the full series of optimization problems under investigation. The main result of this paper consequently is the following:
Main result**.**
The Positivity problem is polynomial-time reducible to the threshold problems for the optimal values of the following quantities:
termination probabilities of one-counter MDPs,
expected termination times of almost surely terminating one-counter MDPs,
the satisfaction probabilities of energy objectives in MDPs with weights in Z,
the probability to satisfy an inequality on the accumulated weight (cost problem) in MDPs with weights in Z,
conditional expectations (conditional SSPP) in MDPs with weights in Z,
partial expectations (partial SSPP) in MDPs with weights in Z,
conditional values-at-risk for accumulated weights (before reaching a goal) in MDPs with weights in Z,
a two-sided version of partial expectations in MDPs with two non-negative weight functions with values in N, and
long-run probabilities of regular co-safety properties in MDPs.
Furthermore, an algorithm for
the computation of quantiles for accumulated weights in MDPs with weights in Z, or
the model-checking problem of frequency-LTL (as defined in [FK15, FKK15]) on MDPs
would imply the decidability of the Positivity-problem.
1.4 Related work on Skolem- and Positivity-hardness in verification
In
[AAOW15], the Skolem-hardness of decision problems
for Markov chains has been established. The problems studied in [AAOW15] are (1) to decide whether for given states s, t
and rational number
p, there is a positive integer n such that the
probability to reach t from s in n steps equals p
and (2)
the model checking problem for a probabilistic variant of monadic logic
and a variant of LTL that treats Markov chains as linear transformers of
probability distributions. A connection between similar problems and the Skolem problem has also been conjectured in [BRS06, AAGT15].
These decision problems are of quite different nature than the
problems studied here, and so are the reductions from the Skolem problem, because the behavior of a Markov chains in n steps can directly be expressed by Pn where P is the transition probability matrix due to the lack of non-determinism.
In this context also the results of
[COW16] and [MSS20] are remarkable
as they show the decidability, subject to Schanuel’s conjecture, of
reachability problems in
continuous linear dynamical systems and continuous-time MDPs, respectively,
as instances of the continuous Skolem problem.
In other areas of formal verification, the Skolem problem and the Positivity problem play an important role in the context of the
termination of linear programs [BAGM12, Tiw04, Bra06, OW15].
The Positivity-hardness results leave the possibility open that the problems under consideration are undecidable.
Remarkable undecidability results in this context are presented in [KK15]:
The hardness of deciding almost sure termination and almost sure termination with finite expected termination time for purely probabilistic programs formulated in the probabilistic fragment of probabilistic guarded command language (pGCL) [MMM05] is pinpointed to levels of the arithmetical hierarchy (for details on the arithmetical hierarchy, see, e.g., [Odi92]).
The results reach up to Π30-completeness for deciding universal almost sure termination with finite expected termination time (Π10-complete problems are already undecidable while still co-recursively enumerable). Undecidability is not surprising as the programs subsume ordinary programs. But the universal halting problem for ordinary programs is only Π20-complete showing that deciding universal termination with finite expected termination time of probabilistic programs is strictly harder. Similarly deciding termination from a given initial configuration is Σ10-complete for ordinary programs (halting problem) while deciding almost sure termination with finite expected termination time for probabilistic programs from a given initial configuration is Σ20-complete.
Operational semantics of pGCL-programs can be given as infinite-state MDPs [GKM14]. Applied to the purely probabilistic fragment, this leads to infinite-state Markov chains.
1.5 Outline
In the following Section 2, we provide necessary definitions and present our notation.
In Section 3, we outline the general structure of the gadget-based reductions from the Positivity-problem and construct
an MDP-gadget in which a linear recurrence relation can be encoded in terms of the optimal values for a variety of optimization problems (Section 3.2).
Afterwards, we construct gadgets encoding also the initial values of a linear recurrence sequence and provide the reductions from the Positivity problems and all subsequent reductions as depicted in Figure 1 (Section 4).
We conclude with final remarks and an outlook on future work (Section 5).
1.6 Note on the publication status of the results
This paper is an extension of work published at ICALP 2020 [PB20]. It extends the conference version by the results for one-counter MDPs, energy objectives, quantiles, and cost problems.
These additional results are also presented in the PhD thesis [Pir21].
Furthermore, full proofs omitted in the conference version and a detailed and improved description of all constructions is provided.
2 Preliminaries
We assume some familiarity with Markov decision processes and briefly introduce our notation in the sequel. More details can be found in text books such as [Put94].
Markov decision process.
A Markov decision process (MDP) is a tuple M=(S,Act,P,sinit)
where
S is a finite set of states,
Act is a finite set of actions,
P:S×Act×S→[0,1]∩Q is the
transition probability function for which we require that
∑t∈SP(s,α,t)∈{0,1}
for all (s,α)∈S×Act, and
sinit∈S is the initial state.
Depending on the context, we enrich MDPs with
a weight function wgt:S×Act→Z,
a finite set of atomic propositions AP and a labeling function
L:S→2AP, or
a designated set of goal states Goal.
The size of an MDP M, denoted by size(M),
is the sum of the number of states
plus the total sum of the logarithmic lengths of the non-zero
probability values
P(s,α,s′) as fractions of co-prime integers and, if present, the logarithmic lengths of the weight values wgt(s,α).
We write Act(s) for the set of actions that are enabled in a state s,
i.e., α∈Act(s) iff ∑t∈SP(s,α,t)=1. Whenever the process is in a state s, a non-deterministic choice between the enabled actions Act(s) has to be made.
We call a state absorbing if the only enabled actions lead to the state itself with probability 1 and weight [math]. If there are no enabled actions, we call a state a trap.
The paths of M are finite or
infinite sequences s0α0s1α1s2α2…
where states and actions alternate such that
P(si,αi,si+1)>0 for all i≥0. Throughout this section, we assume that all states are reachable from the initial state in any MDP, i.e., that there is a finite path from sinit to each state s.
We extend the weight function to finite paths.
For a finite path π=s0α0s1α1…αk−1sk,
we denote its accumulated weight by
[TABLE]
Similarly, we extend the transition probability function to finite paths and write
[TABLE]
A one-counter MDP is an MDP equipped with a counter. Each state-action pair increases or decreases the counter or leaves the counter unchanged. A one-counter MDP is said to terminate if the counter value drops below zero. We view one-counter MDPs as MDPs with a weight-function wgt:S×Act→{−1,0,+1}. In this formulation a one-counter MDP terminates when a prefix π of a path satisfies wgt(π)<0.
A Markov chain is an MDP in which the set of actions is a singleton. There are no non-deterministic choices in a Markov chain and hence we drop the set of actions. Consequently, a Markov chain is a tuple M=(S,P,sinit), possibly extended with a weight function, a labeling, or a designated set of goal states. The transition probability function P is a function from S×S to [0,1]∩Q such that ∑t∈SP(s,t)∈{0,1} for all s∈S.
Scheduler.
A scheduler for an MDP M=(S,Act,P,sinit)
is a function S that assigns to each finite path π not ending in trap state
a probability distribution over Act(last(π)) where last(π) denotes the last state of π. This probability distribution indicates which of the enabled actions is chosen with which probability under S after the process has followed the finite path π.
We allow schedulers to be randomized and history-dependent. By restricting the possibility to randomize over actions or by restricting the amount of information from the history of a run that can affect the choice of a scheduler, we obtain the following types of schedulers:
A scheduler S is called deterministic if it does not make use of the possibility to randomize over actions, i.e., if S(π) is a Dirac distribution
for each path π.
Such a scheduler S can be viewed as a function that assigns an action
to each finite path π.
A scheduler
S is called memoryless if S(π)=S(π′) for
all finite paths π, π′ with last(π)=last(π′).
In this case, S can be viewed as a function
that assigns to each state s a distribution over Act(s).
A memoryless deterministic scheduler hence can be seen as a function from states to actions.
In an MDP with a weight function, a scheduler S is said to be weight-based if
S(π)=S(π′) for all finite paths π, π′
with wgt(π)=wgt(π′) and last(π)=last(π′).
Such a scheduler assigns distributions over actions to state-weight pairs from S×Act.
Finally, let X be a finite set of memory modes with initial mode xinit and U:X×S×Act×S→X a memory update function.
From a finite path π=s0α0s1α1…αk−1sk we can extract a sequence of memory modes x0…xk. We let x0=xinit, and xi+1=U(xi,si,αi,si+1) for all i<k. Let us denote the last memory mode xk after the finite path π by U(xinit,π).
A scheduler S is a finite-memory scheduler if there is such a finite set of memory modes X with an initial mode xinit and an update function U such that S(π)=S(π′) for all finite paths π, π′
with U(xinit,π)=U(xinit,π′) and last(π)=last(π′).
Probability measure.
Given an MDP M=(S,Act,P,sinit) and a scheduler S, we obtain a probability measure PrM,sS on the set of maximal paths of M that start in s:
For each finite paths π=s0α0s1α1…αk−1sk with s0=s, we denote the cylinder set of all its maximal extensions by Cyl(π). The probability mass of this cylinder set is then given by
[TABLE]
Recall that S(s0…si) is a probability distribution over actions and that S(s0…si)(αi) denotes the probability that the scheduler S chooses action α after the prefix s0…si of π.
The set of cylinder sets forms the basis of the standard tree topology on the set of maximal paths. By Carathéodory’s extension theorem, we can extend the pre-measure PrM,sS(Cyl(π)) defined on the cylinder sets to a probability measure on the Borel σ-algebra of the space of maximal paths with the standard tree topology.
We sometimes drop the subscript s if s is the initial state sinit of M. In a Markov chain N, we drop the reference to a scheduler and write PrN,s.
Let X be a random variable on the set of maximal paths of M starting in s, i.e., X is a function assigning values from R∪{−∞,+∞} to maximal paths. We denote the expected value of X under the probability measure PrM,sS by EM,sS(X).
The values we are typically interested in are the worst- or best-case probabilities of an event or the worst- or best-case expected values of a random variable. Worst or best case refers to the possible ways to resolve the non-deterministic choices. Hence, these values are formally expressed by taking the supremum or infimum over all schedulers. Given an MDP M, a state s, and an event, i.e., a set of maximal paths, E, or a random variable X on the maximal paths of M, we define
[TABLE]
where inf and sup range over all schedulers S for M.
We use LTL-like notation such as “◊(accumulated weight <0)” to denote the event that a prefix of a path has a negative accumulated weight. Note that this event expresses the termination of a one-counter MDP in our view of one-counter MDPs as MDPs with a weight-function taking only values in {−1,0,+1}.
Classical stochastic shortest path problem.
Let M be an MDP with
a weight function wgt:S×Act→Z and
a designated set of terminal goal states Goal.
We define the following random variable \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpictureGoal on maximal paths ζ of M as follows:
[TABLE]
The expected accumulated weight before reaching Goal under a scheduler S is given by the expected value EM,sinitS(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpictureGoal). It is evident that this expected value is only defined if PrM,sinitS(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpictureGoal)=1. The classical stochastic shortest path problem asks for the optimal value
[TABLE]
where the supremum ranges over all schedulers S with PrM,sinitS(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpictureGoal)=1.
The classical stochastic shortest path problem can be solved in polynomial time [BT91, dA99, BBD*+*18].
3 Outline of the Positivity-hardness proofs
The Positivity-hardness results in this paper are obtained by sequences of reductions depicted in Figure 1. The key steps for these sequences are the three direct reductions from the Positivity-problem to the threshold problems for the maximal termination probability of one-counter MDPs, the maximal partial expectation, and the maximal conditional value-at-risk, respectively.
3.1 Structure of the MDP constructed for the direct reductions from the Positivity problem
The three direct reductions from the Positivity problem (at the top of Figure 1) follow a modular approach: The MDPs constructed for the reductions are obtained by putting together three gadgets as sketched in Figure 2. One gadget encodes a linear recurrence relation exploiting the dependency of optimal values from different starting states after different amounts of weight have been accumulated in the history of a run onto each other. A second gadget encodes the initial values of a linear recurrence sequence. Together, these two gadget allow us to encode linear recurrence sequences. Finally, an initial gadget is added in which each positive amount of weight w is accumulated with positive probability. Afterwards, the gadget is left and a scheduler has to
decide how to leave the initial gadget. The optimal decision if weight w has been accumulated directly corresponds to whether the wth member of the given linear recurrence sequence is non-negative.
More precisely, let a rational linear recurrence sequence be given in terms of the initial values u0,…,uk−1 and the coefficients α1,…,αk of the linear recurrence relation.
The three gadgets are connected via two states s and t as depicted in Figure 2.
In state t and s, actions γ0,…,γk−1 and δ0,…,δk−1, respectively, leading to the gadget encoding the initial values and action γ and δ, respectively leading to the gadget encoding the linear recurrence relation are enabled. The gadgets will be constructed such that an optimal scheduler has to choose action γi or δi if the accumulated weight in state t or s is a value i with 0≤i<k and that it has to choose action γ if the accumulated weight is at least k.
After γ or δ is chosen, the accumulated weight is decreased within the gadget encoding the linear recurrence relation before the MDP moves back to the states s and t with positive probability.
Let us now denote the maximal possible value for the quantity of interest when starting in one of the states t and s with accumulated weight w by V(t,w) and V(s,w). The linear recurrence relation will be found in the difference d(w)=defV(t,w)−V(s,w). If the accumulated weight is 0≤i<k, the gadget encoding the initial values will make sure that d(i)=V(t,i)−V(s,i)=ui. For each of the three direct reductions from the Positivity problem, we construct one such gadget tailored to the three respective quantities.
For accumulated weights w of at least k, the gadget encoding the recurrence will exploit the dependency of the optimal values V(t,w) and V(s,w) on the optimal values when starting with lower accumulated weight. This gadget can be used in all reductions and will be described in the next subsection.
Put together, these two gadgets ensure that d(w)=uw for all w≥0. To complete the reductions, we add an initial gadget I depicted in Figure 3 in which each positive amount of weight w is accumulated with positive probability. Afterwards, a scheduler has to choose whether to move to state t or state s via the actions τ and σ, respectively. It is optimal to move to t if and only if uw≥0. Let now S be the scheduler always choosing τ in the initial gadget and afterwards behaving optimally when choosing from γ0,…,γk−1 and γ or δ0,…,δk−1 and δ as described above. This scheduler is optimal if and only if the given linear recurrence sequence is non-negative.
The final step to complete the reduction is to compute the value VS(sinit,0) that is achieved by S starting from the initial state. In all three reductions, we can compute this rational value via converging matrix series. The optimal value Vmax(sinit,0) that can be achieved from the initial state now satisfies
[TABLE]
if and only if the given linear recurrence sequence is non-negative.
3.2 MDP-gadget for linear recurrence relations
In this section, we demonstrate how to construct the gadget ensuring that the difference of optimal values V(s,w)−V(t,w) follows a given linear recurrence relation with respect to different weight levels w.
In the next section, the initial values of a linear recurrence sequence will be encoded in MDP-gadgets tailored to the different quantities we address.
Optimality equations.
Let us start by the following observations on the well-known relation between the optimal values at different states in the classical stochastic shortest path problem, i.e., the maximal expected accumulated weights before reaching a goal state (defined in Section 2).
Let M=(S,Act,P,sinit,wgt,Goal) be an MDP.
The solution to the classical stochastic shortest path problem satisfies the so called Bellman equation. If V(s) denotes the value when starting in state s, i.e., the maximal expected accumulated weight before reaching Goal from state s, then
[TABLE]
for s∈Goal and V(s)=0 for s∈Goal. This simple form of optimality equation implies the existence of optimal memoryless deterministic schedulers for the classical stochastic shortest path problem
(in case optimal schedulers exist, i.e., if the optimal values are finite).
For problems like the optimization of the termination probability of one-counter MDPs, it is, however, clearly not sufficient to consider the optimal values only in dependency of the starting state. The counter-value, i.e. the weight that has been accumulated so far, is essential.
So, let V(s,w) denote the maximal termination probability of a one-counter MDP when starting in state s with counter value w. Letting V(s,w)=1 if w<0, we obtain the following equation for all states s and all values w≥0:
[TABLE]
Already in this equation, the value V(s,w) hence possibly depends on values of the form V(s,w−i) for some i.
We want to exploit this interrelation to encode linear recurrence relations
[TABLE]
into the optimal values V(s,w).
Of course, the values P(s,α,t) are all non-negative. So, we cannot directly encode a linear recurrence into the optimal values for different weight levels at one state as the coefficients might be negative.
To overcome this problem, we instead consider the difference V(s,w)−V(t,w) for two different states s and t.
Scaling down coefficients of a linear recurrence sequence.
Given the coefficients α1,…,αk, and initial values u0=β0, …, uk−1=βk−1 of a linear recurrence sequence, we have to assume that these are all sufficiently small for the following constructions.
So, let us clarify why we can assume this without loss of generality and let us provide precise bounds.
Let (un)n≥0 be a linear recurrence sequence specified by the initial values u0=β0, …, uk−1=βk−1 and the linear recurrence relation
un+k=α1un+k−1+⋯+αkun
for all n≥0.
For any μ>0 and λ>0, the sequence (vn)n≥0 defined by vn=μ⋅λn⋅un for all n
is non-negative if and only if (un)n≥0 is non-negative. Furthermore, it satisfies vi=μ⋅λi⋅βi for i<k and
[TABLE]
By choosing λ and μ appropriately, we can scale down the initial values and coefficients of the recurrence relation for any given input.
To obtain precise bounds that will be used throughout the following sections, let α=def∑i=1k∣αi∣.
and let λ=defα⋅(5k+5)1.
The value λ can be computed in polynomial time.
As the numerical value of k is linear in the size of the given original input, the coefficients α1′=defλ⋅α1,α2′=defλ2⋅α2,…,αk′=defλk⋅αk of the linear recurrence of the sequence
(vn)n≥0 can be computed in polynomial time as well. The choice of λ ensures that ∑i=1k∣αi′∣<5k+51.
Let now α′=def∑i=1k∣αi′∣ and β=defmax0≤j<k∣βj∣.
We can choose μ=def4k2k+2⋅βmin(α′,1). Again, since the value k is linear in the size of the original input, μ can be computed in polynomial time.
The initial values of the new sequence (vn)n≥0 are now βi′=defvi=μ⋅λi⋅βi for i<k, computable in polynomial time.
The choice of μ guarantees that max0≤j<kβj′<min(4k2k+21,4α′).
Since this transformation can be carried out in polynomial time, we can w.l.o.g. from now on work under the following assumption:
Assumption 3.1**.**
Given the coefficients α1,…,αk, and initial values u0=β0, …, uk−1=βk−1 of a linear recurrence sequence, we assume that
[TABLE]
MDP-gadget for linear recurrence relations.
Given the coefficients α1,…,αk of a linear recurrence relation satisfying Assumption 3.1,
we construct the MDP-gadget depicted in Figure 4.
The gadget contains states s, t, and trap as well as s1,…,sk and t1,…,tk. In state t, an action γ is enabled which has weight [math] and leads to state ti with probability αi if αi>0 and to state si with probability ∣αi∣ if αi<0 for all i. The remaining probability leads to trap. From each state ti, there is an action leading to t with weight −i. The action δ enabled in s as well as the actions leading from states si to s are constructed in the analogously. If αi is negative, action δ reaches state ti with probability ∣αi∣. Otherwise it reaches si with probability αi.
The state trap is absorbing.
As the gadget depends on the inputs αˉ=(α1,…,αk), we call it Gαˉ.
This gadget Gαˉ will be integrated into MDPs without further outgoing edges from states s1,…,sk,t1,…,tk.
For any optimization problem for which the optimal values V depend on the state and the weight accumulated so far and satisfy equation (∗), we can encode a linear recurrence in an MDP containing this gadget (and possibly further actions for state t and s):
If we know that an optimal scheduler chooses action γ in state t and action δ in state s if the accumulated weight is w, then
[TABLE]
Note that this linear recurrence relation also holds for the optimal values in the classical stochastic shortest path problem for example. So, the gadget alone is not yet enough for a hardness proof.
The missing ingredient is the encoding of the initial values of a linear recurrence sequence.
In order to include the encoding of the initial values in our approach, it is necessary that optimal schedulers cannot be chosen to be memoryless.
The optimal decisions have to depend on the weight that has been accumulated in the history of a run. If this is the case, we aim to encode the initial values by adding further outgoing actions to the states t and s. By fine-tuning the weights and probabilities of these actions, we can achieve that for small weights w some of the new actions are optimal while for large weights the actions γ and δ of the gadget are optimal. If we manage to design the other actions such that the differences V(t,w+i)−V(s,w+i) are equal to given starting values βi for a sequence of weights w,w+1,…,w+k−1 while actions γ and δ are optimal for weights of at least w+k, we can encode arbitrary linear recurrence sequences.
This is the goal of the subsequent section.
4 Reductions from the Positivity problem
To encode initial values of a linear recurrence sequence, we construct further MDP gadgets.
For the termination probability and expected termination time of one-counter MDPs and for partial expectations, we can construct these gadgets directly.
For the conditional value-at-risk, we use an intermediate auxiliary random variable.
Putting together these gadgets with the gadget Gαˉ from the previous section, we obtain the basis for the Positivity-hardness results of the respective threshold problems.
The Positivity-hardness of the remaining problems is obtained as a consequence of these results via further reductions. An overview of the chains of reductions used is presented in Figure 1.
4.1 One-counter MDPs, energy objectives, cost problems, and quantiles
The first problem we will show to be Positivity-hard is the threshold problem for the optimal termination probability of one-counter MDPs. From this result, Positivity-hardness results for energy objectives, cost problems, and the computation of quantiles follow easily. Afterwards, we adjust the reduction to show Positivity-hardness of the threshold problem for the optimal expected termination time of almust-surely terminating one-counter MDPs.
Termination probability of one-counter MDPs.
We formulated the termination of a one-counter MDP in terms of weighted MDPs M. Recall that a one-counter MDP terminates if the counter value drops below zero. If we consider the weight that has been accumulated instead of the counter value, the quantities we are interested are PrMopt(◊ accumulated weight<0) for opt=max and opt=min.
The main result we prove in this section is the following:
Theorem 4.1**.**
The Positivity problem is reducible in polynomial time to the following problems:
Given an MDP M and a rational ϑ∈(0,1),
-
decide whether \mathrm{Pr}^{\max}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))>\vartheta.
2. 2.
decide whether
\mathrm{Pr}^{\min}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))<\vartheta.
Note that if weights are encoded in unary, we can transform a weighted MDP to a one-counter MDP that can only increase or decrease the counter value by 1 in each step in polynomial time.
The MDPs that are constructed from a linear recurrence sequence of depth k in the proof of Theorem 4.1 will contain only weights with an absolute value of at most k.
So, they can be transformed to one-counter MDPs in time linear in the size of the original input and we conclude that the following two threshold problems for the optimal termination probability of one-counter MDPs are Positivity-hard:
Corollary 4.2**.**
The Positivity problem is reducible in polynomial time to the following problems:
Given a one-counter MDP M viewed as an MDP with weights in {−1,0,+1} and a rational ϑ∈(0,1),
-
decide whether
\mathrm{Pr}^{\max}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))>\vartheta.
2. 2.
decide whether
\mathrm{Pr}^{\min}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))<\vartheta.
Among the direct reductions from the Positivity problem we present, the construction of the gadget encoding the initial values of a linear recurrence sequence is arguably the simplest for these optimal termination probabilities.
In the formulation with weighted MDPs, the termination of a one-counter MDP is moreover the complement of the energy objective “□ accumulated weight≥0”.
We will first prove Positivity-hardness for the threshold problem for maximal termination probabilities and line out the necessary adjustments to show Positivity-hardness also for the threshold problem for minimal termination probabilities afterwards.
We split the proof of Theorem 4.1 into four parts. First, we provide the construction of an MDP from a linear recurrence sequence. Then, we show that the linear recurrence sequence is correctly encoded in this MDP in terms of the maximal termination probabilities. To complete the proof of item 1, we then show how to compute the threshold ϑ for the threshold problem. Finally, we show how to adapt the construction to prove hardness of the threshold problem for minimal termination probabilities.
Proof of Theorem 4.1(1): construction of the MDP.
Given a linear recurrence sequence in terms of the rational coefficients α1,…,αk of the linear recurrence relation as well as the rational initial values
β0,…,βk−1 for k≥2, our first goal is to construct an MDP M and a rational ϑ∈(0,1) such that
[TABLE]
By Assumption 3.1, we can assume that the input values are sufficiently small. More precisely, we assume that
∑i=1k∣αi∣<1/(k+1) and that 0≤βj<1/(k+1) for all 0≤j≤k−1, which is ensured by the bounds in Assumption 3.1.
We denote the maximal termination probabilities in terms of the current state s and counter value (accumulated weight) w by p(s,w).
More precisely, in an MDP M for w≥0, we define
[TABLE]
The values p(s,w) in an MDP with state space S now satisfy the optimality equation (∗) from Section 3.2 (where p(s,w) takes the role of V(s,w) in (∗)), which we restate here for convenience. We have p(s,w)=1 for all states s and all w<0 and
[TABLE]
So, to capture the linear recurrence relation, we will be able to make use of the gadget Gαˉ from Section 3.2.
The missing ingredient is a gadget to encode the initial values of a linear recurrence sequence.
The new gadget Oβˉ encoding the initial values βˉ is depicted in Figure 5 and works as follows:
For 0≤j≤k−1, the action γj enabled in t leads to state xj with probability k+1k−j+βj. By assumption on βj, this probability is less than k+1k−j+1. The remaining probability leads to trap.
In state s, the action δj leads to yj with probability k+1k−j and to trap with the remaining probability.
For 0≤j≤k−1, one reaches trap from xj and yj with probability 1 and a counter change of −(j+1).
Now, we glue together the initial gadget I defined in Section 3.1, the gadget encoding the linear recurrence relation Gαˉ from Section 3.2, and the new gadget Oβˉ at states t, s, and trap.
The resulting MDP M is depicted in Figure 6 – for better readability, it is depicted for k=2 and assuming that α1≥0 while α2<0.
Proof of Theorem 4.1(1): correctness of the encoding of the linear recurrence sequence.
In this paragraph, we show that the initial linear recurrence sequence is indeed encoded in the maximal termination probabilities when starting from states t and s with different counter values, i.e., values of accumulated weight as described in Section 3.1.
More precisely, let (un)n≥0 be the linear recurrence sequence given by the initial values β0,…βk−1 and the coefficients α1,…,αk of the linear recurrence relation. We prove the following:
Lemma 4.3**.**
For each w≥0, we have
[TABLE]
where p(r,w) denotes the maximal termination probability from state r∈{s,t} when starting with accumulated weight w as defined above.
Proof.
For the correct interplay of the gadgets Gαˉ and Oβˉ, the optimal decisions in states t and s for different values of accumulated weights, i.e., different counter-values, are crucial.
In order to terminate, the accumulated weight has to drop below [math] before reaching trap. As soon as the trap state is reached with non-negative accumulated weight, the process cannot terminate anymore. The optimal decision in order to maximize the termination probability in state t is now easy to determine.
Let ℓ be the current weight. If 0≤ℓ≤k−1, choosing action γ leads to termination with probability less than 1/(k+1) as trap is reached immediately with probability at least k/(k+1) due to our assumption that ∑i≤k∣αi∣<1/(k+1). Choosing action γj makes it impossible to terminate if ℓ>j. If ℓ≤j, then choosing γj lets the process terminate if xj is reached. This happens with probability k+1k−j+βj. As βj<1/(k+1) for all j, the maximal termination probability is reached when choosing γℓ.
If ℓ≥k, then γj leads to termination with probability [math] for all j. Hence, action γ is optimal. Analogously, we see that the optimal choice in state s with weight ℓ is δℓ if ℓ≤k−1 and δ otherwise.
The linear recurrence sequence (un)n≥0 now can be found in terms of the difference
[TABLE]
For counter value w≤k−1, we have seen that γw and δw, respectively, are the optimal actions. Hence, d(w)=uw in this case as we have just seen that the optimal termination probability when starting with weight w≤k−1 is k+1k−w+βw in t and k+1k−w in s.
Furthermore, for w>k−1, actions γ and δ are optimal. So,
by the discussion in Section 3.2, the sequence of differences satisfies the linear recurrence relation given by α1,…,αk.
Therefore, d(w)=uw for all w≥0.
∎
Proof of Theorem 4.1(1): computation of the threshold ϑ.
The state choice is reached with any positive accumulated weight with positive probability.
For the optimal choices in the state choice with accumulated weight w, we observe that choosing τ is optimal iff d(w)≥0. By Lemma 4.3, this holds if and only if uw≥0.
Consider now the scheduler S which always chooses τ in state choice and afterwards behaves according to the optimal choices as described in the proof of Lemma 4.3.
This scheduler S is optimal if and only if the sequence (un)n≥0 is non-negative.
To complete the reduction, we will compute the value
[TABLE]
We will see that ϑ is a rational computable in polynomial time and we know that
[TABLE]
if and only if the scheduler S is optimal which is the case iff (un)n≥0 is non-negative.
Lemma 4.4**.**
In the constructed MDP M,
the value ϑ=PrM,sinitS(◊(accumulated weight <0)) can be computed in polynomial time.
Proof.
In order to compute the value ϑ,
we first provide a recursive expression of the maximal termination probabilities p(t,w) and p(s,w). By the definition of S, these are precisely the termination probabilities under S when starting from t or s with some positive accumulated weight w∈N because S behaves optimally as soon as state t or s has been reached.
For this recursive expression, we consider the following Markov chain C for n∈N that is also depicted in Figure 7 – for better readability, it is depicted for the case k=2 there:
The Markov chain C has 5k states named t−k+1, …, t+k, s−k+1, …, s+k, and goal+1, …, goal+k.
States t−k+1, …, t0, s−k+1, …, s0, and goal+1, …, goal+k are terminal.
For 0<i,j≤k, there are transitions from t+i to t+i−j with probability αj if αj>0, to s+i−j
with probability ∣αj∣ if αj<0, and to goal+i with probability 1−∣α1∣−…−∣αk∣. Transitions from s+i are defined analogously.
The idea behind this Markov chain is that the reachability probabilities describe how, for arbitrary n∈N and 1≤i≤k,
the values p(t,nk+i) and p(s,nk+i) depend on the values p(t,(n−1)k+j) and p(s,(n−1)k+j) for 1≤j≤k.
The transitions in C behave as γ and δ in M, but the decrease in the accumulated weight is explicitly encoded into the state space.
Namely, for n∈N and 0<i≤k, we have
[TABLE]
and analogously for p(s,nk+i).
We now group the optimal values together in the following column vectors
[TABLE]
for n∈N. In other words, this vector contains the optimal values for the partial expectation when starting in t or s with an accumulated weight from {nk,…,nk+k−1}.
The vector v0 is the column vector
[TABLE]
and these values occur as transition probabilities in M under the actions γk−1,…,γ0 and δk−1,…,δ0.
As the reachability probabilities in C are rational and computable in polynomial time, we conclude from equation (∗ ‣ 4.1)
that there is a matrix A∈Q2k×2k computable in polynomial time such that
vn+1=Avn
for all n∈N.
So, vn=Anv0 for all n∈N.
As state choice is reached with weight w with probability (1/2)w for all w≥1,
the value ϑ=∑w=1∞(1/2)wp(t,w).
Let c=(2k1,2k−11,…,211,0,…,0).
Observe that for all n∈N,
[TABLE]
Hence, we can write
[TABLE]
We have to subtract p(t,0) as the state choice cannot be reached with weight [math], but the summand 1⋅p(t,0) occurs in the sum. As p(t,0)=k+1k+β0, this does not cause a problem.
We claim that the matrix series involved converges to a rational matrix.
We observe that the maximal row sum in A is at most ∣α1∣+…+∣αk∣<1 because the rows of the matrix contain exactly the probabilities to reach t0, …t−k+1, s0, …, and s−k+1 from a state t+i or s+i in C for 1≤i≤k. But the probability to reach goal+i from these states is already 1−∣α1∣−…−∣αk∣. Hence, ∥A∥∞, the operator norm induced by the maximum norm ∥⋅∥∞, which equals maxi∑j=12k∣Aij∣, is less than 1.
So, in particular, also ∥2k1A∥∞<1 and hence the Neumann series ∑n=0∞(2k1A)n converges to (I2k−2k1A)−1 where I2k is the identity matrix of size 2k×2k. So,
[TABLE]
is computable in polynomial time.
∎
All in all, this finishes the proof of point (1) of Theorem 4.1:
We have seen that the MDP M and the threshold ϑ can be constructed in time polynomial in
the size of the representations of α1,…,αk and β0,…,βk−1.
As
ϑ=PrM,sinitS(◊(accumulated weight <0)),
we furthermore know that
[TABLE]
if and only if the scheduler S is not optimal. By Lemma 4.3, this is the case if and only if
the given linear recurrence sequence (un)n≥0 has a negative member.
Finally, we want to emphasize again that the absolute values of the weights in the constructed MDP are at most k. Hence, if we want to view M as a one-counter MDP in which the counter value can only be increased or decreased by 1 in each step, the constructed MDP becomes only polynomially larger after we replace the transitions with a weight +w or −w for a 1≤w≤k by a sequence of w states decreasing or increasing the counter value, which allowed us to conclude Corollary 4.2.
Proof of Theorem 4.1(2).
The construction we provided so far shows that the threshold problem for the maximal termination probability of one-counter MDPs is Positivity-hard.
Using exactly the same ideas, we can show that the threshold problem for the minimal termination probability is Positivity-hard as well. Let us describe the necessary changes in the construction that are also depicted in Figure 8.
We rename the state trap to trap′ and add a transition with weight −k to a new absorbing state trap.
For all 0≤j≤k−1, now state trap is reached directly with probability 1 and weight −j from the states xj and yj.
Furthermore, the probability to reach xj when choosing γj in t is changed to k+1j+1+βj and the probability to reach trap′ is adjusted accordingly. The analogous change is performed for δj.
Now, it is easy to check that the optimal choice to minimize the termination probability in state t is to choose γ if the accumulated weight is ≥k. In this case the probability of termination is less than k+11. If the accumulated weight is 0≤ℓ<k, the optimal choice is γℓ. The analogous result holds in state s. From then on the proof is analogous to the proof
for the maximal termination probability
with the change that we have to consider the scheduler S always choosing σ in the state choice this time.
This scheduler is optimal to minimize the termination probability if and only if the given linear recurrence sequence is non-negative.
With these adjustments, we conclude:
Corollary 4.5**.**
The Positivity problem is reducible in polynomial time to the following problem:
Given an MDP M and a rational ϑ∈(0,1), decide whether
[TABLE]
Remark 4.6**.**
There is no obvious way to adjust the construction such that the Positivity-hardness of the question whether \mathrm{Pr}^{\max}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))\geq\vartheta would follow. One attempt would be to provide an ε such that \mathrm{Pr}^{\max}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))>\vartheta if and only if \mathrm{Pr}^{\min}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))\geq\vartheta+\varepsilon. This, however, probably requires a bound on the position at which the given linear recurrence sequence first becomes negative. But this question lies at the core of the Positivity problem. The analogous observation applies to
the question whether
\mathrm{Pr}^{\min}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\lozenge(\text{accumulated weight <0}))\leq\vartheta and all Positivity-hardness results in the sequel.
Energy objectives.
As the energy objective \Box(\text{accumulated weight \geq 0}) is satisfied if and only if \lozenge(\text{accumulated weight <0}) does not hold, the Positivity-hardness of the threshold problem of the optimal satisfaction probability of an energy objective follows easily. As
[TABLE]
we conclude:
Corollary 4.7**.**
The Positivity problem is reducible in polynomial time to the following problems:
Given an MDP M and a rational ϑ∈(0,1),
-
decide whether
\mathrm{Pr}^{\max}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\Box(\text{accumulated weight \geq 0}))>\vartheta.
2. 2.
decide whether
\mathrm{Pr}^{\min}_{\mathcal{M},s_{\mathit{\scriptscriptstyle init}}}(\Box(\text{accumulated weight \geq 0}))<\vartheta.
Cost problems and quantiles.
The proof of the Positivity-hardness of the threshold problem for the termination probability of one-counter MDPs in fact also serves as a proof that cost problems and the computation of quantiles of the accumulated weight before reaching a goal state are Positivity-hard.
Observe that in the MDP constructed for Theorem 4.1 and Corollary 4.5, almost all paths ζ under any scheduler satisfy
\lozenge(\text{accumulated weight <0}) iff they satisfy \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturetrap(ζ)<0 iff their total accumulated weight is less than [math]. Thus, we obtain the following corollary:
Corollary 4.8**.**
The Positivity problem is reducible in polynomial time to the following problems:
Given an MDP M with a designated set of trap states Goal and a rational ϑ∈(0,1),
-
decide whether
PrM,sinitmax(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpictureGoal<0)>ϑ.
2. 2.
decide whether
PrM,sinitmin(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpictureGoal<0)<ϑ.
The analogous result also holds for the total accumulated weight.
Termination times of one-counter MDPs.
To conclude the section, we show that not only the threshold problems for optimal termination probabilities, but also for the optimal expected termination times in one-counter MDPs that terminate almost surely is Positivity-hard.
We again work with weighted MDPs.
Let T be the random variable that assigns to each path in a weighted MDP M the length of the shortest prefix π such that wgt(π)<0.
To reflect precisely the behavior of a one-counter MDP, we now will work with MDPs where the weight is reduced or increased by at most 1 in each step. We make a small change to the MDP constructed for the proof of Corollary 4.5 that is depicted in Figure 8. The initial component (that is not depicted) stays unchanged. For the remaining transitions, all transition reduce the weight or leave it unchanged. The transitions with weight [math] do not occur directly after each other except for the loop at the state trap that we adjust in a moment. Hence, we can add additional auxiliary states such that along each path starting from s or t not reaching the state trap, the weight is left unchanged and reduced by 1 in an alternating fashion.
So, if a path starts in state s or t with accumulated weight w and terminates (i.e. reaches accumulated weight −1) before reaching the state trap this takes 2(w+1) steps.
Now, we replace the loop at the state trap by the gadget depicted in Figure 9 and let us call the resulting MDP N.
So, when reaching trap the accumulated weight is increased by 1 before it is reduced in every other step until termination.
That means that if a path starting in state s or t with weight w does not terminate before reaching trap, the termination time is 2(w+1)+3 steps.
Now, let S be a scheduler and denote the probability not to terminate before reaching trap under S by pS. For the expected termination time T in N, we now have
[TABLE]
The summands (1/2)i(i+2(i+1)) correspond to the probability to accumulated weight i in the initial component which takes i steps and the 2(i+1) steps needed to terminate by alternatingly leaving the weight unchanged and reducing it by 1. The three additional steps after trap occur precisely with probability pS.
Not terminating before trap corresponds exactly to not terminating at all in the MDP constructed for Corollary 4.5. The termination probability there is hence 1−pS for any scheduler. It is hence possible to terminate with a probability less than ϑ in that MDP if and only if it is possible to reach an expected termination time of more than 10−3ϑ in N.
By Corollary 4.5 and the fact that termination is reached almost surely in N under any scheduler, we hence conclude:
Corollary 4.9**.**
Let M be a one-counter MDP with initial state sinit that terminates almost surely under any scheduler, let ϑ be a rational, and let T be the random variable assigning the termination time to runs. The Positivity problem is polynomial-time reducible to the problem whether
[TABLE]
The analogous argument with similar changes to the MDP used in the proof of Theorem 4.1 can be used to show the analogous result for the problem whether EM,sinitmin(T)<ϑ.
4.2 Partial and conditional stochastic shortest path problems
Our next goal is to prove that the partial and conditional SSPPs are Positivity-hard. We start by providing a formal definition of the decision versions of these two problems.
Let M be an MDP with a designated set of terminal states Goal.
We define the random variable ⊕Goal on maximal paths ζ of M:
[TABLE]
The objective in the partial SSPP is to maximize the expected value of ⊕Goal which we call the partial expected accumulated weight, or partial expectation for short, i.e., to compute the value
[TABLE]
where the supremum ranges over all schedulers S.
The threshold problem asks, given a rational ϑ, whether
[TABLE]
Note that the minimization of the partial expectation can be reduced to the maximization by multiplying all weights in M with −1.
The conditional expectation under a scheduler S that reaches Goal with positive probability is the value
[TABLE]
Again, we are interested in the maximal value
[TABLE]
where the supremum ranges over all schedulers S with PrMS(◊Goal)>0.
Consequently, the threshold problem asks for a given rational ϑ whether
[TABLE]
Again, multiplying all weights with −1 reduces the minimization of the conditional expectation to the maximization.
Furthermore, given a further set of states F, the problem to maximize EMS(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpictureGoal∣◊F) among all schedulers S that reach F with positive probability can be reduced to the conditional SSPP in our formulation as shown in [BKKW17]222In [BKKW17], only MDPs with non-negative weights are considered. The reduction of [BKKW17], however, does not require the restriction to non-negative weights..
Partial SSPP.
In the sequel, we will provide a direct reduction from the Positivity problem to the partial SSPP using out modular approach via MDP-gadgets to prove the following result:
Theorem 4.10**.**
The Positivity problem is polynomial-time reducible to the decision version of the partial SSPP, i.e., the question whether
[TABLE]
for a given MDP M and a given rational ϑ.
Again, we split up the proof of the theorem into the construction of the MDP with the proof of the correctness of the encoding of the linear recurrence sequence and the computation of the threshold ϑ.
Proof of Theorem 4.10: construction of the MDP and correctness of the encoding of a linear recurrence sequence.
Let k be a natural number and let (un)n≥0 be the linear recurrence sequence given by rationals αi for 1≤i≤k and βj for 0≤j≤k−1 via u0=β0, …, uk−1=βk−1 and
un+k=α1un+k−1+⋯+αkun
for all n≥0.
By Assumption 3.1, we can assume w.l.o.g. that ∑i∣αi∣<41 and that 0≤βj<4k2k+21 for all j.
We begin by constructing a gadget Pβˉ that encodes the initial values β0,…,βk−1.
The gadget is depicted in Figure 10 and contains states t, s, goal, and fail. For each 0≤j≤k−1, it additionally contains states xj and yj. In state xj, there is one action enabled that leads to goal with probability 2k2(k−j)1+βj and to fail otherwise. From state yj, goal is reached with probability 2k2(k−j)1 and fail otherwise. In state t, there is an action γj leading to xj with weight k−j for each 0≤j≤k−1. Likewise, in state s there is an action δj leading to yj with weight k−j for each 0≤j≤k−1.
We furthermore reuse the initial gadget I and the gadget encoding the linear recurrence relation Gαˉ from the previous section.
In the gadget Gαˉ, we rename the absorbing state trap to the terminal state goal which is the target state for the partial SSPP.
As before, we glue together the three gadgets I, Gαˉ and Pβˉ at states s, t, and goal.
Let us call the full MDP that we obtain in this way M which is depicted in Figure 11. We denote the state space by S.
The cumbersome choices of probability values lead to the following lemma showing the correct interplay between the gadgets constructed via straight-forward computations.
Lemma 4.11**.**
Consider the full MDP M. Let 0≤j≤k−1. Starting with weight −(k−1)+j in state t or s, action γj and δj maximize the partial expectation. For positive starting weight, γ and δ are optimal.
Proof.
Suppose action γi is chosen in state t when starting with weight −(k−1)+j. So, state xi is reached with weight −(k−1)+j+(k−i)=1+j−i. Then the partial expectation achieved from this situation is
[TABLE]
For i>j this value is ≤0 and hence γi is certainly not optimal. For i=j, we obtain a partial expectation of
[TABLE]
For i<j, state xi is reached with weight 1+j−i≤k. Further, βi≤4k2k+21 and 2k2(k−i)1≤2k2(k−j)⋅k21. So, the partial expectation obtained via γi is at most
[TABLE]
So, indeed action γj maximizes the partial expectation among the actions γi with 0≤i≤k−1 when the accumulated weight in state t is −(k−1)+j.
The argument for state s is the same with βi=0 for all i.
It is easy to see that for accumulated weight −(k−1)+j with 0≤j≤k−1 actions γ or δ are not optimal in state t or s: If goal is reached immediately, the weight is not positive and otherwise states t or s are reached with lower accumulated weight again. The values βj are chosen small enough such that also a switch from state t to s while accumulating negative weight does not lead to a higher partial expectation.
For positive accumulated weight w, the optimal partial expectation when choosing γ first is at least 43w by construction and the fact that a positive value can be achieved from any possible successor state via one of the actions γj and δj with 0≤j≤k−1. Choosing γj on the other hands results in a partial expectation of at most (k+w)⋅(4k2k+21+2k21) which is easily seen to be less as k≥2.
∎
For each weight w, denote by e(t,w) and e(s,w) the optimal partial expectation when starting in state t or s with accumulated weight w in M as if the respective state was reached from the initial state with weight w and probability 1.
For each weight w≥−k+1, denote by d(w) the difference e(t,w)−e(s,w) between these optimal partial expectation when starting in state t and s with weight w.
Comparing action γj and δj for starting weight −(k−1)+j, we conclude that the difference between optimal values d(−(k−1)+j) is equal to βj, for 0≤j≤k−1.
The important fact we use next is that for partial expectations, the optimal values e(r,w) for states r∈S∖{goal} and starting weights w∈Z satisfies
the optimality equation (∗) from Section 3.2 when setting e(goal,w)=w as already shown in [CFK*+*13a]:
[TABLE]
By the fact that Gαˉ encodes the given linear recurrence relation as soon as γ and δ are the optimal actions as shown in Section 3.2, we conclude the following lemma:
Lemma 4.12**.**
Consider the linear recurrence sequence (un)n≥0 given above by α1,…,αk and β0,…,βk−1 and the MDP M constructed from this sequence.
We have
[TABLE]
for all n with the values d(w) just defined.
Proof of Theorem 4.10: computation of the threshold ϑ.
Let us now consider a run of the MDP M. For any w>0, state c is reached with accumulated weight w with positive probability. As before, an optimal scheduler has to decide whether the partial expectation when starting with weight w is better in state s or t:
Action τ is optimal in c for accumulated weight w if and only if d(w)≥0.
Once t or s is reached, the optimal actions are given by Lemma 4.11.
Let S be the scheduler that always chooses τ in c and actions γ,γ0,…,γk−1,δ,… as described in Lemma 4.11.
We conclude that S is optimal if and only if the given linear recurrence sequence is non-negative.
The remaining step is hence in our reduction is hence to prove that the partial expectation under S is rational and can be computed in polynomial time:
Lemma 4.13**.**
Let S be the scheduler for the constructed MDP M always choosing τ in c and actions γ,γ0,…,γk−1,δ,… as described in
Lemma 4.11. The value PEMS is rational and computable in polynomial time.
Proof.
Recall that the scheduler S chooses γ and δ, respectively, as long as the accumulated weight is positive. For an accumulated weight of −(k−1)+j for 0≤j≤k−1, it chooses actions γj and δj, respectively.
Analogously to the proof of Lemma 4.4, we want to recursively express the partial expectations under S starting from t or s with some positive accumulated weight n∈N which we again denote by e(t,n) and e(s,n), respectively.
In order to do so, we reuse the following Markov chain C from Lemma 4.4
also depicted in Figure 7 which we briefly recall here:
The Markov chain C has 5k states named t−k+1, …, t+k, s−k+1, …, s+k, and goal+1, …, goal+k.
States t−k+1, …, t0, s−k+1, …, s0, and goal+1, …, goal+k are absorbing.
For 0<i,j≤k, there are transitions from t+i to t+i−j with probability αj if αj>0, to s+i−j
with probability ∣αj∣ if αj<0, and to goal+i with probability 1−∣α1∣−…−∣αk∣. Transitions from s+i are defined analogously.
The idea behind this Markov chain is that the reachability probabilities describe how, for arbitrary n∈N and 1≤i≤k,
the values e(t,nk+i) and e(s,nk+i) depend on n and the values e(t,(n−1)k+j) and e(s,(n−1)k+j) for 1≤j≤k.
The transitions in C behave as γ and δ in M, but the decrease in the accumulated weight is explicitly encoded into the state space.
Namely, for n∈N and 0<i≤k, we have
[TABLE]
and analogously for e(s,nk+i).
We now group the optimal values together in the following column vectors
[TABLE]
for n∈N. In other words, this vector contains the optimal values for the partial expectation when starting in t or s with an accumulated weight from {nk+1,…,nk+k}.
Further, we define the vector containing the optimal values for weights in {−k+1,…,0} which are the least values of accumulated weight reachable under scheduler S.
[TABLE]
As we have seen, these values are given as follows:
[TABLE]
for 0≤j≤k−1.
As the reachability probabilities in C are rational and computable in polynomial time, we conclude from (4.2)
that there are a matrix A∈Q2k×2k, and vectors a and b in Q2k computable in polynomial time such that
[TABLE]
for all n∈N. We claim that the following explicit representation for n≥−1 satisfies this recursion:
[TABLE]
We show this by induction:
Clearly, this representation yields the correct value for v−1. So, assume vn=An+1v−1+∑i=0n(n−i)Aia+∑i=0nAib. Then,
[TABLE]
So, we have an explicit representation for vn.
The value we are interested in is
[TABLE]
Let c=(2k1,2k−11,…,211,0,…,0).
Then,
[TABLE]
Hence, we can write
[TABLE]
We claim that all of the matrix series involved converge to rational matrices. As in the proof of Lemma 4.4, we observe that the maximal row sum in A is at most ∣α1∣+…+∣αk∣<1 because the rows of the matrix contain exactly the probabilities to reach t0, …t−k+1, s0, …, and s−k+1 from a state t+i or s+i in C for 1≤i≤k. But the probability to reach goal+i from these states is already 1−∣α1∣−…−∣αk∣. Hence, ∥A∥∞, the operator norm induced by the maximum norm ∥⋅∥∞, which equals maxi∑j=12k∣Aij∣, is less than 1.
So, of course also ∥2k1A∥∞<1 and hence the Neumann series ∑n=0∞(2k1A)n converges to (I2k−2k1A)−1 where I2k is the identity matrix of size 2k×2k. So,
[TABLE]
Note that ∥A∥∞<1 also implies that I2k−A is invertible. We observe that for all n,
[TABLE]
which is shown by straight-forward induction.
Therefore,
[TABLE]
Finally, we show by induction that
[TABLE]
This is equivalent to
[TABLE]
For n=0, both sides evaluate to [math]. So, we assume the claim holds for n.
[TABLE]
The remaining series is the following:
[TABLE]
We conclude that all expressions in the representation of PEMS above are rational and computable in polynomial time.
∎
As we have seen, the originally given linear recurrence sequence contains a negative member if and only if the scheduler S is not optimal.
This is the case if and only if
PEMmax>PEMS for the MDP M constructed from the linear recurrence sequence in polynomial time above.
This finishes the proof of Theorem 4.10.
Conditional SSPP.
For the Positivity-hardness of the threshold problem for conditional expectations, we provide a reduction from the threshold problem for partial expectations in the following lemma. Note that a reduction in the other direction is provided in [PB19] rendering the two problems polynomial-time inter-reducible.
Lemma 4.14**.**
The threshold problems for the partial SSPP is polynomial-time reducible to the threshold problem of the conditional SSPP.
Proof.
Let M be an MDP with a designated terminal target state goal and let ϑ be a rational number. We construct an MDP N such that PEMmax>ϑ if and only if CENmax>ϑ.
We obtain N by adding a new initial state sinit′, renaming the state goal to goal′, and adding a new state goal to M. In sinit′, one action with weight [math] is enabled leading to the old initial state sinit and to goal with probability 1/2 each. From goal′ there is one new action leading to goal with probability 1 and weight +ϑ.
Each scheduler S for M can be seen as a scheduler for N and vice versa.
Now, we observe that for any scheduler S,
[TABLE]
Hence, PEMmax>ϑ if and only if CENmax>ϑ.
∎
Together with the Positivity-hardness of the threshold problem for partial expectations (Theorem 4.10), we conclude:
Theorem 4.15**.**
The Positivity problem is reducible in polynomial time to the following problem: Given an MDP M and a rational ϑ, decide whether CEMmax>ϑ.
Two-sided partial SSPP
To conclude this section, we prove the Positivity-hardness of a two sided version of the partial SSPP with two non-negative weight functions. This result will form the basis to further Positivity-hardness results in the subsequent section.
The key idea is that, instead of using arbitrary integer weights, we can simulate the non-monotonic behavior of the accumulated weight along a path in the partial SSPP with arbitrary weights with two non-negative weight functions.
In the definition of the random variable ⊕Goal, we can replace the choice that paths not reaching Goal are assigned weight [math] by a second weight function.
Let M=(S,Act,Pr,sinit,wgtgoal,wgtfail,goal,fail) be an MDP with two designated terminal states goal and fail and two non-negative weight functions wgtgoal:S×Act→N and wgtfail:S×Act→N. Assume that the probability PrM,sinitmin(◊{goal,fail})=1.
Define the following random variable X on maximal paths ζ:
[TABLE]
Due to the assumption that goal or fail is reached almost surely under any scheduler, the expected value EM,sinitS(X) is well-defined for all schedulers S for M.
We call the value EM,sinitmax(X)=supSEM,sinitS(X) the optimal two-sided partial expectation.
We can show that the threshold problem for the two-sided partial expectation is Positivity-hard as well by a small adjustment of the construction above.
Theorem 4.16**.**
The Positivity problem is polynomial-time reducible to the following problem:
Given an MDP M=(S,Act,Pr,sinit,wgtgoal,wgtfail,goal,fail) as above and a rational ϑ, decide whether EM,sinitmax(X)>ϑ.
Proof.
Given the parameters α1,…,αk and β0,…,βk−1 of a rational linear recurrence sequence, we can construct an MDP M′=(S,Act,Pr,sinit,wgt,goal,fail) with one weight function wgt:S×Act→Z
similar to the MDP M depicted in Figure 11.
W.l.o.g., we again assume that ∑i∣αi∣<41 and that 0≤βj<4k2k+21 for all j.
The initial gadget and the gadget Gαˉ are as before.
The gadget Pβˉ, however, is slightly modified and replaced by the gadget Tβˉ depicted in Figure 12.
For this gadget, we define α=∑i=1k∣αi∣, p1=(1−α)(2k2(k−j)1+βj), p2=(1−α)(1−(2k2(k−j)1+βj)), q1=(1−α)2k2(k−j)1, and q2=(1−α)(1−2k2(k−j)1).
With the transitions as in the figure, the probability to reach goal or fail and the weight accumulated does not change when choosing action γj or δj compared to the gadget Pβˉ.
The only difference is that the expected time to reach goal or fail changes.
The steps alternate between probability α and probability [math] to reach goal or fail – just as in the gadget Gαˉ. In this way, it makes no difference for the expected time before reaching goal or fail when a scheduler stops choosing γ and δ.
We can, in fact, compute the expected time T to reach goal or fail from sinit under any scheduler quite easily:
Reaching t or s takes 3 steps in expectation. Afterwards, two further steps are taken 1/α-many times in expectation. So,
[TABLE]
The optimal scheduler S for the partial expectation in M′ is the same as in the MDP M above. Also the value ϑ of this scheduler can be computed as in Theorem 4.13.
So, PEM′,sinitmax>ϑ if and only if the given linear recurrence sequence is eventually negative.
Note that all weights in M′ are ≥−k.
We define two new weight functions to obtain an MDP N from M′: We let wgtgoal(s,α)=wgt(s,α)+k and wgtfail(s,α)=+k for all (s,α)∈S×Act. Both weight functions take only non-negative integer values.
Any scheduler S for M′ can be viewed as a scheduler for N, and vice versa, as the two MDPs only differ in the weight functions.
Further, we observe that for each maximal path ζ ending in goal or fail in M′ and at the same time in N, we have X(ζ)=⊕goal(ζ)+k⋅length(ζ).
(Recall that ⊕goal(ζ) equals wgt(ζ) if ζ reaches goal and [math] if ζ reaches fail.)
As the expected time before goal or fail is reached is constant, namely T under any scheduler, it follows that for all schedulers T we have
[TABLE]
Therefore, EN,sinitmax(X)>ϑ+k⋅T if and only if the given linear recurrence sequence eventually becomes negative.
∎
While the two-sided partial expectation is certainly interesting in its own right, it will also play an important role in the proof of the Positivity-hardness of the threshold problem for the optimal long-run probability of a regular co-safety property in the next section.
4.3 Long-run probabilities and frequency-LTL
We now turn our attention to problems concerning the long-run satisfaction of path properties. On first sight, these problems seem to be of quite a different nature compared to the problems addressed so far.
Long-run probabilities are an average of the probability after each step that a property φ is satisfied on the suffix of an execution starting after that step. Similarly, frequency-LTL allows us to express that
the long-run fraction of suffixes of an execution satisfying φ lies above a threshold. Nevertheless, upon closer inspection, the two-sided version of partial expectations considered at the end of the previous section
actually shares similarities with these long-run satisfaction problems. This allows us to provide a reduction from the threshold problem for two-sided partial expectations to the threshold problem for long-run probabilities of a simple fixed co-safety property.
We begin by formally defining long-run probabilities.
Long-run probability.
Let M=(S,Act,P,sinit,AP,L) be an MDP
and let φ be a path property.
The long-run probability of φ
on an infinite path ζ under a
scheduler S for M is defined as as the long-run average
of the probabilities for φ in all positions of
ζ with respect to the residual
schedulers S↑ζ[0…i]
defined by
[TABLE]
for finite paths π starting in ζ[i]:
[TABLE]
The long-run probability of property φ under scheduler S
from state s, denoted
LPM,sS(φ), is defined as the expectation
of the random variable ζ↦lrpφS(ζ)
under S with starting state s:
[TABLE]
We now address the task to compute the extremal
long-run probabilities for φ:
[TABLE]
where S ranges over all schedulers for M.
In contrast to classical optimization problems for MDPs, the random
variable whose expectation we aim to optimize, namely
lrpφS, depends on the scheduler S itself.
Example 4.17**.**
To illustrate the notion of long-run probability, consider the following example, which is a simplification of an example from [BBPS19].
Let N be the MDP shown in Fig. 13. The only
non-deterministic choice is the choice between actions α and β in state a. Action α
yields a uniform distribution over the three successors.
We want to determine the maximal long-run probability of aUb.
Under the memoryless scheduler Sα that always picks
action α, the probability of aUb in the a-state is
21 under this scheduler. The states b1 and c1 appear equally often.
The probability of aUb is 1 in state b1 and [math] in state c1. We thus conclude that the long-run probability under
Sα is 21. Similarly, the steady-state
probability of the states a and b2 under the memoryless scheduler
Sβ are 41, and the probability
that aUb holds from there is 1. The long-run probability of aUb under
Sβ equals 21 as well.
Interestingly, these two memoryless schedulers are not optimal.
Consider the scheduler S that chooses α first and, if it returns to a directly, chooses β afterwards.
In the first visit to the a state, the probability for aUb is 32.
States b1 and c1 are reached with probability 1/3 afterwards. If state a is reached again directly, the probability of aUb is now 1. Also state b2 is reached with probability 1/3 before returning to a from b1, c1, or c2. Tho compute the long-run probability under this scheduler, we sum up the satisfaction probabilities for all states that can be visited before returning to a from b1, c1, or c2 multiplied with the probabilities of the visits. We divide the result by the expected number of steps before returning. Note that we sum up probability 2/3+1/3⋅1 for the two possible visits to state a. We obtain a long-run probability of
[TABLE]
Preliminary results.
Before we prove the Positivity-hardness result for long-run probabilities, we need some results on the connection between long-run probabilities and mean payoffs.
The results are presented in [Pir21] and we briefly state the necessary results in the sequel.
We will prove that the threshold problem for the maximal long-run probability is Positivity-hard already for very simple co-safety properties φ.
To represent regular co-safety properties, we will use deterministic finite-automata (DFA). A run ζ satisfies the represented co-safety property if some prefix of L(ζ) is accepted by the DFA.
In [Pir21], a construction is provided that allows us to express the optimal long-run probability of regular co-safety properties in terms of expected mean payoffs. We briefly present this reduction and the
corresponding results here:
The main idea is to construct an MDP with extended state space that keeps track of the number of runs currently in each state of the automaton. We can then assign a weight to each step of the MDP depending on how many runs of the DFA enter an accepting state during that step.
The optimal mean payoff in the constructed MDP then coincides with the optimal long-run probability in the original MDP.
However, there is no bound on the number of runs we have to store in the state space for this construction. Therefore, the constructed MDP will have an infinite state space.
Let M=(S,Act,P,sinit,AP,L) be a strongly connected MDP and let D=(Q,2AP,δ,q0,F) be a DFA over AP. As we are interested in the co-safety property given by D, only runs of D up to the first accepting state are relevant. Hence,
we can collapse all accepting states of D to one absorbing state accept and all states form which accept is not reachable to one state reject.
Let the set of states Q={q0,q1,…,qℓ,accept,reject} for some ℓ∈N.
We construct a weighted infinite-state MDP MD=(S′,Act,P′,sinit′,wgt) in the sequel.
The state space is
[TABLE]
The ℓ+1 natural numbers in a state store the number of runs of D on suffixes of the path produced by the MDP so far that are in the respective state of D.
The actions Act are the same as in M.
For the transition probability function P′ we define the following:
Let s′=(s,n0,…,nℓ) and t′=(t,m0,…,mℓ) be states such that for all i,
[TABLE]
where ιi=1 if i=0 and ιi=0 otherwise.
For such states, we set
[TABLE]
All other transition probabilities are [math].
The weight function does not work on state-weight pairs as usual, but on single transitions in S′×Act×S′.
For a transition (s′,α,t′) with s′=(s,n0,…,nℓ) and t′=(t,m0,…,mℓ), the weight is defined by
[TABLE]
To obtain a weight function on state-weight pairs, one could now take the weighted average over all possible transitions that can be taken via a state action-pair. As we will be interested in the mean payoff under this weight function, this change would not influence the subsequent considerations.
The initial state sinit′ is (sinit,1,0,…,0).
We observe that the sum of the entries in the last ℓ+1 components increases by at most 1 in each step. Hence, the total accumulated weight after n steps along any path is bounded by n and we can already conclude that the mean payoff in MD is bounded by 1 along each path.
A scheduler for M can be used as a scheduler for MD and vice versa as transitions in MD are uniquely defined by the transitions in the M component. If we consider a scheduler S for both M and MD, there is, however, one caveat: If S is a finite memory-scheduler for MD, the same scheduler is not necessarily a finite-memory scheduler for M. So, we want to emphasize that the following lemma states that the maximal mean payoff in MD can be approximated by schedulers that are still finite-memory schedulers when considered as schedulers for M.
Lemma 4.18** ([Pir21]).**
Let M and D be given as above and let MD be the constructed MDP.
For each scheduler T for MD and each ε>0,
there is a finite-memory scheduler F for M such that, if F is seen as a scheduler for MD:
[TABLE]
For finite-memory schedulers, the mean-payoff in MD and the long-run probability of the co-safety property given by D in M coincide:
Lemma 4.19** ([Pir21]).**
Let M and D be given as above and let MD be the constructed MDP.
Then, for each finite-memory scheduler S for M (also viewed as a scheduler for MD), we have LPM,sinitS(D)=EMD,sinit′S(MP).
As shown in [Pir21], using Fatou’s lemma, one can show that also the optimal long-run probability can be approximated by finite-memory schedulers:
Lemma 4.20** ([Pir21]).**
Let M and φ be given as above.
For each scheduler T for M and each ε>0,
there is a finite-memory scheduler F for M such that:
[TABLE]
It follows that indeed, the optimal long-run probability can be expressed as an optimal expected mean payoff via the provided reduction.
Theorem 4.21** ([Pir21]).**
Let M and D be as above. Let MD be the infinite-state MDP constructed from M and D as described above.
Then,
[TABLE]
Positivity-hardness.
So far,
we have seen that the optimal long-run probability of a regular co-safety can be expressed in terms of an optimal expected mean-payoff.
This insight allows us to draw a connection between long-run probabilities and two sided-partial expectations that we just discussed. Recall that for an MDP M=(S,Act,Pr,sinit,wgtgoal,wgtfail,goal,fail) with two designated absorbing states goal and fail and two non-negative weight functions wgtgoal:S×Act→N and wgtfail:S×Act→N, the two-sided partial expectation was defined as the expectation of the following random variable X on maximal paths ζ:
[TABLE]
We now exploit the construction of MD provided above to mimic a behavior similar to the payoff according to the random variable X.
Consider the DFA D depicted in Figure 14. The state space is Q={qinit,q1,q2,accept,reject}.
The alphabet is 2{a,b,c,goal,fail}. From the initial state letters satisfying a∧b∧¬c lead to q1, letters satisfying a∧c∧¬b to q2 and all remaining letters to reject. From q1, letters satisfying a∧¬goal lead back to q1, letters with goal∧¬a to accept, and all remaining letters lead to reject. Transitions from q2 are defined analogously with goal replaced by fail.
Consider a run ρ of an MDP M labeled with {a,b,c,goal,fail} for which we keep counters of the number of runs on suffixes of ρ in each of the states of D: We only need counters c1 and c2 for states q1 and q2 as these are the only states multiple runs can be in before being accepted or rejected.
The update of the counters in the MDP MD can directly be determined from the DFA D: E.g., if {a,b} is read, counter c1 is increased; if {a,c} is read, counter c2 is increased. On {a}, both counters stay the same. If no a is read, the counters are reset to [math]. If at the same time goal is read, the value of c1 is received as weight. If fail is read, the value of c2 is received as weight.
So, the behavior of the counters is very similar to the accumulation of two non-negative weight functions. Which of the two weight functions or the two counters is used to determine the payoff depends on whether goal or fail is reached next.
In the sequel, we will show that indeed already the fixed co-safety property of this simple DFA D suffices to prove Positivity-hardness of the threshold problem for long-run probabilities.
The proof of the Positivity-hardness of the threshold problem for the two-sided partial expectation with non-negative weights contains most of the necessary ingredients we need:
Let (un)n≥0 be a rational linear recurrence sequence given by initial values β0,…,βk−1 and the coefficients α1,…,αk of the recurrence.
In the proof of Theorem 4.16, we showed that we can construct an MDP M=(S,Act,Pr,sinit,wgtgoal,wgtfail,goal,fail) and rationals ϑ, T with the following properties from the given parameters:
For the two designated states goal and fail, we have PrM,sinitmin(◊{goal,fail})=1.
The expected number of steps until goal or fail is reached is T under any scheduler.
The weight functions wgtgoal and wgtfail assign a weight between [math] and 2k to each state-action pair.
EM,sinitmax(X)>ϑ if and only if there is an n with un<0.
From this MDP M, we now construct a labeled MDP K. For each state-action pair (s,α) of M with s∈{goal,fail}, we add a chain
rs,α1,…,rs,α4k of new states as depicted in Figure 15: We redirect the transition from s when choosing α to this chain
by setting PK(s,α,rs,α1). In the states of the chain only one action τ is enabled. The process moves through the chain with probability 1 via this action, i.e., PK(rs,αi,τ,rs,αi+1)=1 for all i<4k. Then, the original
transition is taken from the state rs,α4k by setting PK(rs,α4k,τ,t)=P(s,α,t) for all states t of M.
Instead of making goal and fail absorbing, we furthermore add transitions back to the initial state sinit from goal and fail with probability 1.
Note that the expected time from sinit until sinit is reached again from goal or fail is now T′=T(4k+1)+1 in K under any scheduler.
The labeling is defined as follows:
All states except for goal and fail are labeled with a. States goal and fail are labeled with their names.
Furthermore, in each of the chains rs,α1,…,rs,α4k, the first wgtgoal(s,α) of the states are labeled with b in addition to the label a. The next wgtfail(s,α) states are labeled with c in addition to the a. As wgtgoal(s,α)+wgtfail(s,α)≤4k, this is always possible.
Consider now a path π of M from sinit to goal or fail. There is a unique corresponding path π^ in K.
The counters induced by the DFA D as described above now behave exactly like the accumulation weight functions wgtgoal and wgtfail. The value c1 counting the number of runs in state q1 of D is precisely wgtgoal(π) when entering goal or fail as in each chain of states rs,α1,…,rs,α4k exactly wgtgoal(s,α)-many runs of D enter state q1. The counter c2 behaves analogously in terms of the weight function wgtfail.
As all states not in {goal,fail} are labeled with a, the counters are also not reset.
When entering goal, the random variable X assigns weight wgtgoal(π) to the path π. The same weight is received from the counter c1 in this case. When entering fail, weight wgtfail(π) is assigned by X and received from the counters.
As the time required to reach sinit again from goal or fail in K in expectation is T′ under any scheduler, a scheduler maximizing the expected mean payoff in KD, i.e., according to the weight function induced by the counter for D, hence has to maximize the expected value of X when considered as a scheduler for M. By Theorem 4.21, the maximal mean payoff in KD
equals the maximal long-run probability LPK,sinitmax(D). Putting these results together, we obtain that
[TABLE]
if and only if the given linear recurrence sequence is eventually negative.
We conclude the Positivity-hardness result for long-run probabilities.
Note that the Positivity-hardness holds for the fixed simple DFA D.
Theorem 4.22**.**
There is a fixed DFA D such that
the Positivity problem is polynomial-time reducible to the following problem:
Given an MDP M and a rational χ, decide whether LPMmax(D)>χ.
Frequency-LTL.
A consequence of this result is that model checking of frequency-LTL in MDPs is at least as hard as the Positivity problem. Whether the model-checking problem for the full logic frequency-LTL is decidable has been raised as an open question in [FK15, FKK15]. Our result now shows that proving decidability of this model-checking problem would render the Positivity problem decidable as well.
The frequency-globally modality Ginf>ϑ(φ) of frequency-LTL is defined to hold on a path π iff
[TABLE]
i.e. iff the long-run frequency of φ exceeds ϑ.
Theorem 4.23**.**
There is a polynomial-time reduction from the Positivity problem to the following qualitative model checking problem for frequency-LTL for a fixed LTL-formula φ: Given an MDP M and a rational ϑ, is PrMmax(Ginf>ϑ(φ))=1?
Proof.
Consider the MDP K, the DFA D, and the threshold ϑ′=ϑ/T′ constructed above. As the sets of states labeled with b and with c are disjoint and included in the set of states labeled with a, and likewise the sets of states labeled with a, goal, and fail are pairwise disjoint in K,
a path of K has a prefix accepted by D if and only if the path satisfies
[TABLE]
We claim that there is a scheduler S with LPK,sinitS(D)>ϑ′
if and only if there is a scheduler T such that Ginf>ϑ(φ) holds with probability 1 under T in K.
Suppose there is a scheduler with S with LPKS(D)>ϑ′.
By Lemma 4.20, we can assume that S is a finite-memory scheduler as the maximal long-run probability can be approximated by finite-memory schedulers. As K is strongly connected, we can further assume that S induces only one BSCC. We claim that under this scheduler S also Ginf>ϑ(φ) holds with probability 1. For finite-memory schedulers, it is easy to check that the expected long-run probability equals the expected long-run frequency as we obtain a finite-state Markov chain: Let xs be the steady state probability of states s enriched with memory modes in the single BSCC BS induced by S.
Further, let ps be the probability that a run starting in s under S satisfies φ.
Then, LPKS(D)=∑s∈BSxs⋅ps. But the same expression also computes the expected frequency with which φ holds on suffixes as shown in [FK15].
Furthermore, in a strongly connected Markov chain, the frequency of φ along almost all paths agrees with the expected frequency (see [FK15]). So,
[TABLE]
holds on almost all paths ζ.
Conversely, if there is a scheduler T such that Ginf>ϑ(φ) holds with probability 1 under T in K, the expected value EKS(liminfn→∞n+11∑i=0n\mathds1ς[i…]⊨φ)>ϑ. By an argument using Fatou’s lemma analogously to the proof of Lemma 4.20 in [Pir21], we can find a finite memory scheduler with expected long-run frequency, and hence long-run probability, greater than ϑ.
∎
4.4 Conditional value-at-risk for accumulated weights
Lastly, we aim to prove the Positivity-hardness of the threshold problem for the CVaR in this section.
To this end, we provide a further direct reduction from the Positivity-problem to the threshold problem for the expected value of an auxiliary random variable closely related to the CVaR using our MDP-gadgets.
Conditional Value-at-Risk.
Given an MDP M=(S,Act,P,sinit,wgt,Goal) with a scheduler S, a random variable X defined on runs of the MDP with values in R and a value p∈[0,1], we define the value-at-risk as
VaRpS(X)=sup{r∈R∣PrMS(X≤r)≤p}.
So, the value-at-risk is the point at which the cumulative distribution function of X reaches or exceeds p.
The conditional value-at-risk is now the expectation of X under the condition that the outcome belongs to the p worst outcomes – in this case, the p lowest outcomes.
Denote VaRpS(X) by v.
Following the treatment of random variables that are not continuous in general in [KM18], we define the conditional value-at-risk as follows:
[TABLE]
Outcomes of X which are less than v are treated differently to outcomes equal to v as it is possible that the outcome v has positive probability and we only want to
account exactly for the p worst outcomes. Hence, we take only p−PrMS(X<v) of the outcomes which are exactly v into account as well.
To provide worst-case guarantees or to find risk-averse policies, we are interested in the maximal and minimal conditional value-at-risk
[TABLE]
In our formulation here, low outcomes are considered to be bad. Completely analogously, one can define the conditional value-at-risk for the highest p outcomes.
The main result of the section is the following:
Theorem 4.24**.**
The Positivity problem is polynomial-time reducible to the following problem:
Given an MDP M and rationals ϑ and p∈(0,1), decide whether
[TABLE]
We will use an auxiliary optimization problem to prove this result.
We begin with the following consideration: Given an MDP M with initial state sinit,
we construct a new MDP N.
We add a new initial state sinit′. In sinit′, there is only one action with weight [math] enabled
leading to sinit with probability 31 and to goal with probability 32. So, at least two thirds of the paths accumulate weight [math] before reaching the goal.
Hence, we can already say that VaR1/2S(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal)=0 in N under any scheduler S. Note that schedulers for M can be seen as schedulers for N and vice versa.
This considerably simplifies the computation of the conditional value-at-risk in N. Define the random variable \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal on paths ζ by
[TABLE]
Now, the conditional value-at-risk for the probability value 1/2 under a scheduler S in N is given by
CVaR1/2S(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal)=2⋅EN,sinitS(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal)=32⋅EM,sinitS(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal).
So, the result follows from the following lemma:
Lemma 4.25**.**
The Positivity problem is polynomial-time reducible to the following problem:
Given an MDP M and a rational ϑ, decide whether EM,sinitmax(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal)>ϑ.
Proof.
The first important observation is that the optimal expectation e(q,w) of \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal for different starting states q and starting weights w satisfies equation (∗) from Section 3.2, i.e.,
e(q,w)=∑r∈SP(q,α,r)⋅e(r,w+wgt(q,α)) if an optimal scheduler chooses actions α in state q=goal when the accumulated weight is w. The value e(goal,w) is w if w≤0 and [math] otherwise.
This allows us to reuse the gadget Gαˉ to encode a linear recurrence relation.
We again adjust the gadget encoding the initial values of a linear recurrence sequence.
So, let k be a natural number, α1,…,αk be rational coefficients of a linear recurrence sequence, and β0,…,βk−1≥0 the rational initial values. W.l.o.g. we again assume these values to be small using Assumption 3.1, namely: ∑1≤i≤k∣αi∣≤5(k+1)1 and for all j, βj≤31α where α=∑1≤i≤k∣αi∣.
The new gadget that encodes the initial values of a linear recurrence sequence is depicted in Figure 16. In states t and s, there is a choice between actions γj and δj, respectively, for 0≤j≤k−1.
After glueing together this gadget with the gadget Gαˉ at states t, s, and goal,
we prove that the interplay between the gadgets is correct:
Let 0≤j≤k−1. Starting with accumulated weight −k+j in state t, the action γj maximizes the partial expectation among the actions γ0,…,γk−1. Likewise, δj is optimal when starting in s with weight −k+j. If the accumulated weight is non-negative in state s or t, then γ or δ are optimal. The idea is that for positive starting weights, the tail loss of γi and δi is relatively high while for weights just below [math], the chance to reach goal with positive weight again outweighs this tail loss.
First, we estimate the expectation of \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal when choosing δi and δ while the accumulated weight is −k+j in s.
If i>j, then δi and δ lead to goal directly with probability 1−α and weight ≤−1.
So, the expectation is less than −(1−α)≤−1+5(k+1)1.
If i≤j, then with probability 1−α goal is reached with positive weight, hence \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal is [math] on these paths.
With probability βi, goal is reached via yj′. In this case all runs reach goal with negative weight. On the way to yj′ weight 2k is added, but afterwards subtracted again at least once.
In expectation weight 2k is subtracted kk+1 many times. Furthermore, −2k+i is added to the starting weight of −k+j. So, these paths contribute βi⋅(2k−2kkk+1−3k+j+i)=(−3k+j+i−2)⋅βi to the expectation of \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal.
With analogous reasoning, we see that the remaining paths contribute (−3k+j+i−1)⋅(α−βi). So, all in all the expectation of \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal in this situation is α⋅(−3k+j+i−1)−βi.
Now, as α≤5(k+1)1 and βi≤3α for all i,
we see that α⋅(−3k+j+i−1)−βi≥−(3k+2)α≥−1+5(k+1)1. The optimum with i≤j is obtained for i=j as βi≤α/3 for all i.
Hence
indeed δj is the optimal action. For γj the same proof with βi=0 for all i leads to the same result.
Now assume that the accumulated weight in t or s is ℓ≥0.
Then, all actions lead to goal with a positive weight with probability 1−α.
In this case \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal is [math].
However, a scheduler S which always chooses γ and δ is better than a scheduler choosing γj or δj for any j≤k−1.
Under scheduler S starting from s or t a run returns to {s,t} with probability α while accumulating weight
≥−k and the process is repeated. After choosing γj or δj the run moves to xj, yj or yj′ while accumulating a negative weight. From then on, in each step it will stay in that state with probability greater than α and accumulate weight ≤−k. Hence, the expectation of \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal is lower under γj or δj than under S. Therefore indeed γ and δ are the best actions for non-negative accumulated weight in states s and t.
Let now e(t,w) and e(s,w) denote the optimal expectations of \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal when starting in t or s with weight w.
Further, let d(w)=e(t,w)−e(s,w).
From the argument above, we also learn that
the difference d(−k+j) is equal to βj, for 0≤j≤k−1 .
Put together with the linear recurrence encoded in Gαˉ this shows that d(−k+w)=uw for all w where (un)n∈N is the linear recurrence sequence specified by the αi, βj, 1≤i≤k, and 0≤j≤k−1.
Finally, we add the same initial component as in the previous section to obtain an MDP M. Let S be the scheduler always choosing τ in state c and afterwards following the optimal actions as described above is optimal iff the linear recurrence sequence stays non-negative.
The remaining argument goes completely analogously to the proof of Theorem 4.1.
Grouping together the optimal values in vectors vn with 2k entries as done there, we can use the same Markov chain as in that proof to obtain a matrix A such that vn+1=Avn. This allows us to compute the rational value ϑ=EM,sinitS(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal) via a matrix series in polynomial time and
EM,sinitmax(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal)>ϑ if and only if the given linear recurrence sequence is eventually negative.
∎
By the discussion above, this lemma directly implies Theorem 4.24.
With adaptions similar to the previous section, it is possible to obtain the analogous result for the minimal expectation of \leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal. This implies that also the threshold problem whether the minimal
conditional value-at-risk is less than a threshold ϑ, CVaRpmin(\leavevmodeto12.69pt\vboxto14.42pt\pgfpicture\makeatletter\lower-7.208ptto0.0pt\pgfsys@beginscope\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@setlinewidth0.4pt\pgfsys@invoke \nullfontto0.0pt\pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.0-3.01389pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.03.01385pt0.0pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt-3.87498pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@beginscope\pgfsys@invoke \pgfsys@beginscope\pgfsys@invoke \pgfsys@transformcm1.00.00.01.00.0pt3.87495pt\pgfsys@invoke \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@invoke \pgfsys@color@rgb@fill000\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto3.01385pt0.0pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@lineto0.0pt3.87495pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto0.0pt3.87495pt\pgfsys@lineto0.0pt-3.87498pt\pgfsys@stroke\pgfsys@invoke \pgfsys@moveto3.01385pt0.0pt\pgfsys@lineto-3.01389pt0.0pt\pgfsys@stroke\pgfsys@invoke \pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\pgfsys@discardpath\pgfsys@invoke\lxSVG@closescope\pgfsys@endscope\hss\lxSVG@closescope\endpgfpicturegoal)<ϑ, is Positivity-hard.
5 Conclusion
The Positivity-hardness results established in this paper show that a series of problems on finite-state MDPs that have been studied and left open in the literature exhibit an inherent mathematical difficulty.
A decidability result for any of these problems would imply a major break-through in analytic number theory.
At the heart of our Positivity-hardness proofs lies the construction of modular MDPs consisting of three gadgets. This construction provides a versatile proof strategy to establish Positivity-hardness results: It allowed us to provide three direct reductions from the Positivity problem by constructing structurally identical MDPs that only differ in the gadget encoding the initial values. The further chains of reductions depicted in Figure 1 established Positivity-hardness for a landscape of different problems. These problems range from problems on one-counter MDPs and integer-weighted MDPs also to problems concerning the long-run satisfaction of path properties, namely the threshold problem for long-run probabilities and the model-checking problem of frequency-LTL, which are on first sight of a rather different nature compared to the other problems.
The proof technique might be applicable to further threshold problems associated to optimization problems on MDPs. A main requirement for the direct applicability of the technique is that the optimal values V(s,w) in terms of the current state s and the weight w accumulated so far, or a similar quantity that can be increased and decreased, satisfy an optimality equation of the form
[TABLE]
In addition, the optimum must not be achievable with memoryless schedulers, but the optimal decisions have to depend on the accumulated weight to make it possible to encode initial values of a linear recurrence sequence. This combination of conditions is quite common as we have seen.
Furthermore, our and possible future Positivity-hardness results might be transferrable to further notions resulting from taking long-run averages (as in the case of long-run probabilities and frequency-LTL) or conditioning (as in the case of conditional expectations and conditional values-at-risk).
In the special case of Markov chains, several of the investigated problems are decidable: In Markov chains, partial and conditional expectations and long-run probabilities and frequencies can easily be computed.
Furthermore, one-counter Markov chains constitute a special case of recursive Markov chains, for which the threshold problem for the termination probability can be decided in polynomial space [EY09].
Remarkably however, the threshold problem for the probability that the accumulated cost satisfies a Boolean combination of inequality constraints in finite-state Markov chains is open [HKL17].
Finally, the Positivity-hardness results of course leave the possibility open that
some or all of the problems we studied are in fact harder than the Positivity problem. In particular, it could be the case that the problems are undecidable and that a proof of the undecidability would yield no implications for the Positivity problem.
For this reason, investigating whether some or all of the threshold problems are reducible to the Positivity problem constitutes a very interesting – and challenging – direction for future work.
Such an inter-reducibility result would show that studying any of the discussed optimization problems on MDPs could be a worthwhile direction of research to settle the decidability status of the Positivity-problem.
Some hope for an inter-reducibility result can be drawn from the fact that the optimal values are approximable for several of the problems – for termination probabilities and expected termination times of one-counter MDPs, this was shown in [BBEK11, BKNW12] and for partial and conditional expectations in [PB19]. This indicates that there is at least a major difference to undecidable problems in a similar context such as the emptiness problem for probabilistic finite automata where the optimal value cannot be approximated [Paz71, CL89].