Probabilistic Motion Planning under Temporal Tasks and Soft Constraints

Meng Guo; Michael M. Zavlanos

arXiv:1706.05209·cs.RO·October 24, 2017


TL;DR

This paper presents a probabilistic motion planning framework for mobile robots under uncertainty, optimizing task satisfaction probability and cost, with novel multi-objective control synthesis and validation through simulations and experiments.

Contribution

It introduces a multi-objective optimization approach using coupled Linear Programs for probabilistic task satisfaction and cost minimization, including a new algorithm for zero-probability scenarios.

Findings

01

Outperforms Round-Robin policies in trajectory suffix

02

Provides guarantees on probabilistic satisfaction and cost optimality

03

Includes a new control synthesis method for zero-probability cases

Abstract

This paper studies motion planning of a mobile robot under uncertainty. The control objective is to synthesize a finite-memory control policy, such that a high-level task specified as a Linear Temporal Logic (LTL) formula is satisfied with a desired high probability. Uncertainty is considered in the workspace properties, robot actions, and task outcomes, giving rise to a Markov Decision Process (MDP) that models the proposed system. Different from most existing methods, we consider cost optimization both in the prefix and suffix of the system trajectory. We also analyze the potential trade-off between reducing the mean total cost and maximizing the probability that the task is satisfied. The proposed solution is based on formulating two coupled Linear Programs, for the prefix and suffix, respectively, and combining them into a multi-objective optimization problem, which provides…


Tables

Table I: Statistics of 1000 Monte Carlo simulations of 500 time steps, under different $\gamma$ for task (18).

$\gamma$	Total Cost	Failure	Success	Unfinished
0	132.2	0	910	90
0.1	118.1	99	872	29
0.2	110.5	219	770	11
0.3	104.6	308	692	0
0.4	98.3	417	583	0
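
The counts in Table I can be turned into empirical rates by dividing by the 1000 simulated runs, which makes it easier to compare the observed failure rate against the allowed risk $\gamma$. The Python sketch below does only this arithmetic; the variable and function names are illustrative and are not taken from the authors' code.

```python
# Rows of Table I: gamma -> (failure, success, unfinished) counts out of 1000 runs.
N_RUNS = 1000
table_i = {
    0.0: (0, 910, 90),
    0.1: (99, 872, 29),
    0.2: (219, 770, 11),
    0.3: (308, 692, 0),
    0.4: (417, 583, 0),
}


def empirical_rates(counts, n_runs=N_RUNS):
    """Convert outcome counts into fractions of the total number of runs."""
    assert sum(counts) == n_runs, "outcome counts should cover every run"
    return tuple(c / n_runs for c in counts)


rates = {gamma: empirical_rates(counts) for gamma, counts in table_i.items()}
for gamma, (fail, succ, unfinished) in rates.items():
    print(f"gamma={gamma:.1f} failure={fail:.3f} success={succ:.3f} unfinished={unfinished:.3f}")
```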

Table II: Size and computation time of various models $\mathcal{M}$ as described in Section V-E under task (20). The notation $\texttt{a}\text{e}\texttt{b}\triangleq\texttt{a}\times 10^{\texttt{b}}$ for $\texttt{a},\texttt{b}>0$. The size of $\mathcal{M}$, $\mathcal{A}_{\varphi}$ and $\mathcal{P}$ includes the number of states and transitions. The size of LP problems (13), which contains (8) and (11), includes the number of rows, columns and variables in the linear equations, as indicated by the “Gurobi” solver [36].

$\mathcal{M}$: Size	$\mathcal{M}$: Time [s]	$\mathcal{P}$: Size	$\mathcal{P}$: Time [s]	AMECs $\Xi_{acc}$: Size	AMECs $\Xi_{acc}$: Time [s]	$\boldsymbol{\pi}^{\star}$ via (13): Size of (8)	$\boldsymbol{\pi}^{\star}$ via (13): Size of (11)	$\boldsymbol{\pi}^{\star}$ via (13): Time to solve (13) [s]
(100, 816)	0.13	(4.2e3, 4.1e4)	16.3	1.2e3	4.15	(443, 2.0e3, 8.3e3)	(1.2e3, 4.9e3, 2.1e4)	0.21
(324, 2.8e3)	1.69	(1.1e4, 1.0e5)	41.2	3.6e3	29.4	(1.3e3, 6.3e3, 2.2e4)	(3.6e3, 1.7e4, 5.9e4)	0.72
(900, 8.4e3)	24.2	(2.9e4, 2.8e5)	106.8	1.0e4	337.1	(3.6e3, 1.7e4, 6.0e4)	(9.9e3, 4.8e4, 1.6e5)	16.74
(1.4e3, 1.3e4)	88.7	(4.7e4, 4.5e5)	391.7	1.6e4	1.1e3	(5.8e3, 2.8e4, 9.7e4)	(1.5e4, 7.7e4, 2.6e5)	20.81
(2.5e3, 2.4e4)	326.9	(8.1e4, 7.8e5)	290.1	2.7e4	4.8e3	(1.0e4, 4.9e4, 1.6e5)	(2.7e4, 1.3e5, 4.5e5)	15.74
(3.3e3, 3.2e4)	558.3	(1.0e5, 1.1e6)	380.1	3.7e4	9.4e3	(1.3e4, 6.6e4, 2.2e5)	(3.7e4, 1.8e5, 6.1e5)	32.04

Table III: The optimal prefix cost, suffix cost and the balanced cost as defined in (13) of task (20) under different $\beta$ with $\gamma=0$.

$\beta$	Prefix Cost	Suffix Cost (Total)	Suffix Cost (Mean)	Balanced Cost by (13)
0	180.7	66.1	2.524	66.1
0.2	62.4	67.1	2.533	65.2
0.4	50.5	72.9	2.551	64.1
0.6	49.8	73.5	2.552	59.3
0.8	49.5	74.3	2.554	54.4
1.0	49.5	246.7	2.817	49.5

Table IV: Statistics of 1000 Monte Carlo simulations under different $\gamma_{\texttt{prex}}$ and $d$, for task (19) in Section V-E.

$\gamma_{\texttt{prex}}$	$d$	$\gamma_{\texttt{sufx}}$	Failure	Pre. Success	Suf. Success
0.1	$300$	0.05	106	894	852
0.2	$300$	0.05	169	831	785
0.3	$300$	0.05	318	682	650
0.4	$300$	0.05	409	591	549
0.1	$280$	0.85	888	901	117
0.1	$270$	0.98	997	903	4

Table V: Size and computation time of various models $\mathcal{M}$ under task (19) where no AECs exist in $\mathcal{P}$. The notations are defined similarly as in Table II. In this case, the combined LP in (14) contains (8) and (12) instead.

$\mathcal{M}$: Size	$\mathcal{M}$: Time [s]	$\mathcal{P}$: Size	$\mathcal{P}$: Time [s]	ASCCs $\Omega_{acc}$: Size	ASCCs $\Omega_{acc}$: Time [s]	$\boldsymbol{\pi}^{\star}$ via (14): Size of (8)	$\boldsymbol{\pi}^{\star}$ via (14): Size of (12)	$\boldsymbol{\pi}^{\star}$ via (14): Time to solve (14) [s]
(100, 816)	0.13	(1.0e3, 1.1e4)	0.9	3.1e2	0.66	(202, 920, 3.4e3)	(301, 1.4e3, 4.9e3)	0.45
(324, 2.8e3)	1.57	(2.9e3, 3.1e4)	3.39	9.8e2	1.84	(6.5e2, 3.1e3, 1.1e4)	(9.7e2, 4.7e3, 1.6e4)	2.41
(900, 8.4e3)	23.9	(7.7e3, 7.9e4)	7.04	2.7e3	5.09	(1.8e3, 8.7e3, 3.0e4)	(2.7e3, 1.3e4, 4.5e4)	9.89
(1.4e3, 1.3e4)	92.2	(1.2e4, 1.2e5)	9.78	4.3e3	8.41	(2.9e3, 1.4e4, 4.9e4)	(4.3e3, 2.1e4, 7.2e4)	22.94
(2.5e3, 2.4e4)	322.1	(2.1e4, 2.1e5)	20.1	7.5e3	17.1	(5.1e3, 2.5e4, 8.5e4)	(7.5e3, 3.7e4, 1.3e5)	83.33
(3.3e3, 3.2e4)	625.2	(2.8e4, 2.9e5)	23.1	1.0e4	19.6	(6.7e3, 3.3e4, 1.1e5)	(1.0e4, 4.9e4, 1.7e5)	145.8

Equations

$$\mathcal{M}=(X,\,U,\,D,\,p_{D},\,(x_{0},\,l_{0}),\,AP,\,L,\,p_{L},\,c_{D}), \tag{1}$$

$$\textbf{Cost}(R_{\infty})=\liminf_{n\rightarrow\infty}\frac{1}{n}\sum_{t=0}^{n}c_{D}(x_{t},\,u_{t}),$$

$${Pr}_{\mathcal{M}}^{\boldsymbol{\mu}}(\boldsymbol{R}_{\infty})=\prod_{t=0}^{T}p_{D}(x_{t},\,u_{t},\,x_{t+1})\cdot p_{L}(x_{t},\,l_{t})\cdot\mu_{t}(R_{t},\,u_{t}),$$

$${Pr}_{\mathcal{M}}^{\boldsymbol{\mu}}(\varphi)={Pr}_{\mathcal{M}}^{\boldsymbol{\mu}}\{R_{\infty}\,|\,L_{\infty}\models\varphi\},$$

$$\min_{\boldsymbol{\mu}\in\overline{\boldsymbol{\mu}}}\;\mathbb{E}_{\mathcal{M}}^{\boldsymbol{\mu}}\{\textbf{Cost}(R_{\infty})\}\quad\text{s.t.}\quad\textbf{Risk}_{\mathcal{M}}^{\boldsymbol{\mu}}(\varphi)\leq\gamma,$$

$$\mathcal{P}=(S,\,U,\,E,\,p_{E},\,c_{E},\,s_{0},\,\text{Acc}_{\mathcal{P}}),$$

$$p_{E}\big(\langle x,l,q\rangle,\,u,\,\langle\check{x},\check{l},\check{q}\rangle\big)=p_{D}(x,\,u,\,\check{x})\cdot p_{L}(\check{x},\,\check{l})$$

$$\begin{split}&\min_{\boldsymbol{\pi}\in\overline{\boldsymbol{\pi}}}\;\;\bigg[\textbf{C}_{\texttt{pre}}(S_{c})\triangleq\;\mathbb{E}^{\boldsymbol{\pi}}_{\mathcal{Z}_{\texttt{pre}}}\Big\{\sum_{t=0}^{\infty}c_{\texttt{p}}(s_{t},u_{t})\Big\}\bigg]\\ &\text{s.t.}\;\;{Pr}_{s_{0}}^{\boldsymbol{\pi}}(\Diamond S_{c})\geq 1-\gamma,\end{split}$$

$$\begin{split}&\min_{\{y_{s,u}\}}\;\;\bigg[\textbf{C}_{\texttt{pre}}(S_{c})\triangleq\sum_{(s,u)}\sum_{\check{s}\in S_{\texttt{p}}}y_{s,u}\,p_{\texttt{p}}(s,u,\check{s})\,c_{\texttt{p}}(s,u)\bigg]\\ &\text{s.t.}\;\;\sum_{(s,u)}\sum_{\check{s}\in S_{c}}y_{s,u}\,p_{\texttt{p}}(s,u,\check{s})\geq 1-\gamma;\\ &\sum_{u\in U(\check{s})}y_{\check{s},u}=\sum_{(s,u)}y_{s,u}\,p_{\texttt{p}}(s,u,\check{s})+\mathbbm{1}(\check{s}=s_{0}),\;\;\forall\check{s}\in S_{n};\\ &y_{s,u}\geq 0,\;\;\forall s\in S_{n},\;\forall u\in U(s),\end{split} \tag{8}$$
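
To make the role of the occupancy variables $y_{s,u}$ in the prefix LP more concrete, the following Python sketch evaluates the objective and the three constraint families on a hypothetical one-state example. Everything in it is an assumption made for illustration: the states, actions, probabilities and costs are invented, $S_{n}$ is read as the set of transient states, and a brute-force grid scan stands in for a real LP solver such as the Gurobi solver mentioned in Table II.

```python
# Toy instance (all numbers hypothetical): one transient state "s0" in S_n,
# target set S_c = {"goal"}, and an absorbing "bad" state outside S_c.
p_p = {
    ("s0", "safe"): {"goal": 1.0},
    ("s0", "risky"): {"goal": 0.7, "bad": 0.3},
}
c_p = {("s0", "safe"): 5.0, ("s0", "risky"): 1.0}
S_n, S_c, s0 = {"s0"}, {"goal"}, "s0"


def prefix_cost(y):
    """Objective: sum over (s,u) and successors of y[s,u] * p_p(s,u,s') * c_p(s,u)."""
    return sum(y[su] * p * c_p[su] for su in y for p in p_p[su].values())


def reach_probability(y):
    """Left-hand side of the first constraint: probability flow into S_c."""
    return sum(y[su] * p for su in y for nxt, p in p_p[su].items() if nxt in S_c)


def is_feasible(y, gamma, tol=1e-9):
    """Check non-negativity, flow balance on S_n, and the risk bound."""
    if any(v < -tol for v in y.values()):
        return False
    for s in S_n:
        outflow = sum(v for (x, _), v in y.items() if x == s)
        inflow = sum(v * p_p[su].get(s, 0.0) for su, v in y.items())
        if abs(outflow - inflow - (1.0 if s == s0 else 0.0)) > tol:
            return False
    return reach_probability(y) >= 1.0 - gamma - tol


def best_on_grid(gamma, steps=1000):
    """Brute-force stand-in for an LP solver: scan the share r of the risky action."""
    candidates = []
    for k in range(steps + 1):
        r = k / steps
        y = {("s0", "safe"): 1.0 - r, ("s0", "risky"): r}
        if is_feasible(y, gamma):
            candidates.append((prefix_cost(y), r))
    return min(candidates)  # (lowest cost, share of the risky action)
```

In this toy example a larger allowed risk $\gamma$ lets the policy shift more weight to the cheaper but riskier action, which is the cost versus risk trade-off that the constraint $\geq 1-\gamma$ controls.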

$$\overline{\textbf{C}}_{\texttt{suf}}(P_{a})\triangleq\sum_{t=0}^{N_{a}}c_{D}(s_{t},\,u_{t})$$

$$\textbf{C}_{\texttt{suf}}(S_{c}^{\prime},\,U_{c}^{\prime})=\mathbb{E}^{\boldsymbol{\pi}}_{P_{a}\in\boldsymbol{P}_{a}}\{\overline{\textbf{C}}_{\texttt{suf}}(P_{a})\},$$

$$y_{0}(s)=\sum_{\check{s}\in S_{n}^{\prime}}\sum_{u\in U_{\texttt{p}}(\check{s})}p_{\texttt{p}}(\check{s},u,s)\,y_{\texttt{pre}}(\check{s},u),\;\;\forall s\in(S_{c}^{\prime}\backslash I_{c}^{\prime})\cup I_{\texttt{out}},$$

$$\begin{split}&\min_{\{z_{s,u}\}}\;\;\bigg[\textbf{C}_{\texttt{suf}}(S_{c}^{\prime},U_{c}^{\prime})\triangleq\sum_{(s,u)}\sum_{\check{s}\in S_{\texttt{e}}}z_{s,u}\,p_{\texttt{e}}(s,u,\check{s})\,c_{\texttt{e}}(s,u)\bigg]\\ &\text{s.t.}\;\;\sum_{(s,u)}\sum_{\check{s}\in I_{\texttt{in}}}z_{s,u}\,p_{\texttt{e}}(s,u,\check{s})=\sum_{s\in S_{\texttt{e}}^{\prime}}y_{0}(s);\\ &\sum_{u\in U_{\texttt{e}}(s)}z_{s,u}=\sum_{(\check{s},u)}z_{\check{s},u}\,p_{\texttt{e}}(\check{s},u,s)+y_{0}(s),\;\;\forall s\in S_{\texttt{e}}^{\prime};\\ &z_{s,u}\geq 0,\;\;\forall s\in S_{\texttt{e}}^{\prime},\;\forall u\in U_{\texttt{e}}(s);\end{split} \tag{11}$$

$$y_{0}(s)=\sum_{(\check{s},u)}p_{\texttt{p}}(\check{s},u,s)\,y_{\texttt{prex}}(\check{s},u),\;\;\forall s\in(S_{c}^{\prime}\backslash I_{c}^{\prime})\cup I_{\texttt{out}},$$

$$\begin{split}&\min_{\{z_{s,u}\}}\;\;\bigg[\text{C}_{\texttt{sufx}}(S_{c}^{\prime},d)\triangleq\sum_{(\check{s},u)}\Big(\sum_{s\in S_{\texttt{r}}^{\prime\prime}}\eta(\check{s},u,s)\,c_{\texttt{r}}(\check{s},u)+\eta(\check{s},u,s_{bad})\,d\Big)\bigg]\\ &\text{s.t.}\;\;\sum_{u\in U_{\texttt{r}}(s)}z_{s,u}=\sum_{(\check{s},u)}\eta(\check{s},u,s)+y_{0}(s),\;\;\forall s\in S_{\texttt{r}}^{\prime};\\ &\sum_{(\check{s},u)}\;\bigg(\sum_{s\in I_{\texttt{in}}}\eta(\check{s},u,s)+\eta(\check{s},u,s_{bad})\bigg)=\sum_{s\in S_{\texttt{r}}^{\prime}}y_{0}(s);\\ &z_{s,u}\geq 0,\;\;\forall s\in S_{\texttt{r}}^{\prime},\;\forall u\in U_{\texttt{r}}(s);\end{split} \tag{12}$$

$$\min_{\{y_{s,u},\,z_{s,u}\}}\;\;\beta\cdot\textbf{C}_{\texttt{pre}}(S_{c})+(1-\beta)\sum_{(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}}\textbf{C}_{\texttt{suf}}(S_{c}^{\prime},U_{c}^{\prime}), \tag{13}$$

$$\min_{\{y_{s,u},\,z_{s,u}\}}\;\;\beta\cdot\text{C}_{\texttt{prex}}(S_{c})+(1-\beta)\sum_{S_{c}^{\prime}\in\Omega_{acc}}\text{C}_{\texttt{sufx}}(S_{c}^{\prime},d), \tag{14}$$

$$\kappa(s_{d},\,u)\triangleq\sum_{\check{s}\in S_{c}\cup S_{n}}\frac{D(l,\,\chi(q,\check{q}))}{|\chi(q,\check{q})|}\cdot p_{E}(x,\,u,\,\check{x})\cdot p_{L}(\check{x},\,\check{l}),$$

$$\boldsymbol{\pi}^{\star}(s_{d},\,u)=\begin{cases}1&\text{for }u=\text{argmin}_{u\in U(s_{d})}\,\kappa(s_{d},\,u);\\ 0&\text{for other }u\in U(s_{d}),\end{cases}$$

$$\boldsymbol{\mu}^{\star}(X_{t},\,L_{t})=\boldsymbol{\pi}^{\star}(s_{t}),$$

$$Pr(s_{t}\notin S_{d},\,\forall t\in[0,\,T])\geq(1-\gamma_{\texttt{prex}})\cdot(1-\gamma_{\texttt{sufx}}(d))^{N_{s}},$$

$$\varphi_{1}=(\Diamond(\texttt{b1}\wedge\Diamond(\texttt{b2}\wedge\Diamond\,\texttt{b3})))\wedge(\square\neg\texttt{Obs})\wedge(\Diamond\square\,\texttt{b3}).$$

$$\varphi_{2}=(\square\Diamond\,\texttt{b1})\wedge(\square\Diamond\,\texttt{b2})\wedge(\square\Diamond\,\texttt{b3})\wedge(\square\neg\texttt{Obs}).$$

$$\varphi=\varphi_{\texttt{all\_base}}\wedge\varphi_{\texttt{order}}\wedge(\square\neg\texttt{Obs}),$$


Code & Models

Repositories

MengGuo/P_MDP_TG (official)


Taxonomy

Topics: Robotic Path Planning Algorithms · Formal Methods in Verification · Reinforcement Learning in Robotics

Full text

Probabilistic Motion Planning under Temporal Tasks and Soft Constraints

Meng Guo and Michael M. Zavlanos

The authors are with the Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC 27708 USA. Emails: meng.guo, [email protected]. This work is supported in part by NSF under grant IIS #1302283.

Abstract

This paper studies motion planning of a mobile robot under uncertainty. The control objective is to synthesize a finite-memory control policy, such that a high-level task specified as a Linear Temporal Logic (LTL) formula is satisfied with a desired high probability. Uncertainty is considered in the workspace properties, robot actions, and task outcomes, giving rise to a Markov Decision Process (MDP) that models the proposed system. Different from most existing methods, we consider cost optimization both in the prefix and suffix of the system trajectory. We also analyze the potential trade-off between reducing the mean total cost and maximizing the probability that the task is satisfied. The proposed solution is based on formulating two coupled Linear Programs, for the prefix and suffix, respectively, and combining them into a multi-objective optimization problem, which provides provable guarantees on the probabilistic satisfiability and the total cost optimality. We show that our method outperforms relevant approaches that employ Round-Robin policies in the trajectory suffix. Furthermore, we propose a new control synthesis algorithm to minimize the frequency of reaching a bad state when the probability of satisfying the tasks is zero, in which case most existing methods return no solution. We validate the above schemes via both numerical simulations and experimental studies.

Index Terms:

Markov Decision Process, Linear Temporal Logic, Chance Constrained Optimization, Motion Planning.

I Introduction

In this paper we study the problem of robot motion planning under uncertainty and temporal task specifications. We consider uncertainty in the workspace properties, robot motion and actions, and outcome of task executions, which gives rise to a Markov Decision Process (MDP) to model the proposed system. MDPs have been used extensively to model motion and sensing uncertainty in robotics [1, 2] and then solve decision making problems that optimize a given control objective. The most common objective is to reach a goal state from an initial state while minimizing the cost. The resulting solution is a policy that maps states to actions [2]. On the other hand, Linear Temporal Logic (LTL) provides a formal language to describe complex high-level tasks beyond the classic start-to-goal navigation. A LTL task formula is usually specified with respect to an abstraction of the robot motion within the allowed workspace [3], modeled by a deterministic finite transition system (FTS). Then a high-level discrete plan is found using off-the-shelf model-checking algorithms [4], which is then executed through low-level continuous controllers [3, 5]. This framework is extended to allow for both robot motion and actions in the task specification [6] and partially-known or dynamic workspaces in [7, 8].

Recently, there have been many efforts to address the problem of synthesizing a control policy for a MDP that satisfies high-level temporal tasks specified in various formal languages. Different classes of Probabilistic Computation Tree Logic (PCTL) formulas have been studied in [9] for abstraction and verification over Interval-valued Markov Chains. The work in [10] proposes a control policy for a mobile robot that maximizes the probability of satisfying a bounded linear temporal logic (BLTL) formula. Syntactical co-safe LTL formulas (sc-LTL) are considered in [11] for a deterministic robot that co-exists with other robots whose behavior is modeled as a MDP. A FTS with time-varying rewards is controlled to satisfy a LTL formula and maximize the accumulated reward in [12]. A robust control policy for MDPs with uncertain transition probabilities is proposed in [8]. A verification toolbox is provided in [13] for probabilistic discrete-time or continuous-time Markov Chain (MC), under a wide variety of quantitative properties expressed in PCTL, LTL, CTL, and so on.

In this work, we study motion planning of a mobile robot under uncertainty in both robot motion and workspace properties. The goal is to synthesize a finite-memory control policy that generates robot trajectories that satisfy a high-level LTL task formula with desired high probability. At the same time, we optimize the total cost both in the prefix and suffix parts of the system trajectories. Our proposed approach is based on solving two coupled Linear Programs, one for the prefix and one for the suffix, over the occupancy measures of the product automaton introduced in [14]. Moreover, we explore cases where the probability of satisfying the LTL tasks is zero, so that an Accepting End Component (AEC) does not exist in the MDP, where most relevant work returns no solutions. To address such situations, we treat satisfaction of the tasks as soft constraints and propose a relaxed suffix plan that minimizes the frequency with which the system enters bad states that violate the task specifications. We show that our approach outperforms the widely-used Round-Robin policy, via both numerical simulations and experimental studies. We also compare our proposed method with the widely-used probabilistic model-checking tool PRISM [13].

Our work is related to literature on (i) policy synthesis for MDPs under multiple objectives; (ii) cost optimization within AECs in MDPs; and (iii) infeasible temporal tasks. We discuss below this literature and highlight our contributions.

Since we consider both temporal tasks and total-cost criteria over MDPs, this work is closely related to policy synthesis of MDPs under multiple objectives. The work in [14] proposes a framework with provable correctness to synthesize a control policy for MDPs under multiple constrained total-cost criteria. A survey on multi-objective decision-making for MDPs can be found in [15]. On the other hand, verification of MDPs under multiple high-level tasks is addressed in [16], where the probability of satisfying each subtask is lower-bounded by a given value. Moreover, a quantitative multi-objective verification scheme is proposed in [17, 18] for numerical queries over probabilistic reward predicates.

On the other hand, the seminal works [19, 20] consider MDPs with multi-dimensional weights under multi-percentile queries that may be conflicting. However, most of the above work does not address cost optimization over the suffix of the system trajectory within the AECs, nor does it address the case where no AECs can be found in the product automaton; these two aspects are the main contributions here.

The satisfaction of a LTL formula is associated with reaching the corresponding AECs. In particular, in [4, Chapter 10], a value iteration method is used to solve the maximal reachability problem towards the AECs to obtain a policy for the plan prefix. For planning within the AECs, [21, 17, 4] adopt the Round-Robin policy, which guarantees only correctness but not optimality. Optimal policies for the plan suffix that keeps the system within the AECs have been proposed in [22, 23, 24, 25]. Specifically, in [22] the expected cost of satisfying instances of a desired property is minimized, while in [23] the minimal bottleneck cost is considered. Both approaches in [22, 23] require particular types of LTL formulas (such as “always eventually”). The work in [24, 26] considers MDPs with $\omega$ -regular specifications and quantitative resource constraints within the AECs. The work in [25] investigates the Pareto cost of a human-in-the-loop MDP measured by a given discounted cost function. Compared to this literature, the multi-objective optimization problem that we formulate to solve the control synthesis problem allows us to explicitly characterize the trade-off between prefix and suffix optimality. We then extend this methodology to the case where no AECs can be found.

Most aforementioned work [19, 20, 27, 4, 17, 21, 22] relies on the assumption that the product automaton contains at least one AEC. However, in many situations this assumption does not hold so that the probability of satisfying the task under any policy is zero. In this case, it is still important to identify those policies that minimize the frequency with which the system will reach the bad states that violate the task specifications. Consequently, it is desirable to synthesize a policy with certain risk guarantees even when soft LTL tasks are considered that are only partially-feasible. To the best of our knowledge, there is no work on control synthesis for infeasible soft LTL task formulas defined on MDPs, especially when an AEC can not be found in the resulting product automaton. For deterministic transition systems, a framework for robot motion planning in partially-known workspaces is proposed in [7] that can handle soft LTL task formulas whose satisfiability is improved over time; a least-violating control strategy is synthesized in [28] for a set of LTL safety rules. In the case of MDPs, a relevant formulation is considered in [29] where a MDP is controlled to satisfy an $\omega$ -regular formula. A policy is proposed to ensure that the MDP enters a failure state relatively late in the prefix. However, a multi-objective criterion of the control policy, especially in the plan suffix, is not considered there. Also, recent work in [30] proposes an approach to increase the satisfaction probability by modifying the task formula which, however, only considers co-safe LTL formulas without cost optimization constraints.

In summary, the main contribution of this work is three-fold: (i) a framework that optimizes the total cost both in the plan prefix and suffix, while ensuring that the tasks are satisfied with a desired high probability; (ii) a new algorithm to synthesize the control policies that have a high probability of satisfying the task over long time intervals, for cases where an AEC does not exist; and (iii) a new method that allows the system to recover from bad states and continue the task.

The rest of the paper is organized as follows. Section II introduces necessary preliminaries. In Section III, we formalize the considered problem. Section IV presents our solution in detail, which includes four major parts. Section V demonstrates the feasibility of the results by numerical simulations. Section VI contains the experimental results. We conclude and discuss future directions in Section VII.

II Preliminaries

II-A Transient MDP

A Markov Decision Process (MDP) is defined as a 6-tuple $\mathcal{M}\triangleq(X,\,U,\,D,\,p_{D},\,c_{D},\,x_{0})$, where $X$ is the finite state space; $U$ is the finite control action space (with a slight abuse of notation, $U(x)$ also denotes the set of control actions allowed at state $x\in X$); $D=\{(x,u)\,|\,x\in X,\,u\in U(x)\}$ is the set of possible state-action pairs; $p_{D}:X\times U\times X\rightarrow{[0,1]}$ is the transition probability function so that $p_{D}(x,\,u,\,\check{x})$ is the transition probability from state $x$ to state $\check{x}$ via control action $u$ and $\sum_{\check{x}\in X}p_{D}(x,u,\check{x})=1$, $\forall(x,\,u)\in D$; $c_{D}:D\rightarrow\mathbb{R}^{>0}$ is the cost function so that $c_{D}(x,\,u)$ is the cost of performing action $u\in U(x)$ at state $x\in X$; and $x_{0}\in X$ is the initial state. Denote by $Post(x,\,u)\triangleq\{\check{x}\in X\,|\,p_{D}(x,u,\check{x})>0\}$ the set of states that can be reached from $x$ in one step under action $u$, $\forall(x,\,u)\in D$.
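
As a rough illustration of this definition, the Python sketch below stores the 6-tuple in a small container and checks the two conditions stated above (each transition distribution sums to one and every cost is positive). The class, its field names and the two-cell example are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

State, Action = str, str


@dataclass
class MDP:
    """Finite MDP (X, U, D, p_D, c_D, x_0); names are illustrative only."""
    p_D: Dict[Tuple[State, Action], Dict[State, float]]  # (x, u) -> {x': probability}
    c_D: Dict[Tuple[State, Action], float]               # (x, u) -> positive cost
    x0: State

    @property
    def D(self) -> Set[Tuple[State, Action]]:
        """Set of allowed state-action pairs."""
        return set(self.p_D)

    def U(self, x: State) -> Set[Action]:
        """Actions allowed at state x."""
        return {u for (s, u) in self.p_D if s == x}

    def post(self, x: State, u: Action) -> Set[State]:
        """States reachable from x in one step under u with positive probability."""
        return {y for y, p in self.p_D[(x, u)].items() if p > 0}

    def validate(self, tol: float = 1e-9) -> bool:
        """Check that transitions are distributions and costs are positive."""
        rows_ok = all(abs(sum(d.values()) - 1.0) < tol for d in self.p_D.values())
        costs_ok = all(self.c_D[xu] > 0 for xu in self.p_D)
        return rows_ok and costs_ok


# Hypothetical robot on two cells: "move" succeeds with probability 0.9.
mdp = MDP(
    p_D={
        ("r1", "move"): {"r2": 0.9, "r1": 0.1},
        ("r1", "stay"): {"r1": 1.0},
        ("r2", "move"): {"r1": 0.9, "r2": 0.1},
        ("r2", "stay"): {"r2": 1.0},
    },
    c_D={("r1", "move"): 2.0, ("r1", "stay"): 1.0,
         ("r2", "move"): 2.0, ("r2", "stay"): 1.0},
    x0="r1",
)
```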

The above MDP evolves by taking an action $u\in U(x)$ associated with every state $x\in X$ . Denote by $R_{T}=x_{0}u_{0}x_{1}u_{1}\cdots x_{T}u_{T}$ the past run that is a sequence of previous states and actions up to time $T\geq 0$ . As defined in [2], a control policy $\boldsymbol{\mu}=\mu_{0}\mu_{1}\cdots$ is a sequence of decision rules $\mu_{t}$ at time $t\geq 0$ . A control policy is stationary if ${\mu}_{t}=\mu$ , $\forall t\geq 0$ , where $\mu$ can be randomized so that ${\mu}:X\times U\rightarrow[0,1]$ or deterministic so that $\mu:X\rightarrow U$ , $\forall t\geq 0$ . On the other hand, a policy is history dependent or finite-memory if ${\mu}_{t}:R_{t}\times U\rightarrow[0,1]$ , where $R_{t}$ is the past history until time $t\geq 0$ .

II-B End Components

A sub-MDP of $\mathcal{M}$ is a pair $(S,\,A)$ where $S\subseteq X$ and $A:S\rightarrow 2^{U}$ such that (i) $S\neq\emptyset$, $\emptyset\neq A(s)\subseteq U(s)$, $\forall s\in S$; (ii) $Post(s,\,u)\subseteq S$, $\forall s\in S$ and $\forall u\in A(s)$. An End Component (EC) of $\mathcal{M}$ is a sub-MDP $(S,\,A)$ such that the digraph $G_{(S,A)}$ induced by $(S,A)$ is strongly connected. An end component $(S,\,A)$ is called maximal if there is no other end component $(S^{\prime},\,A^{\prime})$ such that $(S,\,A)\neq(S^{\prime},\,A^{\prime})$, $S\subseteq S^{\prime}$ and $A(s)\subseteq A^{\prime}(s)$, $\forall s\in S$. The set of Maximal End Components (MECs) of a MDP is finite and can be uniquely determined. The analysis of MECs would include each EC as a special case. We refer the readers to Definitions 10.116, 10.117 and 10.124 of [4] for details. Moreover, an Accepting MEC (AMEC) is an end component that satisfies certain accepting conditions such as the Streett and Rabin conditions, which will be defined in the sequel. On the other hand, a Strongly Connected Component (SCC) of the digraph $G_{\mathcal{M}}$ induced by $\mathcal{M}$ is a set of states $S\subseteq X$ so that there exists a path in each direction between any pair of states in $S$. Similarly, an Accepting SCC (ASCC) is a SCC that satisfies certain accepting conditions. Note that the main difference between a MEC $(S,A)$ and a SCC $S$ is that the SCC does not restrict the set of actions $U(s)$ that can be taken at each state $s\in S$. In other words, there might be paths that start from any state within the SCC and end at states outside the SCC.
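
The difference between an end component and a plain SCC can be seen on a small example. The Python sketch below checks conditions (i) and (ii) and strong connectivity for a candidate pair $(S,\,A)$; the transition table and all names are hypothetical, and the check is a direct reading of the definition rather than the MEC decomposition algorithm of [4].

```python
def successors(p_D, s, u):
    """Post(s, u): states reached with positive probability."""
    return {y for y, p in p_D[(s, u)].items() if p > 0}


def reachable(start, edges):
    """All nodes reachable from start in the digraph given by edges."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in edges[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def is_end_component(p_D, S, A):
    """Check whether (S, A) is an end component of the MDP described by p_D."""
    # (i) S is non-empty and every state keeps at least one allowed action.
    if not S or any(not A.get(s) for s in S):
        return False
    edges = {s: set() for s in S}
    for s in S:
        for u in A[s]:
            if (s, u) not in p_D:      # A(s) must be a subset of U(s)
                return False
            succ = successors(p_D, s, u)
            if not succ <= S:          # (ii) closure: Post(s, u) stays inside S
                return False
            edges[s] |= succ
    # Strong connectivity: a root reaches all of S forwards and backwards.
    reverse = {s: set() for s in S}
    for s, succ in edges.items():
        for t in succ:
            reverse[t].add(s)
    root = next(iter(S))
    return reachable(root, edges) == S and reachable(root, reverse) == S


# Hypothetical MDP: {a, b} is an SCC of the full digraph, but action "leave" exits it.
p_D = {
    ("a", "go"): {"b": 1.0},
    ("b", "back"): {"a": 1.0},
    ("b", "leave"): {"c": 1.0},
    ("c", "stay"): {"c": 1.0},
}
```

Restricting state `b` to the action `back` yields an end component, whereas allowing `leave` breaks closure even though `{a, b}` remains strongly connected in the digraph of the full MDP.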

II-C LTL and DRA

The ingredients of a Linear Temporal Logic (LTL) formula are a set of atomic propositions $AP$ and several Boolean and temporal operators. Atomic propositions are Boolean variables that can be either true or false. A LTL formula is specified according to the syntax [4]: $\varphi\triangleq\top\;|\;p\;|\;\varphi_{1}\wedge\varphi_{2}\;|\;\neg\varphi\;|\;\bigcirc\varphi\;|\;\varphi_{1}\,\textsf{U}\,\varphi_{2},$ where $\top\triangleq\texttt{True}$ , $p\in AP$ , $\bigcirc$ (next), U (until) and $\bot\triangleq\neg\top$ . For brevity, we omit the derivations of other operators like $\square$ (always), $\Diamond$ (eventually), $\Rightarrow$ (implication). The semantics of LTL is defined over the set of infinite words over $2^{AP}$ . Intuitively, $p\in AP$ is satisfied on a word $w=w(1)w(2)\ldots$ if it holds at $w(1)$ , i.e., if $p\in w(1)$ . Formula $\bigcirc\,\varphi$ holds true if $\varphi$ is satisfied on the word suffix that begins in the next position $w(2)$ , whereas $\varphi_{1}\,\textsf{U}\,\varphi_{2}$ states that $\varphi_{1}$ has to be true until $\varphi_{2}$ becomes true. Finally, $\Diamond\,\varphi$ and $\square\,\varphi$ are true if $\varphi$ holds on $w$ eventually and always, respectively. We refer the readers to Chapter 5 of [4] for the full definition.

The set of words that satisfy a LTL formula $\varphi$ over $AP$ can be captured through a Deterministic Rabin Automaton (DRA) $\mathcal{A}_{\varphi}$ [4], defined as $\mathcal{A}_{\varphi}=(Q,\,2^{AP},\,\delta,\,q_{0},\,\text{Acc}_{\mathcal{A}})$, where $Q$ is a set of states; $2^{AP}$ is the alphabet; $\delta\subseteq Q\times 2^{AP}\times{Q}$ is a transition relation; $q_{0}\in Q$ is the initial state; and $\text{Acc}_{\mathcal{A}}\subseteq 2^{Q}\times 2^{Q}$ is a set of accepting pairs, i.e., $\text{Acc}_{\mathcal{A}}=\{(H^{1}_{\mathcal{A}},I^{1}_{\mathcal{A}}),(H^{2}_{\mathcal{A}},I^{2}_{\mathcal{A}}),\cdots,(H^{N}_{\mathcal{A}},I^{N}_{\mathcal{A}})\}$ where $H^{i}_{\mathcal{A}},\,I^{i}_{\mathcal{A}}\subseteq Q$, $\forall i=1,2,\cdots,N$. An infinite run $q_{0}q_{1}q_{2}\cdots$ of $\mathcal{A}_{\varphi}$ is accepting if there exists at least one pair $(H^{i}_{\mathcal{A}},\,I^{i}_{\mathcal{A}})\in\text{Acc}_{\mathcal{A}}$ such that $\exists n\geq 0$ with $q_{m}\notin H^{i}_{\mathcal{A}}$, $\forall m\geq n$, and $\overset{\infty}{\exists}n\geq 0$, $q_{n}\in I^{i}_{\mathcal{A}}$, where $\overset{\infty}{\exists}$ stands for “there exist infinitely many”. Namely, this run should intersect with $H^{i}_{\mathcal{A}}$ finitely many times and with $I^{i}_{\mathcal{A}}$ infinitely many times. There are translation tools [31] to obtain $\mathcal{A}_{\varphi}$ given $\varphi$, which first translate the LTL formula to the associated Nondeterministic Büchi Automaton (NBA) and then to the DRA, with complexity $2^{2^{\mathcal{O}(n\log n)}}$, where $n$ is the length of $\varphi$. Our implementation of the Python interface for [31] can be found in [32]. Note that [31] allows for different levels of automata simplifications to be made regarding the size of $\mathcal{A}_{\varphi}$, and a simplified automaton may result in loss of optimality.
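
The Rabin acceptance condition is easy to check for runs that are ultimately periodic, i.e. a finite prefix followed by a cycle repeated forever, because the states visited infinitely often are exactly those on the cycle. The Python sketch below covers only this special case; the function name and the example pair are hypothetical.

```python
def rabin_accepts_lasso(cycle_states, acc_pairs):
    """Rabin acceptance for a run of the form prefix . (cycle)^omega.

    cycle_states: DRA states on the repeated cycle (the finite prefix is
                  irrelevant, since its states are visited only finitely often).
    acc_pairs:    iterable of (H, I) pairs, each a set of DRA states.
    """
    visited_infinitely_often = set(cycle_states)
    return any(
        not (visited_infinitely_often & H)      # H is met only finitely often
        and bool(visited_infinitely_often & I)  # I is met infinitely often
        for H, I in acc_pairs
    )


# Hypothetical single accepting pair: avoid q2 eventually, visit q1 forever.
acc = [({"q2"}, {"q1"})]
```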

III Problem Formulation

III-A Mathematical Model

In order to model uncertainty in both the robot motion and the workspace properties, we extend the definition of a MDP from Section II-A to include probabilistic labels, as the probabilistically-labeled MDP:

$$\mathcal{M}=(X,\,U,\,D,\,p_{D},\,(x_{0},\,l_{0}),\,AP,\,L,\,p_{L},\,c_{D}), \tag{1}$$

where $AP$ is a set of atomic propositions that capture the properties of interest in the workspace; $L:X\rightarrow 2^{2^{AP}}$ contains the set of property subsets that can be true at each state; and $p_{L}:X\times 2^{AP}\rightarrow{[0,\,1]}$ specifies the associated probability. Particularly, $p_{L}(x,\,l)$ denotes the probability that state $x\in X$ satisfies the set of propositions $l\subset AP$ . Note that $\sum_{l\in L(x)}p_{L}(x,\,l)=1$ , $\forall x\in X$ . Moreover, $(x_{0},\,l_{0})$ contains the initial state $x_{0}\in X$ and the initial label $l_{0}\in L(x_{0})$ , while the rest of the notations in (1) are the same as defined in Section II-A. The probabilistic labeling function provides a way to consider time-varying and dynamic workspace properties. Moreover, there is a LTL task formula $\varphi$ specified over the same set of atomic propositions $AP$ , as the desired behavior of $\mathcal{M}$ . We assume that the MDP $\mathcal{M}$ in (1) is fully-observable due to the following assumption.

Assumption 1.

At any stage $t\geq 0$ , the current robot state $x_{t}\in X$ and its label $l_{t}\in L(x_{t})$ are fully-observable. $\blacksquare$

While the robot is moving within the workspace, it is capable of sensing the actual properties and determining the label of the state at which it is located. At stage $T\geq 0$, the robot’s past path is given by $X_{T}=x_{0}x_{1}\cdots x_{T}\in X^{(T+1)}$, the past sequence of observed labels is given by $L_{T}=l_{0}l_{1}\cdots l_{T}\in(2^{AP})^{(T+1)}$ and the past sequence of control actions is $U_{T}=u_{0}u_{1}\cdots u_{T}\in U^{(T+1)}$. It holds that $p_{D}(x_{t},u_{t},x_{t+1})>0$ and $p_{L}(x_{t},l_{t})>0$, $\forall t\geq 0$. These three sequences can be composed into the complete past run $R_{T}=x_{0}l_{0}u_{0}x_{1}l_{1}u_{1}\cdots x_{T}l_{T}u_{T}$. Denote by $\boldsymbol{X}_{T}$, $\boldsymbol{L}_{T}$ and $\boldsymbol{R}_{T}$ the set of all possible past sequences of states, labels, and runs up to stage $T$. We set $T=\infty$ for infinite sequences.

Definition 1.

The mean total cost [33, 2] of an infinite robot run $R_{\infty}$ of $\mathcal{M}$ is defined as

[TABLE]

where $R_{\infty}=x_{0}l_{0}u_{0}x_{1}l_{1}u_{1}\cdots\in\boldsymbol{R}_{\infty}$ . $\blacksquare$

As discussed in [33, 24, 20, 2], the above mean total cost is called the mean-payoff function (or limit-average), where the “ $\limsup$ ” operator is needed as the limit of the average might not exist for some runs; see [33, 24, 34].
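As a small numerical illustration of the limit-average, the following sketch computes the running averages of the per-stage cost along a finite prefix of a run. The function name `running_mean_cost` and the alternating cost sequence are hypothetical; only the averaging itself follows the mean-payoff idea described above.

```python
from itertools import accumulate

def running_mean_cost(stage_costs):
    """Return the averages (1/n) * (sum of the first n stage costs), n = 1..len."""
    totals = accumulate(stage_costs)
    return [total / n for n, total in enumerate(totals, start=1)]

# Hypothetical run that alternates between a cheap and an expensive action.
stage_costs = [1.0, 3.0] * 500
averages = running_mean_cost(stage_costs)
```

For this periodic run the running average settles at the mean cost per stage over one period, which is the quantity the mean total cost captures in the long run.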

Our goal is to find a finite-memory policy for $\mathcal{M}$ , denoted by $\boldsymbol{\mu}=\mu_{0}\mu_{1}\cdots$ . The control policy at stage $t\geq 0$ is given by $\mu_{t}:\mathbf{R}_{t}\times U\rightarrow[0,1]$ , where $\mathbf{R}_{t}$ is the past run $R_{t}$ , $\forall t\geq 0$ . Denote by $\overline{\boldsymbol{\mu}}$ the set of all such policies. Given a control policy $\boldsymbol{\mu}\in\overline{\boldsymbol{\mu}}$ , the probability measure $\text{Pr}^{\mathcal{M}}_{\boldsymbol{\mu}}(\cdot)$ on the smallest $\sigma$ -algebra, over all possible infinite sequences $\boldsymbol{R}_{\infty}$ that contain $R_{T}$ , is the unique measure [4] given by

[TABLE]

where $\mu(\boldsymbol{R}_{t},u_{t})$ is defined as the probability of choosing action $u_{t}$ given the past run $\boldsymbol{R}_{t}$ . Then we define the probability of $\mathcal{M}$ satisfying $\varphi$ under policy $\boldsymbol{\mu}$ by:

[TABLE]

where the satisfaction relation “ $\models$ ” is introduced in Section II-C, given an infinite word and a LTL formula. Accordingly, the risk is defined as the probability that the task formula $\varphi$ is not satisfied by $\mathcal{M}$ under the policy $\boldsymbol{\mu}$ , namely, $\textbf{Risk}_{{\mathcal{M}}}^{\boldsymbol{\mu}}(\varphi)=1-{Pr}_{{\mathcal{M}}}^{\boldsymbol{\mu}}(\varphi)$ .

Problem 1.

Given the labeled MDP $\mathcal{M}$ defined in (1) and the task specification $\varphi$ , our goal is to solve:

[TABLE]

where ${\gamma\geq 0}$ is a pre-defined parameter as the allowed risk; the optimal policy minimizes the mean total cost and ensures that the risk of violating $\varphi$ remains bounded by $\gamma$ . $\blacksquare$

Note that the traditional definition of un-discounted expected total cost over an infinite run from [14, 2] is not used here, as it is infinite except for the special case of transient MDPs defined in Section II-A. However, in this work, the model $\mathcal{M}$ is not restricted to be transient. Moreover, the discounted total cost in [2] is not used here either, for two reasons: first, it is not obvious how to choose the discount factor for various control tasks $\varphi$ [25]; and second, we are more interested in optimizing the repetitive long-term behavior of the system, rather than the short-term one [20]. In-depth discussions on the optimization of infinite-horizon un-discounted or discounted total-cost criteria over MDPs with or without constraints can be found in [2].

Remark 1.

Different from the maximal reachability problem addressed in [4, 21], a deterministic policy would not suffice here. Instead, randomization is required due to the mean total-cost criterion and the risk constraint, similar to [14]. $\blacksquare$

IV Solution

This section contains the three major parts of the proposed solution: (i) the construction of the product automaton and its AMECs; (ii) the algorithms to synthesize the optimal plan prefix and suffix, for both cases where the AMECs exist or not; (iii) the complete policy, and the online execution algorithm.

IV-A Product Automaton and AMECs

To begin with, we construct the DRA $\mathcal{A}_{\varphi}$ associated with the LTL task formula $\varphi$ via the translation tools [31, 32]. Let it be $\mathcal{A}_{\varphi}=(Q,\,2^{AP},\,\delta,\,q_{0},\,\text{Acc}_{\mathcal{A}})$ , where the notations are defined in Section II-C. Then we construct a product automaton between the robot model $\mathcal{M}$ and the DRA $\mathcal{A}_{\varphi}$ .

Definition 2.

Denote by $\mathcal{P}$ the product $\mathcal{M}\times\mathcal{A}_{\varphi}$ as a 7-tuple:

[TABLE]

where: the set of states $S\subseteq X\times 2^{AP}\times Q$ is so that $\langle x,\,l,\,q\rangle\in S$ , $\forall x\in X$ , $\forall l\in L(x)$ and $\forall q\in Q$ ; the action set $U$ is the same as in (1) and $U(s)=U(x)$ , $\forall s=\langle x,l,q\rangle\in S$ ; $E=\{(s,u)\,|\,s\in S,\,u\in U(s)\}$ ; the transition probability $p_{E}:S\times U\times S\rightarrow{[0,\,1]}$ is so that

[TABLE]

where (i) $\langle x,l,q\rangle,\,\langle\check{x},\check{l},\check{q}\rangle\in S$ ; (ii) $(x,u)\in D$ ; and (iii) $\check{q}\in\delta(q,\,l)$ ; the cost function $c_{E}:E\rightarrow\mathbb{R}^{>0}$ is so that $c_{E}\big{(}\langle x,l,q\rangle,\,u\big{)}=c_{D}(x,u)$ , $\forall\big{(}\langle x,l,q\rangle,\,u\big{)}\in E$ . Namely, the label $l$ should fulfill the transition condition from $q$ to $\check{q}$ in $\mathcal{A}_{\varphi}$ ; the single initial state is $s_{0}=\langle x_{0},l_{0},q_{0}\rangle\in S$ ; the accepting pairs are defined as $\text{Acc}_{\mathcal{P}}=\{(H^{i}_{\mathcal{P}},\,I^{i}_{\mathcal{P}}),i=1,2,\cdots,N\}$ , where $H^{i}_{\mathcal{P}}=\{\langle x,l,q\rangle\in S\,|\,q\in H^{i}_{\mathcal{A}}\}$ and $I^{i}_{\mathcal{P}}=\{\langle x,l,q\rangle\in S\,|\,q\in I^{i}_{\mathcal{A}}\}$ , $\forall i=1,2,\cdots,N$ . $\blacksquare$
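The transition rule of the product can be sketched in a few lines. This sketch assumes the common form in which the product probability is $p_{D}(x,u,\check{x})\cdot p_{L}(\check{x},\check{l})$ when conditions (i) to (iii) hold and zero otherwise; the dictionaries `p_D`, `p_L`, `delta` and the two-state model are hypothetical stand-ins, not the authors' implementation.

```python
def product_transition(p_D, p_L, delta, s, u, s_next):
    """Probability of moving from s=(x, l, q) to s_next=(x2, l2, q2) under u.

    Assumed form: p_D(x, u, x2) * p_L(x2, l2) if q2 == delta(q, l), else 0.
    """
    (x, l, q), (x2, l2, q2) = s, s_next
    if delta.get((q, l)) != q2:
        return 0.0
    return p_D.get((x, u, x2), 0.0) * p_L.get((x2, l2), 0.0)

# Hypothetical two-state model: x1 may hold an obstacle, x2 is a base.
p_D = {("x1", "f", "x2"): 1.0, ("x2", "f", "x1"): 1.0}
p_L = {("x1", frozenset()): 0.99, ("x1", frozenset({"obs"})): 0.01,
       ("x2", frozenset({"b"})): 1.0}
# Hypothetical automaton fragment: q0 loops unless an obstacle is read.
delta = {("q0", frozenset()): "q0", ("q0", frozenset({"b"})): "q0",
         ("q0", frozenset({"obs"})): "q1"}

s = ("x2", frozenset({"b"}), "q0")
p_free = product_transition(p_D, p_L, delta, s, "f", ("x1", frozenset(), "q0"))
p_obs = product_transition(p_D, p_L, delta, s, "f", ("x1", frozenset({"obs"}), "q0"))
p_bad_q = product_transition(p_D, p_L, delta, s, "f", ("x1", frozenset(), "q1"))
```

Note that the automaton component of the successor depends on the label $l$ of the current state, so the obstacle label drawn at the successor only affects the automaton one step later.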

The product $\mathcal{P}$ computes the intersection between all traces of $\mathcal{M}$ and all words that are accepted by $\mathcal{A}_{\varphi}$ , to find all admissible robot behaviors that satisfy the task $\varphi$ . It combines the uncertainty in robot motion and the workspace model by including both $x$ and $l$ in the states. The Rabin accepting condition of $\mathcal{P}$ is defined as follows: An infinite path $R_{\mathcal{P}}=s_{0}s_{1}\cdots$ of $\mathcal{P}$ is accepting if for at least one pair $(H^{i}_{\mathcal{P}},I^{i}_{\mathcal{P}})\in\text{Acc}_{\mathcal{P}}$ it holds that $R_{\mathcal{P}}$ intersects with $H^{i}_{\mathcal{P}}$ finitely often while with $I^{i}_{\mathcal{P}}$ infinitely often. To transform this condition into equivalent graph properties, we need to compute the AMECs of $\mathcal{P}$ associated with its accepting pairs $\text{Acc}_{\mathcal{P}}$ . A detailed definition of MECs is given in Section II-B.

In order to find the complete set of AMECs of $\mathcal{P}$ , for each pair $(H^{i}_{\mathcal{P}},I^{i}_{\mathcal{P}})\in\text{Acc}_{\mathcal{P}}$ , perform the following steps:

(i) Build the MDP $\mathcal{Z}_{i}^{\neg H}\triangleq(S^{\prime},\,U^{\prime},\,E^{\prime},\,p^{\prime}_{E})$ , where $S^{\prime}=S_{i}^{\neg H}\cup\{\nu\}$ is the set of states with $S_{i}^{\neg H}=S\backslash H^{i}_{\mathcal{P}}$ and $\nu$ a trap state; $U^{\prime}=U\,\cup\,\{\tau_{0}\}$ is the set of actions where $\tau_{0}$ is a pseudo action; $E^{\prime}\subset S^{\prime}\times U^{\prime}$ is the set of transitions with the associated probability $p^{\prime}_{E}$ which are defined by three cases: (a) for the transitions within $S_{i}^{\neg H}$ it holds that $(s,\,u)\in E^{\prime}$ and $p^{\prime}_{E}(s,\,u,\,\check{s})=p_{E}(s,\,u,\,\check{s})$ , $\forall(s,\,u)\in E$ where $s,\,\check{s}\in S_{i}^{\neg H}$ ; (b) for the transitions from $S_{i}^{\neg H}$ to outside $S_{i}^{\neg H}$ it holds that $(s,\,u)\in E^{\prime}$ and $p^{\prime}_{E}(s,\,u,\,\nu)=\sum_{\check{s}\notin S_{i}^{\neg H}}p_{E}(s,\,u,\,\check{s})$ , $\forall(s,\,u)\in E$ where $s\in S_{i}^{\neg H}$ ; and (c) the trap state is included in a self-loop such that $(\nu,\,\tau_{0})\in E^{\prime}$ and $p^{\prime}_{E}(\nu,\,\tau_{0},\,\nu)=1$ . Simply speaking, all transitions from inside $S_{i}^{\neg H}$ to outside $S_{i}^{\neg H}$ are transformed to transitions to the trap state $\nu$ .

(ii) Determine all MECs of $\mathcal{Z}_{i}^{\neg H}$ above via Algorithm 47 in [4], which is based on splitting the strongly connected components (SCCs) of $\mathcal{Z}_{i}^{\neg H}$ until the conditions of being an end component are fulfilled. Our implementation for this algorithm can be found in [32]. Denote by $\Xi^{i}=\{(S^{\prime}_{1},\,U^{\prime}_{1}),(S^{\prime}_{2},\,U^{\prime}_{2}),\cdots(S^{\prime}_{C_{i}},\,U^{\prime}_{C_{i}})\}$ the set of MECs, where $S^{\prime}_{c}\subset S^{\prime}$ and $U^{\prime}_{c}:S^{\prime}_{c}\rightarrow 2^{U^{\prime}}$ , $\forall c=1,2,\cdots,C_{i}$ . Note that $S^{\prime}_{c}\cap S^{\prime}_{c^{\prime}}=\emptyset$ , $\forall(S^{\prime}_{c},U^{\prime}_{c}),(S^{\prime}_{c^{\prime}},U^{\prime}_{c^{\prime}})\in\Xi^{i}$ .

(iii) Find $(S^{\prime}_{c},\,U^{\prime}_{c})\in\Xi^{i}$ that is accepting, i.e., it satisfies $\nu\notin S^{\prime}_{c}$ and $S^{\prime}_{c}\cap I^{i}_{\mathcal{P}}\neq\emptyset$ . Save the AMECs in $\Xi^{i}_{acc}$ . Since $\Xi^{i}_{acc}$ is computed for each $(H^{i}_{\mathcal{P}},I^{i}_{\mathcal{P}})\in\text{Acc}_{\mathcal{P}}$ , we denote by $\Xi_{acc}=\{\Xi^{i}_{acc},\,i=1,\cdots,N\}$ the complete set of AMECs of $\mathcal{P}$ .
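Steps (i) and (iii) above can be sketched as follows. The MEC decomposition of Step (ii) is not implemented here; the list `mecs` is supplied by hand for a hypothetical four-state product. The names `TRAP`, `TAU0`, `build_trap_mdp` and `accepting_mecs` are illustrative and are not taken from the implementation in [32].

```python
TRAP, TAU0 = "nu", "tau0"  # hypothetical names for the trap state and pseudo action

def build_trap_mdp(p_E, H):
    """Step (i): drop states in H and redirect their incoming mass to the trap.

    p_E maps (s, u) to a dict {s_next: probability}.
    """
    p_trap = {(TRAP, TAU0): {TRAP: 1.0}}
    for (s, u), successors in p_E.items():
        if s in H:
            continue
        row = {}
        for s_next, prob in successors.items():
            target = TRAP if s_next in H else s_next
            row[target] = row.get(target, 0.0) + prob
        p_trap[(s, u)] = row
    return p_trap

def accepting_mecs(mecs, I):
    """Step (iii): keep MECs that avoid the trap state and intersect I."""
    return [(states, actions) for states, actions in mecs
            if TRAP not in states and states & I]

# Hypothetical product: action "u" at state "a" may enter the bad state "h".
p_E = {("a", "u"): {"b": 0.5, "h": 0.5}, ("a", "v"): {"b": 1.0},
       ("b", "u"): {"a": 1.0}, ("h", "u"): {"a": 1.0}}
p_trap = build_trap_mdp(p_E, H={"h"})
# MECs of the trap MDP, written down by hand for this toy model.
mecs = [({TRAP}, {TRAP: {TAU0}}), ({"a", "b"}, {"a": {"v"}, "b": {"u"}})]
acc = accepting_mecs(mecs, I={"b"})
```

In this toy model the only accepting MEC keeps action `v` at state `a`, since action `u` would leave the component toward the trap state with positive probability.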

Remark 2.

A single state with a self-transition can be a MEC with a proper action set. Therefore, there exist at most $|S^{\prime}|$ MECs within $\mathcal{Z}_{i}^{\neg H}$ , $\forall i=1,\cdots,N$ . Thus Step (ii) above has complexity $\mathcal{O}(|S^{\prime}|^{2})$ , as shown in Lemma 10.126 of [4], while Steps (i) and (iii) have complexity linear in $|S^{\prime}|$ . $\blacksquare$

IV-B Plan Prefix and Suffix Synthesis

Given the complete set of AMECs $\Xi_{acc}$ of $\mathcal{P}$ , in this section we show how to synthesize the control policy to drive the system towards $\Xi_{acc}$ and furthermore remain inside $\Xi_{acc}$ while satisfying the accepting condition. As mentioned in Section I, most related work [21, 4, 16, 17] focuses on maximizing the probability of reaching the union of AMECs, i.e., $\cup_{(S^{\prime}_{c},\,U^{\prime}_{c})\in\Xi_{acc}}S^{\prime}_{c}$ , where dynamic programming techniques, such as value or policy iteration, can be applied to obtain the optimal policy. Furthermore, once the system enters any AMEC, e.g., $(S^{\prime}_{c},\,U^{\prime}_{c})\in\Xi_{acc}$ , it has probability $1$ of staying within $S^{\prime}_{c}$ by following $U^{\prime}_{c}$ (see Lemma 10.119 of [4]). The Round-Robin policy is adopted in [21, 17, 4] that ensures all states in $S^{\prime}_{c}$ (including its nonempty intersection with $I_{\mathcal{P}}^{i}$ ) are visited infinitely often. As a result, the task $\varphi$ is satisfied by $\mathcal{P}$ under this policy with the maximal probability.

The above solutions may suffice for verification problems that do not optimize cost or for tasks with trivial accepting conditions. However, for the purposes of plan synthesis and for general tasks, it is of practical interest to simultaneously satisfy the probability of reaching all the AMECs as well as optimize the mean cost of staying within any AMEC and fulfilling the accepting condition. Moreover, when no AECs can be found, instead of simply reporting failure, it is important to obtain a relaxed policy that guarantees high probability of satisfying the task over long time intervals thus minimizing the frequency of encountering bad events. In what follows we present a policy synthesis algorithm that consists of four parts:

• the plan prefix that drives the system from the initial state to all AMECs, while minimizing the expected cost and respecting the risk constraint; see Section IV-B1;

• the plan suffix that keeps the system within the AMEC it has reached, while satisfying the accepting condition and optimizing the expected suffix cost; see Section IV-B2;

• the relaxed prefix and suffix plans for the case where no AECs of $\mathcal{P}$ can be found; see Section IV-B3; and

• the complete finite-memory policy for the original MDP $\mathcal{M}$ ; see Section IV-C1.

Before stating the solution, we introduce a partition of $S$ given the initial state $s_{0}$ and the set of AMECs $\Xi_{acc}$ . Let $S_{r}\subseteq S$ be the set of states within $S$ that can be reached from $s_{0}$ , which can be derived via a simple graph search in $\mathcal{P}$ .

Definition 3.

Given $s_{0}$ and $\Xi_{acc}$ , $S$ is partitioned as $S=S_{o}\cup S_{c}\cup S_{d}\cup S_{n}$ , where $S_{o}\triangleq S\backslash S_{r}$ is the set of states that can not be reached from $s_{0}$ ; $S_{c}$ is the union of all goal states in $\Xi_{acc}$ , i.e., $S_{c}\triangleq\cup_{(S^{\prime}_{c},\,U^{\prime}_{c})\in\Xi_{acc}}S_{c}^{\prime}$ ; $S_{d}\subseteq S_{r}$ can be reached from $s_{0}$ but can not reach any state in $S_{c}$ ; and $S_{n}\triangleq S_{r}\backslash(S_{c}\cup S_{d})$ . $\blacksquare$

The set $S_{d}$ can be derived through a simple graph search, e.g., by reversing the directed graph associated with $\mathcal{P}$ , finding all nodes reachable from any state within each $(S^{\prime}_{c},\,U^{\prime}_{c})\in\Xi_{acc}$ (as any AMEC is strongly connected) and finally removing them from $S_{r}$ ; see [32] for implementation details. Roughly speaking, $S_{n}$ is the set of states related to the plan prefix, $S_{c}$ is the set of goal states related to the plan suffix, and $S_{d}$ is the set of bad states to be avoided during the prefix. Since $S_{o}$ contains the states that can not be reached from $s_{0}$ , it is neglected hereafter for the purpose of plan synthesis.

Example 1.

This example illustrates the partition in Definition 3. Consider the toy product automaton $\mathcal{P}$ in Figure 2. For state $s_{0}$ , the set of reachable states is $S_{r}=\{s_{0},s_{1},s_{2},s_{3},s_{5},s_{6},s_{7},s_{8},s_{10}\}$ , the set of unreachable states is $S_{o}=\{s_{4},s_{9}\}$ , the states within an AMEC are $S^{\prime}_{c_{1}}=\{s_{5},s_{6},s_{10}\}$ and another AMEC $S^{\prime}_{c_{2}}=\{s_{7},s_{8}\}$ , thus $S_{c}=S^{\prime}_{c_{1}}\cup S^{\prime}_{c_{2}}=\{s_{5},s_{6},s_{7},s_{8},s_{10}\}$ , the states that can be reached from $s_{0}$ but can not reach $S_{c}$ are $S_{d}=\{s_{1},s_{3}\}$ , and the states that $s_{0}$ can reach outside $S_{c}\cup S_{d}$ are $S_{n}=\{s_{0},s_{2}\}$ . $\blacksquare$
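The partition of Definition 3 amounts to one forward and one backward graph search. The sketch below reproduces the sets of Example 1 on a hypothetical edge list; the actual transitions of Figure 2 are not available here, so the edges were chosen only to be consistent with the sets stated in the example. States are written as integers, e.g. `5` for $s_{5}$.

```python
from collections import deque

def reachable(graph, sources):
    """Breadth-first search: all nodes reachable from any node in sources."""
    seen, queue = set(sources), deque(sources)
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

def partition(graph, s0, S_c):
    """Return (S_o, S_c, S_d, S_n) as in Definition 3."""
    states = set(graph) | {t for succs in graph.values() for t in succs}
    S_r = reachable(graph, [s0])
    reverse = {}
    for s, succs in graph.items():
        for t in succs:
            reverse.setdefault(t, set()).add(s)
    can_reach_goal = reachable(reverse, list(S_c))  # backward search from S_c
    S_d = S_r - can_reach_goal
    S_n = S_r - (S_c | S_d)
    return states - S_r, S_c, S_d, S_n

# Hypothetical edges consistent with the sets in Example 1.
graph = {0: [1, 2], 1: [3], 3: [1], 2: [5, 7], 5: [6], 6: [10], 10: [5],
         7: [8], 8: [7], 4: [0], 9: [4]}
S_o, S_c, S_d, S_n = partition(graph, 0, {5, 6, 7, 8, 10})
```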

IV-B1 Plan Prefix

Similar to [17, 18], we first construct a modified sub-MDP $\mathcal{Z}_{\texttt{pre}}$ of $\mathcal{P}$ as $\mathcal{Z}_{\texttt{pre}}\triangleq(S_{\texttt{p}},\,U_{\texttt{p}},\,E_{\texttt{p}},\,s_{0},\,p_{\texttt{p}},\,c_{\texttt{p}})$ , where the set of states is given by $S_{\texttt{p}}=S_{n}\cup S_{c}$ with $S_{n},S_{c}$ being defined in Definition 3. The set of actions is given by $U_{\texttt{p}}=U\cup\{\tau_{0}\}$ where $\tau_{0}$ is a self-loop action. The set of transitions $E_{\texttt{p}}$ is the subset of $E$ associated with $S_{\texttt{p}}$ . Moreover, the transition probability $p_{\texttt{p}}$ is defined by (i) $p_{\texttt{p}}(s,u,\check{s})=p_{E}(s,u,\check{s})$ , $\forall s,\check{s}\in S_{\texttt{p}}$ where $s\notin S_{c}$ and $\forall u\in U(s)$ ; and (ii) $p_{\texttt{p}}(s,\tau_{0},s)=1$ , $\forall s\in S_{c}$ . Finally, the cost function $c_{\texttt{p}}$ is defined by (i) $c_{\texttt{p}}(s,u)=c_{E}(s,u)$ , $\forall s\in S_{n}$ and $\forall u\in U(s)$ ; and (ii) $c_{\texttt{p}}(s,\tau_{0})=0$ , $\forall s\in S_{c}$ .

Then, we find a policy for $\mathcal{Z}_{\texttt{pre}}$ such that, starting from $s_{0}$ , it can reach the set of goal states $S_{c}$ with a probability of at least $1-\gamma$ , while at the same time minimizing the expected total cost. Formally, consider the problem below:

Problem 2.

Given the sub-MDP $\mathcal{Z}_{\texttt{pre}}$ , compute an optimal stationary prefix policy $\boldsymbol{\pi}^{\star}_{\texttt{pre}}\in\overline{\boldsymbol{\pi}}$ that solves the problem

[TABLE]

where $s_{0}u_{0}s_{1}u_{1}\cdots$ is a run of $\mathcal{Z}_{\texttt{pre}}$ , $\overline{\boldsymbol{\pi}}$ is the set of all stationary policies, the objective function is the expected total cost, ${Pr}_{s_{0}}^{\boldsymbol{\pi}}(\Diamond S_{c})$ is the probability of reaching $S_{c}$ from the initial state $s_{0}$ , under the policy $\boldsymbol{\pi}$ ; and $\gamma\geq 0$ is from (4). $\blacksquare$

Note that the objective function in (7) is well-defined and finite due to the fact that $\mathcal{Z}_{\texttt{pre}}$ is transient with respect to $S_{n}$ , and is equal to the expected total cost of reaching $S_{c}$ since the cost of staying within $S_{c}$ is zero. We omit the proof that $\mathcal{Z}_{\texttt{pre}}$ is transient here and refer the interested readers to [2, 14]. Our proposed solution to Problem 2 is based on transforming it into a constrained optimization problem for MDPs, which can then be solved using linear programming. The approach is inspired by [16, 17, 14]. Particularly, denote by $y_{s,u}$ the expected number of times over the infinite horizon that the system is at state $s$ and action $u$ is taken, $\forall s\in S_{n}$ and $\forall u\in U(s)$ , which are often referred to as occupancy measures [14] as it holds $y_{s,u}=\sum_{t=0}^{\infty}{Pr}_{s_{0}}^{\boldsymbol{\pi}}[s_{t}=s,\,u_{t}=u]$ , where the probability is conditioned on a policy $\boldsymbol{\pi}$ and the initial state $s_{0}$ . Note that an occupancy measure is a sum of probabilities, but not a probability itself. Consider the linear program:

[TABLE]

where $\sum_{(s,u)}\triangleq\sum_{s\in S_{n}}\sum_{u\in U(s)}$ , the indicator function $\mathbbm{1}(\check{s}=s_{0})=1$ if $\check{s}=s_{0}$ and $\mathbbm{1}(\check{s}=s_{0})=0$ , otherwise. Denote by $\text{C}_{\texttt{pre}}(S_{c})$ the objective function associated with $S_{c}$ . Let the solution of (8) be $y_{\texttt{pre}}^{\star}=\{y_{s,u}^{\star},\,s\in S_{n},\,u\in U(s)\}$ . Then the optimal stationary policy for the plan prefix, denoted by $\boldsymbol{\pi}^{\star}_{\texttt{pre}}$ , can be derived as follows: the probability of choosing action $u$ at state $s$ equals $\boldsymbol{\pi}_{\texttt{pre}}^{\star}(s,\,u)=y^{\star}_{s,u}/(\sum_{u\in U(s)}y^{\star}_{s,u})$ if $\sum_{u\in U(s)}y^{\star}_{s,u}\neq 0$ ; otherwise, the action at $s$ can be chosen randomly, $\forall s\in S_{n}$ .
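The normalization that turns occupancy measures into a stationary policy can be sketched as below. The values in `y_star` are hypothetical and do not come from solving (8); the uniform fallback is one possible reading of choosing the action randomly and is an assumption of this sketch.

```python
def policy_from_occupancy(y, actions):
    """Normalize occupancy measures y[(s, u)] into action probabilities.

    States with zero total occupancy get a uniform choice over their actions
    (an assumption of this sketch).
    """
    policy = {}
    for s, available in actions.items():
        total = sum(y.get((s, u), 0.0) for u in available)
        for u in available:
            if total > 0.0:
                policy[(s, u)] = y.get((s, u), 0.0) / total
            else:
                policy[(s, u)] = 1.0 / len(available)
    return policy

# Hypothetical occupancy measures over two prefix states.
y_star = {("s0", "a"): 0.75, ("s0", "b"): 0.25,
          ("s2", "a"): 0.0, ("s2", "b"): 0.0}
actions = {"s0": ["a", "b"], "s2": ["a", "b"]}
pi_pre = policy_from_occupancy(y_star, actions)
```

States with zero occupancy are never visited under the optimal solution, so the choice made there does not affect the cost or the reachability probability.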

Lemma 1.

Given an optimal solution $y_{\texttt{pre}}^{\star}$ of (8), the associated policy $\boldsymbol{\pi}^{\star}_{\texttt{pre}}$ ensures that ${Pr}_{s_{0}}^{\boldsymbol{\pi}^{\star}}(\Diamond S_{c})\geq 1-\gamma$ .

Proof.

First, $y_{s,u}$ is finite and well-defined since $\mathcal{Z}_{\texttt{pre}}$ is transient with respect to $S_{n}$ . The second part of the proof is similar to Lemma 3.3 of [16]. The summation $\sum_{(s,u)}\sum_{\check{s}\in S_{c}}y_{s,u}\,p_{\texttt{p}}(s,u,\check{s})$ is the expected number of times that $\mathcal{Z}_{\texttt{pre}}$ transitions from any state in $S_{n}$ into $S_{c}$ for the first time, under policy $\boldsymbol{\pi}^{\star}_{\texttt{pre}}$ from the initial state $s_{0}$ . Since the system remains within $S_{c}$ once it enters $S_{c}$ , the summation equals the probability of eventually reaching the set $S_{c}$ , which is lower-bounded by $1-\gamma$ . This completes the proof. ∎
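The identity used in the proof, namely that the occupancy-weighted flow into $S_{c}$ equals the probability of reaching $S_{c}$, can be checked numerically on a tiny model. The one-state prefix MDP below is hypothetical, and the occupancy measures are approximated by summing stage-wise probabilities over a long but finite horizon rather than by solving (8).

```python
def occupancy_measures(p, policy, s0, S_n, iterations=2000):
    """Approximate y[(s, u)] = sum_t Pr[s_t = s, u_t = u] for states in S_n.

    p maps (s, u) to {s_next: prob}; policy maps (s, u) to a probability.
    """
    dist = {s0: 1.0}  # distribution over S_n at the current stage
    y = {key: 0.0 for key in policy}
    for _ in range(iterations):
        nxt = {}
        for (s, u), prob_u in policy.items():
            mass = dist.get(s, 0.0) * prob_u
            y[(s, u)] += mass
            for s_next, prob in p[(s, u)].items():
                if s_next in S_n:
                    nxt[s_next] = nxt.get(s_next, 0.0) + mass * prob
        dist = nxt
    return y

def reach_probability(y, p, S_c):
    """Sum of y[(s, u)] * p(s, u, s_next) over all s_next in S_c."""
    return sum(y_su * prob for (s, u), y_su in y.items()
               for s_next, prob in p[(s, u)].items() if s_next in S_c)

# Hypothetical prefix: action "a" reaches the goal g w.p. 0.5, stays in s0
# w.p. 0.3, and the remaining 0.2 is lost to bad states outside S_p.
p = {("s0", "a"): {"g": 0.5, "s0": 0.3}}
policy = {("s0", "a"): 1.0}
y = occupancy_measures(p, policy, "s0", S_n={"s0"})
prob = reach_probability(y, p, S_c={"g"})
```

Here the occupancy measure of the single state-action pair is the geometric sum $1/(1-0.3)$, and multiplying by $0.5$ gives $5/7$, which matches the direct computation $0.5/(0.5+0.2)$ of the reachability probability.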

Example 2.

This example illustrates the important role of $\gamma$ in the trade-off between reducing the expected total cost and minimizing the risk in Problem 2. Consider the unicycle robot with action primitives illustrated in Figure 1 and defined in Section V. The robot moves within partitioned cells as shown in Figure 3, where the red cell has probability $0.9$ to be occupied by an obstacle. Consider the task: $\varphi=(\Diamond\square\texttt{b})\wedge(\square\neg\texttt{obs})$ , i.e., to reach the yellow base without crossing any obstacle. In what follows, we solve (8) under risk factors $\gamma=0$ and $\gamma=0.4$ to derive two different optimal policies. Figure 3 shows a shorter trajectory with a lower expected total cost of about $12.6$ when a larger risk is allowed, compared with the right trajectory that completely avoids colliding with the obstacle, but with a much higher total cost of about $33.7$ . $\blacksquare$

IV-B2 Plan Suffix with AMECs

In this section, we present an algorithm to synthesize the plan suffix that minimizes the mean total cost within the AMECs, while ensuring that the system trajectory satisfies the accepting condition of $\mathcal{P}$ . Note that the plan prefix $\boldsymbol{\pi}^{\star}_{\texttt{pre}}$ from the previous section guarantees that the system enters $S_{c}$ from $s_{0}$ with probability at least $1-\gamma$ . Recall also that $S_{c}=\cup_{(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}}S_{c}^{\prime}$ . Thus it is possible that the system enters any set $S_{c}^{\prime}$ within $\Xi_{acc}$ . For this reason, we propose to treat each AMEC $(S^{\prime}_{c},\,U^{\prime}_{c})\in\Xi_{acc}$ separately, as each $S_{c}^{\prime}$ is associated with different $U^{\prime}_{c}$ and thus a different accepting condition for $S^{\prime}_{c}\cap I_{\mathcal{P}}^{i}$ . Specifically, consider any AMEC $(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}$ and let $I_{c}^{\prime}\triangleq S^{\prime}_{c}\cap I_{\mathcal{P}}^{i}$ , which is nonempty by the definition of an AMEC.

Once the system enters any AMEC, most related work [21, 17, 4] adopts the Round-Robin policy defined below:

Definition 4.

For each state $s_{t}\in S^{\prime}_{c}$ , create any ordered sequence of actions from $U_{c}^{\prime}(s_{t})$ , denoted by $\overline{U}(s_{t})$ and its infinite repetition by $\overline{U}^{\omega}(s_{t})$ . Then at any stage $t>0$ , whenever the system reaches $s_{t}\in S_{c}^{\prime}$ , the Round-Robin policy instructs the system to take the next action in $\overline{U}^{\omega}(s_{t})$ , starting from the first action in $\overline{U}^{\omega}(s_{t})$ . $\blacksquare$

Namely, once the system enters $S^{\prime}_{c}$ , the Round-Robin policy iterates over the allowed actions for each state, which in turn ensures that all states in $S^{\prime}_{c}$ (which include $I^{\prime}_{c}$ ) are visited infinitely often. Details can be found in Lemma 10.119 in [4].
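A minimal sketch of the Round-Robin policy of Definition 4 is given below. The class name `RoundRobinPolicy` and the two-state AMEC are hypothetical; the behavior follows the definition, i.e., each state keeps its own position in a repeating action sequence.

```python
from itertools import cycle

class RoundRobinPolicy:
    """Cycle through the allowed actions of each state, one per visit."""

    def __init__(self, allowed_actions):
        # allowed_actions maps each state of the AMEC to an ordered action list.
        self._cycles = {s: cycle(acts) for s, acts in allowed_actions.items()}

    def next_action(self, state):
        """Return the next action in the infinite repetition for this state."""
        return next(self._cycles[state])

# Hypothetical AMEC with two states.
rr = RoundRobinPolicy({"s5": ["u1", "u2"], "s6": ["u1"]})
visits = [rr.next_action(s) for s in ["s5", "s6", "s5", "s5", "s6"]]
```

The policy needs memory (the position in each sequence) but ignores costs entirely, which is why its suffix behavior can be far from cost-optimal.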

Definition 5.

An accepting cyclic path of $\mathcal{P}$ , associated with $S_{c}^{\prime}$ and $I_{c}^{\prime}$ , is a finite path that starts from any state $s_{f}\in I_{c}^{\prime}$ and ends in any state $s_{g}\in I_{c}^{\prime}$ , while remaining within $S_{c}^{\prime}$ . $\blacksquare$

Note that an accepting cyclic path does not necessarily start and end at the same state in $I_{c}^{\prime}$ . Furthermore, we can define the mean cyclic cost of $\mathcal{P}$ under a stationary policy.

Definition 6.

The total cost of a cyclic path $P_{a}=s_{0}u_{0}s_{1}u_{1}\cdots s_{N_{a}}u_{N_{a}}$ is defined as

[TABLE]

where $N_{a}\geq 1$ is the length of the path and $s_{0},s_{N_{a}}\in I_{c}^{\prime}$ . Then its mean total cost is defined as $\textbf{C}_{\texttt{suf}}(P_{a})\triangleq\frac{1}{N_{a}}\overline{\textbf{C}}_{\texttt{suf}}(P_{a})$ . $\blacksquare$

Problem 3.

Find a stationary suffix policy $\boldsymbol{\pi}^{\star}_{\texttt{suf}}$ for $\mathcal{P}$ within $S_{c}^{\prime}$ that minimizes the mean cyclic cost

[TABLE]

where $\mathbf{P}_{a}$ is the set of all accepting cyclic paths associated with the AMEC $(S^{\prime}_{c},U^{\prime}_{c})$ . $\blacksquare$

Inspired by [33, 24, 20], we formulate a Linear Program to solve the mean-payoff optimization problem. First, we construct a modified sub-MDP $\mathcal{Z}_{\texttt{suf}}$ of $\mathcal{P}$ over $S_{c}^{\prime}$ by splitting $I_{c}^{\prime}$ into two virtual copies: $I_{\texttt{in}}$ which only has incoming transitions into $I_{c}^{\prime}$ and $I_{\texttt{out}}$ that has only outgoing transitions from $I_{c}^{\prime}$ . Formally, we define $\mathcal{Z}_{\texttt{suf}}\triangleq(S_{\texttt{e}},\,U_{\texttt{e}},\,E_{\texttt{e}},\,y_{0},\,p_{\texttt{e}},\,c_{\texttt{e}})$ , where the set of states is $S_{\texttt{e}}=(S_{c}^{\prime}\backslash I_{c}^{\prime})\cup I_{\texttt{in}}\cup I_{\texttt{out}}$ with $I_{\texttt{in}}=\{s_{f}^{\texttt{in}},\,\forall s_{f}\in I_{c}^{\prime}\}$ and $I_{\texttt{out}}=\{s_{f}^{\texttt{out}},\,\forall s_{f}\in I_{c}^{\prime}\}$ the virtual copies of $I_{c}^{\prime}$ . The set of control actions is $U_{\texttt{e}}=U\cup\{\tau_{0}\}$ , where $\tau_{0}$ is a self-loop action. The set of state-action pairs $E_{\texttt{e}}\subset S_{\texttt{e}}\times U_{\texttt{e}}$ is defined by (i) $(s,u)\in E_{\texttt{e}}$ , $\forall s\in S_{c}^{\prime}\backslash I_{c}^{\prime}$ and $u\in U_{c}^{\prime}(s)$ ; (ii) $(s,\tau_{0})\in E_{\texttt{e}}$ , $\forall s\in I_{\texttt{in}}$ ; and (iii) $(s_{f}^{\texttt{out}},u)\in E_{\texttt{e}}$ , $\forall s_{f}\in I_{c}^{\prime}$ and $u\in U_{c}^{\prime}(s_{f})$ . Moreover, $y_{0}$ is the initial distribution of all states in $S_{c}^{\prime}$ that can be reached by taking a transition from states in $S_{n}$ , defined by

[TABLE]

where $\{y_{\texttt{pre}}(s,u)\}$ are the variables of (8). Furthermore, the transition probability $p_{\texttt{e}}$ is defined in five cases below: (a) for transitions within $S_{c}^{\prime}\backslash I_{c}^{\prime}$ , it holds that $p_{\texttt{e}}(s,u,\check{s})=p_{E}(s,u,\check{s})$ , $\forall s,\check{s}\in S_{c}^{\prime}\backslash I_{c}^{\prime}$ , $\forall u\in U_{\texttt{e}}(s)$ ; (b) for transitions originated from $I_{\texttt{out}}$ , it holds that $p_{\texttt{e}}(s_{f}^{\texttt{out}},u,\check{s})=p_{E}(s_{f},u,\check{s})$ , $\forall s_{f}^{\texttt{out}}\in I_{\texttt{out}}$ , $\forall u\in U_{\texttt{e}}(s_{f}^{\texttt{out}})$ and $\forall\check{s}\in S_{c}^{\prime}\backslash I_{c}^{\prime}$ ; (c) for transitions into $I_{\texttt{in}}$ , it holds that $p_{\texttt{e}}(s,u,s_{f}^{\texttt{in}})=p_{E}(s,u,s_{f})$ , $\forall s\in S_{c}^{\prime}\backslash I_{c}^{\prime}$ , $\forall u\in U_{\texttt{e}}(s)$ and $\forall s_{f}^{\texttt{in}}\in I_{\texttt{in}}$ ; (d) for transitions from $I_{\texttt{out}}$ to $I_{\texttt{in}}$ , it holds that $p_{\texttt{e}}(s_{f}^{\texttt{out}},u,s_{f}^{\texttt{in}})=p_{E}(s_{f},u,s_{f})$ , $\forall s_{f}^{\texttt{out}}\in I_{\texttt{out}}$ and $\forall u\in U_{\texttt{e}}(s_{f}^{\texttt{out}})$ ; and (e) for transitions within $I_{\texttt{in}}$ , $p_{\texttt{e}}(s_{f}^{\texttt{in}},\tau_{0},s_{f}^{\texttt{in}})=1$ , $\forall s_{f}^{\texttt{in}}\in I_{\texttt{in}}$ . Lastly, the cost function satisfies $c_{\texttt{e}}(s,u)=c_{E}(s,u)$ , $\forall s\in(S_{\texttt{e}}\backslash I_{\texttt{in}})$ , $\forall u\in U_{\texttt{e}}(s)$ , and $c_{\texttt{e}}(s_{f}^{\texttt{in}},\tau_{0})=0$ , $\forall s_{f}^{\texttt{in}}\in I_{\texttt{in}}$ .
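The state-splitting construction of $\mathcal{Z}_{\texttt{suf}}$ can be sketched as follows. The representation of the virtual copies as tuples `(s, "in")` and `(s, "out")`, the action name `tau0`, and the toy AMEC are all hypothetical. The sketch also routes every transition between two states of $I_{c}^{\prime}$ from the `out` copy of the source to the `in` copy of the target, which is an assumed reading of case (d).

```python
def split_suffix_mdp(p_E, S_c, I_c, U_c):
    """Build the transition map p_e of Z_suf for one AMEC (S_c, U_c).

    States in I_c are replaced by copies (s, "out") and (s, "in"); all other
    states keep their names. p_E maps (s, u) to {s_next: prob}.
    """
    TAU0 = "tau0"  # hypothetical name for the self-loop action

    def src(s):  # outgoing transitions leave from the "out" copy
        return (s, "out") if s in I_c else s

    def dst(s):  # incoming transitions arrive at the "in" copy
        return (s, "in") if s in I_c else s

    p_e = {}
    for s in S_c:
        for u in U_c[s]:
            p_e[(src(s), u)] = {dst(t): prob
                                for t, prob in p_E[(s, u)].items() if t in S_c}
    for s in I_c:
        p_e[((s, "in"), TAU0)] = {(s, "in"): 1.0}  # absorbing self-loop
    return p_e

# Hypothetical AMEC with states a, b, where b must be visited infinitely often.
p_E = {("a", "u"): {"a": 0.4, "b": 0.6}, ("b", "u"): {"a": 1.0}}
p_e = split_suffix_mdp(p_E, S_c={"a", "b"}, I_c={"b"},
                       U_c={"a": ["u"], "b": ["u"]})
```

After the split, a run starting at an `out` copy and absorbed at an `in` copy corresponds to one accepting cyclic path in the sense of Definition 5.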

Remark 3.

The initial distribution $y_{0}$ of $\mathcal{Z}_{\texttt{suf}}$ indicates how likely it is that the system controlled by the plan prefix $\boldsymbol{\pi}^{\star}_{\texttt{pre}}$ will enter the AMEC $(S^{\prime}_{c},\,U_{c}^{\prime})$ via each state inside $S_{c}^{\prime}$ . $\blacksquare$

Let also $S_{\texttt{e}}^{\prime}\triangleq S_{\texttt{e}}\backslash I_{\texttt{in}}$ and denote by $z_{s,u}$ the long-run frequency with which the system is at state $s$ and the action $u$ is applied, $\forall s\in S_{\texttt{e}}^{\prime}$ and $\forall u\in U_{\texttt{e}}(s)$ . Then, we can formulate the following linear program to solve Problem 3:

[TABLE]

where $\sum_{(s,u)}\triangleq\sum_{s\in S_{\texttt{e}}^{\prime}}\sum_{u\in U_{\texttt{e}}(s)}$ , the first constraint ensures that $I_{\texttt{in}}$ is eventually reached, while the second constraint balances the incoming and outgoing flow at each state. Let its solution be $z_{\texttt{suf}}^{\star}=\{z_{s,u}^{\star},\,\forall s\in S_{\texttt{e}}^{\prime},\,\forall u\in U_{\texttt{e}}(s)\}$ . Then, the optimal stationary policy for the plan suffix, denoted by $\boldsymbol{\pi}^{\star}_{\texttt{suf}}$ , can be derived as follows: the probability of choosing action $u$ at state $s$ equals $\boldsymbol{\pi}_{\texttt{suf}}^{\star}(s,u)=z^{\star}_{s,u}/(\sum_{u\in U_{\texttt{e}}(s)}z^{\star}_{s,u})$ if $\sum_{u\in U_{\texttt{e}}(s)}z^{\star}_{s,u}\neq 0$ ; otherwise the action at $s$ is chosen randomly, $\forall s\in S_{\texttt{e}}^{\prime}$ . Note that $\boldsymbol{\pi}_{\texttt{suf}}^{\star}(s_{f},u)=\boldsymbol{\pi}_{\texttt{suf}}^{\star}(s_{f}^{\texttt{out}},u)$ , $\forall s_{f}\in I_{c}^{\prime}$ and $\forall u\in U_{c}^{\prime}(s_{f})$ . Namely, once the system reaches any state $s_{g}\in I_{c}^{\prime}$ , the control policy at $s_{g}$ will be the control policy for $s_{g}^{\texttt{out}}\in I_{\texttt{out}}$ , according to the solution of (11).

Remark 4.

The initial distribution is derived from (8), instead of being arbitrarily set as in [25]; moreover, (11b) ensures that only $I_{c}^{\prime}$ is intersected infinitely often, instead of enforcing that all states in the set $S_{c}^{\prime}$ are visited infinitely often as in [25]. $\blacksquare$

Lemma 2.

If (11) has a solution, then the plan suffix $\boldsymbol{\pi}^{\star}_{\texttt{suf}}$ solves Problem 3 for the chosen AMEC $(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}$ .

Proof.

First, by Definition 5, the objective in (11) equals the mean cyclic cost of all accepting cyclic paths for $I_{c}^{\prime}$ . Moreover, by the definition of an AMEC, any path remains within $S_{\texttt{e}}^{\prime}$ by choosing only actions within $U_{c}^{\prime}(s)$ at each state $s\in S_{\texttt{e}}^{\prime}$ . ∎

Lemma 3.

Let $\boldsymbol{\tau}_{\mathcal{P}}$ be the set of all accepting runs of $\mathcal{P}$ that enter $S_{c}^{\prime}$ after a finite number of steps. If $\tau_{\mathcal{P}}\in\boldsymbol{\tau}_{\mathcal{P}}$ is generated under $\boldsymbol{\pi}^{\star}_{\texttt{suf}}$ , then $\tau_{\mathcal{P}}$ satisfies the accepting condition of $\mathcal{P}$ . Moreover, the mean total cost in (2) equals the mean cyclic cost in (10), i.e., $\mathbb{E}_{\tau_{\mathcal{P}}\in\boldsymbol{\tau}_{\mathcal{P}}}\{\textbf{{Cost}}(\tau_{\mathcal{P}})\}=\textbf{{C}}_{\texttt{suf}}(S^{\prime}_{c},\,U^{\prime}_{c})$ .

Proof.

By (11), any system trajectory of $\mathcal{P}$ under $\boldsymbol{\pi}^{\star}_{\texttt{suf}}$ contains infinite occurrences of accepting cyclic paths. Since any accepting cyclic path starts from and ends in $I_{c}^{\prime}$ (which is finite), $\tau_{\mathcal{P}}$ intersects with $I_{c}^{\prime}$ infinitely often. Moreover, since any accepting cyclic path remains within $S_{c}^{\prime}$ , $\tau_{\mathcal{P}}$ remains within $S_{c}^{\prime}$ for all time after entering $S_{c}^{\prime}$ . In other words, $\tau_{\mathcal{P}}$ intersects with $H^{i}_{\mathcal{P}}$ a finite number of times before entering $S_{c}^{\prime}$ and then intersects $I^{i}_{\mathcal{P}}$ infinitely often after entering $S_{c}^{\prime}$ , which satisfies the Rabin accepting condition of $\mathcal{P}$ . To show the second part, notice that the product $\mathcal{P}$ under $\boldsymbol{\pi}_{\texttt{suf}}^{\star}$ evolves as a Markov chain and the set of all accepting cyclic paths within $S_{c}^{\prime}$ has a stationary distribution. By viewing any accepting run $\tau_{\mathcal{P}}$ as the concatenation of an infinite number of cyclic paths, the mean total cost of $\tau_{\mathcal{P}}$ defined in (2) over an infinite time horizon equals the mean cyclic cost in (10) of all cyclic paths contained in $\tau_{\mathcal{P}}$ . This result is important in showing the equivalence between Problems 1 and 3 later in Theorem 6. ∎

Example 3.

This example illustrates the difference between the plan suffix obtained by (11) and the Round-Robin policy. Consider the same robot model from Example 2 and the partitioned workspace in Figure 4. The task is to surveil three base stations in the corners, i.e., $\varphi=(\square\Diamond\texttt{b1})\wedge(\square\Diamond\texttt{b2})\wedge(\square\Diamond\texttt{b3})$ . The plan prefix is derived by solving (8) but two different plan suffixes are used: one using (11) and the other using the Round-Robin policy. Figure 4 shows the simulated trajectories under these two policies. It can be seen that the trajectory under the optimal plan suffix approximates the shortest route to cross all base stations, while the trajectory under the Round-Robin policy exhibits a rather random behavior. $\blacksquare$

IV-B3 Plan Synthesis when AECs do Not Exist

The synthesis algorithms proposed in Sections IV-B1 and IV-B2 rely on the assumption that the set of AMECs $\Xi_{acc}$ of $\mathcal{P}$ is nonempty which, however, might not hold in many scenarios. In this case, most existing techniques proposed in [4, 17, 21, 22] can not be applied. In this section, we first provide a simple example where no AECs exist, and then propose an approach to synthesize a relaxed plan prefix and suffix.

Example 4.

This example provides a robot model $\mathcal{M}$ and its task $\varphi$ for which no AECs exist in the product automaton $\mathcal{P}$ . Consider the MDP $\mathcal{M}$ in Figure 5 that transitions between two states ( $S_{1}$ , $S_{2}$ ) with probability $1$ using the action $f$ . Note that $S_{1}$ has only probability $0.01$ of being occupied by an obstacle and $S_{2}$ is the base station. The task is to surveil the base station while avoiding obstacles, i.e., $\varphi=(\square\Diamond\texttt{b})\wedge(\square\neg\texttt{obs})$ . The associated DRA is shown in Figure 5. The resulting $\mathcal{P}$ is shown in Figure 6, where the set of states $H_{i}^{\mathcal{P}}$ to avoid in the suffix is in red and the set of states $I_{i}^{\mathcal{P}}$ to intersect infinitely often is in green. No AECs exist in $\mathcal{P}$ because, by definition, an AEC $(S^{\prime},\,\{f\})$ should include all successor states that are reachable by the single action $f$ . Then, starting from any green state in $I_{i}^{\mathcal{P}}$ , the set of reachable states eventually intersects with the red states in ${H}_{i}^{\mathcal{P}}$ . $\blacksquare$
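To see numerically why the satisfaction probability vanishes in Example 4, the short Python sketch below computes the probability that no obstacle is met during the first $n$ visits to $S_{1}$ . It assumes, as a simplification that is not stated explicitly in the example, that the obstacle label at $S_{1}$ is sampled independently with probability $0.01$ at every visit; the function name `prob_no_obstacle` is hypothetical.

```python
def prob_no_obstacle(n_visits, p_obs=0.01):
    """Probability of meeting no obstacle in n_visits independent visits to S1."""
    return (1.0 - p_obs) ** n_visits


# The survival probability decays geometrically with the number of visits.
for n in (1, 10, 100, 1000):
    print(n, prob_no_obstacle(n))
```

Under this assumption the value tends to zero as $n$ grows, which is consistent with the absence of AECs, while it remains high over short horizons.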

When no AECs exist in $\mathcal{P}$ , the probability of satisfying the task under any policy is zero. However, it is still important to identify those policies that ensure high probability of avoiding bad states over long time intervals. Consequently, we propose to use an accepting SCC (ASCC) of $\mathcal{P}$ as the relaxed AMEC, due to the following lemma.

Lemma 4.

Assume there exists one infinite path of $\mathcal{P}$ that is accepting. Then, there exists at least one SCC of $\mathcal{P}$ that intersects with $I^{i}_{\mathcal{P}}$ but not with $H^{i}_{\mathcal{P}}$ , for at least one pair $(H^{i}_{\mathcal{P}},\,I^{i}_{\mathcal{P}})\in\textup{Acc}_{\mathcal{P}}$ .

Proof.

As mentioned before, an infinite path of $\mathcal{P}$ , denoted by $R_{\mathcal{P}}$ , is accepting if for at least one pair $(H^{i}_{\mathcal{P}},I^{i}_{\mathcal{P}})\in\text{Acc}_{\mathcal{P}}$ it holds that $R_{\mathcal{P}}$ intersects with all states in $H^{i}_{\mathcal{P}}$ finitely often while with $I^{i}_{\mathcal{P}}$ infinitely often. Since both $H^{i}_{\mathcal{P}}$ and $I^{i}_{\mathcal{P}}$ are finite, there exists a cyclic path $s_{k}\cdots s_{f}\cdots s_{k}$ of $\mathcal{P}$ that contains at least one $s_{f}\in I^{i}_{\mathcal{P}}$ and does not contain any state within $H^{i}_{\mathcal{P}}$ . By definition, this cyclic path is a SCC of $\mathcal{P}$ that intersects with $I^{i}_{\mathcal{P}}$ but not with $H^{i}_{\mathcal{P}}$ . This completes the proof. ∎

Denote the set of SCCs in $\mathcal{P}$ as $\Omega\triangleq\{S_{1}^{\prime},S_{2}^{\prime},\cdots,S_{C}^{\prime}\}$ , where $S_{c}^{\prime}\subseteq S$ . This set can be derived using Tarjan’s algorithm [4, 32]. Moreover, denote by $\Omega^{i}_{acc}=\{S_{c}^{\prime}\in\Omega\,|\,S_{c}^{\prime}\cap I^{i}_{\mathcal{P}}\neq\emptyset,\,S_{c}^{\prime}\cap H^{i}_{\mathcal{P}}=\emptyset\}$ the set of SCCs that satisfy the accepting conditions associated with $(H^{i}_{\mathcal{P}},\,I^{i}_{\mathcal{P}})\in\text{Acc}_{\mathcal{P}}$ . Lemma 4 ensures that $\Omega^{i}_{acc}\neq\emptyset$ for at least one pair $(H^{i}_{\mathcal{P}},\,I^{i}_{\mathcal{P}})\in\text{Acc}_{\mathcal{P}}$ . Therefore, the union $\Omega_{acc}\triangleq\cup_{i=1,\cdots,N}\,\Omega^{i}_{acc}$ is not empty.
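The construction of $\Omega_{acc}$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation referenced in [32]: the graph is given as a hypothetical successor dictionary `succ`, the Rabin pairs as a list of `(H, I)` sets, and the recursive form of Tarjan's algorithm is used for brevity, so it is only suited to small graphs.

```python
def tarjan_sccs(succ):
    """Return the SCCs of a directed graph given as {state: iterable of successors}."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in list(succ):
        if v not in index:
            visit(v)
    return sccs


def accepting_sccs(sccs, acc_pairs):
    """Keep SCCs that meet I_i and avoid H_i for at least one pair (H_i, I_i)."""
    return [c for c in sccs
            if any((c & I) and not (c & H) for H, I in acc_pairs)]


# Toy graph: {1, 2} and {3, 4} are the SCCs; state 1 is in H, states 2 and 4 in I.
succ = {1: [2], 2: [1, 3], 3: [4], 4: [3]}
omega_acc = accepting_sccs(tarjan_sccs(succ), [({1}, {2, 4})])
print(omega_acc)
```

Note that this sketch also returns singleton components without a self-loop; whether such trivial components should be filtered out depends on the definition of SCC adopted, which is an assumption left open here.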

Now the union $S_{c}\triangleq\cup_{S^{\prime}_{c}\in\Omega_{acc}}S^{\prime}_{c}$ serves as the set of states the system should enter, starting from the initial state, and then remain inside any of the ASCCs to satisfy the accepting condition. Again the first step is to formulate a Linear Program that minimizes the expected total cost of reaching $S_{c}$ from $s_{0}$ , while ensuring the risk is upper-bounded by the chosen $\gamma_{\texttt{prex}}>0$ . This can be done analogously to (8) but over $S_{n}\triangleq S\backslash S_{c}$ (the details are omitted here). Denote the objective function by $\textup{{C}}_{\texttt{prex}}(S_{c})$ , its set of variables by $\{y_{\texttt{prex}}(s,u)\}$ , and the associated relaxed plan prefix by $\boldsymbol{\pi}_{\texttt{prex}}$ . As in Section IV-B2, it is possible that the system under the policy $\boldsymbol{\pi}_{\texttt{prex}}$ can enter any ASCC in $\Omega_{acc}$ . Assume that the system enters $S_{c}^{\prime}\in\Omega_{acc}$ . Different from an AMEC $(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}$ , the action set at each state of $S_{c}^{\prime}\in\Omega_{acc}$ is not constrained. Thus, there is no guarantee that the system will stay within $S_{c}^{\prime}$ after entering it.

Therefore, the second step is to synthesize the relaxed plan suffix that keeps the system inside $S_{c}^{\prime}$ to satisfy the accepting condition with the maximal probability. Define the set $I_{c}^{\prime}=S_{c}^{\prime}\cap I^{i}_{\mathcal{P}}$ , which is not empty for an ASCC $S_{c}^{\prime}$ . Then, an accepting cyclic path of $\mathcal{P}$ associated with $I_{c}^{\prime}$ , and the cyclic cost associated with $S_{c}^{\prime}$ and $I_{c}^{\prime}$ can be defined similarly as in Definition 5. Formally, we consider the following problem:

Problem 4.

Find a control policy for $\mathcal{P}$ that minimizes the mean cyclic cost associated with the ASCC $S_{c}^{\prime}$ : $\mathbb{E}^{\boldsymbol{\pi}}_{P_{a}\in\mathbf{P}_{a}}\{\textbf{C}_{\texttt{sufx}}(P_{a})\}$ , where $\mathbf{P}_{a}$ is the set of all accepting cyclic paths associated with $S^{\prime}_{c}$ and $\textbf{C}_{\texttt{sufx}}$ is defined as in Definition 6; while at the same time maximizing the probability that the cyclic paths stay within $S_{c}^{\prime}$ . $\blacksquare$

In Problem 4, the first objective of minimizing the mean cyclic cost corresponds to minimizing the mean total cost in (4) in Problem 1. The objective of maximizing the probability of the system staying within the ASCC $S^{\prime}_{c}$ corresponds to minimizing the frequency with which the system will reach the bad states that violate the task specifications. It constitutes a relaxation of the risk constraint (4) in Problem 1. To solve Problem 4, first we construct a modified MDP $\mathcal{Z}_{\texttt{sufx}}$ over $S_{c}^{\prime}$ , which is similar to $\mathcal{Z}_{\texttt{suf}}$ in Section IV-B2. The set $I_{c}^{\prime}$ is split into two virtual copies: $I_{\texttt{in}}$ which has only incoming transitions and $I_{\texttt{out}}$ which has only outgoing transitions. Formally, we define $\mathcal{Z}_{\texttt{sufx}}=(S_{\texttt{r}},\,U_{\texttt{r}},\,E_{\texttt{r}},\,y_{0},\,p_{\texttt{r}},\,c_{\texttt{r}})$ , where the set of states is $S_{\texttt{r}}=(S_{c}^{\prime}\backslash I_{c}^{\prime})\cup I_{\texttt{in}}\cup I_{\texttt{out}}\cup\{s_{bad}\}$ , with $I_{\texttt{in}}=\{s_{f}^{\texttt{in}},\,\forall s_{f}\in I_{c}^{\prime}\}$ and $I_{\texttt{out}}=\{s_{f}^{\texttt{out}},\,\forall s_{f}\in I_{c}^{\prime}\}$ the two virtual copies of $I_{c}^{\prime}$ , and $s_{bad}$ is a virtual bad state. The set of control actions is given by $U_{\texttt{r}}=U\cup\{\tau_{0}\}$ , where $\tau_{0}$ is a self-loop action. The set of transitions is $E_{\texttt{r}}\subset S_{\texttt{r}}\times U_{\texttt{r}}$ which satisfies that (i) $(s,u)\in E_{\texttt{r}}$ , $\forall s\in S_{c}^{\prime}$ and $u\in U(s)$ ; (ii) $(s,\tau_{0})\in E_{\texttt{r}}$ , $\forall s\in I_{\texttt{in}}$ ; and (iii) $(s_{bad},\tau_{0})\in E_{\texttt{r}}$ . Moreover, $y_{0}$ is the initial distribution of states in $S_{c}^{\prime}$ based on the transition from states in $S_{n}^{\prime}$ :

[TABLE]

where $\sum_{(\check{s},u)}\triangleq\sum_{\check{s}\in S_{n}^{\prime}}\sum_{u\in U_{\texttt{p}}(\check{s})}$ and $\{y_{\texttt{prex}}(s,u)\}$ are the solution variables from the synthesis of the relaxed plan prefix, and $y_{0}(s_{bad})=0$ . Furthermore, the transition probability $p_{\texttt{r}}$ is defined in seven cases below: (a) for transitions within $S_{c}^{\prime}\backslash I_{c}^{\prime}$ , it holds that $p_{\texttt{r}}(s,u,\check{s})=p_{E}(s,u,\check{s})$ , $\forall s,\check{s}\in S_{c}^{\prime}\backslash I_{c}^{\prime}$ , $\forall u\in U_{\texttt{r}}(s)$ ; (b) for transitions originating from $I_{\texttt{out}}$ , it holds that $p_{\texttt{r}}(s_{f}^{\texttt{out}},u,\check{s})=p_{E}(s_{f},u,\check{s})$ , $\forall s_{f}^{\texttt{out}}\in I_{\texttt{out}}$ , $\forall u\in U_{\texttt{r}}(s_{f}^{\texttt{out}})$ and $\forall\check{s}\in S_{c}^{\prime}\backslash I_{c}^{\prime}$ ; (c) for transitions into $I_{\texttt{in}}$ , it holds that $p_{\texttt{r}}(s,u,s_{f}^{\texttt{in}})=p_{E}(s,u,s_{f})$ , $\forall s\in S_{c}^{\prime}\backslash I_{c}^{\prime}$ , $\forall u\in U_{\texttt{r}}(s)$ and $\forall s_{f}^{\texttt{in}}\in I_{\texttt{in}}$ ; (d) for transitions from $I_{\texttt{out}}$ to $I_{\texttt{in}}$ , it holds that $p_{\texttt{r}}(s_{f}^{\texttt{out}},u,s_{f}^{\texttt{in}})=p_{E}(s_{f},u,s_{f})$ , $\forall s_{f}^{\texttt{out}}\in I_{\texttt{out}}$ and $\forall u\in U_{\texttt{r}}(s_{f}^{\texttt{out}})$ ; (e) for transitions into the bad state $s_{bad}$ , it holds that $p_{\texttt{r}}(s,u,s_{bad})=p_{E}(s,u,\check{s})$ , $\forall s\in S_{c}^{\prime}\backslash I_{\texttt{in}}$ , $\forall\check{s}\in S\backslash S_{c}^{\prime}$ and $u\in U_{\texttt{r}}(s)$ ; (f) each state within $I_{\texttt{in}}$ is included in a self-loop such that $p_{\texttt{r}}(s_{f}^{\texttt{in}},\tau_{0},s_{f}^{\texttt{in}})=1$ , $\forall s_{f}^{\texttt{in}}\in I_{\texttt{in}}$ ; (g) the bad state is included in a self-loop such that $p_{\texttt{r}}(s_{bad},\tau_{0},s_{bad})=1$ .
Finally, the cost function $c_{\texttt{r}}$ is defined in two cases: (i) $c_{\texttt{r}}(s,u)=c_{E}(s,u)$ , $\forall s\in S_{\texttt{r}}\backslash I_{\texttt{in}}$ , $\forall u\in U_{\texttt{r}}(s)$ ; and (ii) $c_{\texttt{r}}(s_{f}^{\texttt{in}},\tau_{0})=0$ , $\forall s_{f}^{\texttt{in}}\in I_{\texttt{in}}$ and $c_{\texttt{r}}(s_{bad},\tau_{0})=0$ .
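The construction of $p_{\texttt{r}}$ and $c_{\texttt{r}}$ above can be sketched in Python as follows. The data layout (dictionaries keyed by state-action pairs), the tuple encoding of the virtual copies, and the function name are all hypothetical. In addition, the sketch accumulates into $s_{bad}$ the total probability of all successors outside $S_{c}^{\prime}$ , which is one reading of case (e) and should be treated as an assumption.

```python
TAU0 = "tau0"      # self-loop action
S_BAD = "s_bad"    # virtual bad state


def build_relaxed_suffix_mdp(p_E, c_E, S_c, I_c):
    """Return (p_r, c_r) for the modified MDP over one ASCC.

    p_E: dict (s, u) -> {successor: probability} of the product
    c_E: dict (s, u) -> cost
    S_c: set of states of the ASCC; I_c: the subset of S_c lying in I
    """
    p_r, c_r = {}, {}
    for (s, u), succs in p_E.items():
        if s not in S_c:
            continue
        # States of I_c act as sources only through their "out" copy.
        src = ("out", s) if s in I_c else s
        dist = {}
        for t, p in succs.items():
            if t not in S_c:
                tgt = S_BAD            # case (e): the transition leaves the ASCC
            elif t in I_c:
                tgt = ("in", t)        # cases (c), (d): enter the "in" copy
            else:
                tgt = t                # cases (a), (b)
            dist[tgt] = dist.get(tgt, 0.0) + p
        p_r[(src, u)] = dist
        c_r[(src, u)] = c_E[(s, u)]
    for s in I_c:                      # case (f): zero-cost self-loops on I_in
        p_r[(("in", s), TAU0)] = {("in", s): 1.0}
        c_r[(("in", s), TAU0)] = 0.0
    p_r[(S_BAD, TAU0)] = {S_BAD: 1.0}  # case (g)
    c_r[(S_BAD, TAU0)] = 0.0
    return p_r, c_r


# Toy product: "a" and "f" form the ASCC, "f" is in I, "x" lies outside.
p_E = {("a", "go"): {"f": 0.9, "x": 0.1},
       ("f", "go"): {"a": 1.0},
       ("x", "go"): {"x": 1.0}}
c_E = {("a", "go"): 1.0, ("f", "go"): 2.0, ("x", "go"): 0.0}
p_r, c_r = build_relaxed_suffix_mdp(p_E, c_E, {"a", "f"}, {"f"})
```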

Remark 5.

Note that $E_{\texttt{r}}$ contains all actions for each state in $S_{c}^{\prime}$ , compared with $E_{\texttt{e}}$ as allowed by the AMEC. $\blacksquare$

Let $S_{\texttt{r}}^{\prime}\triangleq S_{\texttt{r}}\backslash(I_{\texttt{in}}\cup\{s_{bad}\})$ and $S_{\texttt{r}}^{\prime\prime}\triangleq S_{\texttt{r}}\backslash\{s_{bad}\}$ . We can also show that $\mathcal{Z}_{\texttt{sufx}}$ above is $S_{\texttt{r}}^{\prime}$-transient. Then, to solve Problem 4, we rely on a technique proposed in [35] to deal with dead ends in Stochastic Shortest Path (SSP) problems. First we introduce a large positive penalty for reaching the bad state $s_{bad}$ , denoted by $d>0$ . Then, we modify (11) as follows: denote by $z_{s,u}$ the long-run frequency with which the system is at state $s$ and the action $u$ is taken, $\forall s\in S_{\texttt{r}}^{\prime}$ and $\forall u\in U_{\texttt{r}}(s)$ . We want to minimize the mean total cost of reaching $I_{\texttt{in}}$ from $I_{\texttt{out}}$ , while minimizing the probability of leaving $S_{\texttt{r}}^{\prime\prime}$ . In particular, we consider the following optimization:

[TABLE]

where the notation $\sum_{(\check{s},u)}\triangleq\sum_{\check{s}\in S_{\texttt{r}}^{\prime}}\sum_{u\in U_{\texttt{r}}(s)}$ , the variables satisfy that $\eta(\check{s},u,s)\triangleq z_{\check{s},u}\,p_{\texttt{r}}(\check{s},u,s)$ , $\eta(\check{s},u,s_{bad})\triangleq z_{\check{s},u}\,p_{\texttt{r}}(\check{s},u,s_{bad})$ , $\text{C}_{\texttt{sufx}}(S_{c}^{\prime},d)$ denotes the objective function as the summation of the mean cost of reaching $I_{\texttt{in}}$ and the expected penalty of reaching $s_{bad}$ . The first constraint balances the incoming and outgoing flow at each state, while the second constraint ensures that $I_{\texttt{in}}\cup\{s_{bad}\}$ are eventually reached. Let the optimal solution of (12) be $z^{\star}_{\texttt{sufx}}=\{z_{s,u}^{\star},\,s\in S_{\texttt{r}}^{\prime},u\in U_{\texttt{r}}(s)\}$ . Then, the optimal stationary policy for the relaxed plan suffix, denoted by $\boldsymbol{\pi}^{\star}_{\texttt{sufx}}$ , can be derived as follows: for states in $S_{\texttt{r}}^{\prime}$ , the optimal policy is given by $\pi_{\texttt{sufx}}^{\star}(s,u)=z^{\star}_{s,u}/(\sum_{u\in U_{\texttt{r}}(s)}z^{\star}_{s,u})$ if $\sum_{u\in U_{\texttt{r}}(s)}z^{\star}_{s,u}\neq 0$ ; otherwise the action at $s$ is chosen randomly, $\forall s\in S_{\texttt{r}}^{\prime}$ . Note that $\pi_{\texttt{sufx}}^{\star}(s_{f},u)=\pi_{\texttt{sufx}}^{\star}(s_{f}^{\texttt{out}},u)$ , $\forall s_{f}\in I_{c}^{\prime}$ and $\forall u\in U(s_{f})$ .
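The normalization that turns the optimal frequencies $z^{\star}_{s,u}$ into a stationary policy can be sketched in Python as below. The dictionary layout and the function name are hypothetical, and the fallback for unvisited states uses a uniform distribution, which is one possible reading of "chosen randomly".

```python
def extract_policy(z, actions):
    """Normalize occupancy frequencies into a stationary randomized policy.

    z: dict (s, u) -> frequency z*_{s,u}
    actions: dict s -> list of actions allowed at s
    """
    policy = {}
    for s, acts in actions.items():
        total = sum(z.get((s, u), 0.0) for u in acts)
        if total > 0:
            policy[s] = {u: z.get((s, u), 0.0) / total for u in acts}
        else:
            # State never visited under z: assume a uniform random choice.
            policy[s] = {u: 1.0 / len(acts) for u in acts}
    return policy


z = {("s", "a"): 0.3, ("s", "b"): 0.1}
pi = extract_policy(z, {"s": ["a", "b"], "t": ["a", "b"]})
print(pi)
```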

Lemma 5.

Under the relaxed plan suffix $\boldsymbol{\pi}^{\star}_{\textup{{sufx}}}$ , the probability of $\mathcal{Z}_{\texttt{sufx}}$ reaching $I_{\texttt{in}}$ from $I_{\texttt{out}}$ while staying within $S_{\texttt{r}}^{\prime\prime}$ over an infinite horizon, is lower bounded by $1-\gamma_{\texttt{sufx}}(d)$ , where ${\gamma_{\texttt{sufx}}(d)}\triangleq\sum_{\check{s}\in S_{\texttt{r}}^{\prime}}\sum_{u\in U_{\texttt{r}}(\check{s})}z^{\star}_{\texttt{sufx}}(\check{s},u)\,p_{\texttt{r}}(\check{s},u,s_{bad})$ .

Proof.

The proof is a simple inference from (12c). ∎
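For illustration, the quantity $\gamma_{\texttt{sufx}}(d)$ in Lemma 5 can be evaluated from a given solution as in the following Python sketch. The dictionaries `z_star` and `p_r` and their numerical values are hypothetical placeholders rather than outputs of (12).

```python
def gamma_sufx(z_star, p_r, s_bad="s_bad"):
    """Sum over (s, u) of z*(s, u) * p_r(s, u, s_bad)."""
    return sum(freq * p_r.get((s, u), {}).get(s_bad, 0.0)
               for (s, u), freq in z_star.items())


# Hypothetical two-state example: only ("a", "go") can reach the bad state.
p_r = {("a", "go"): {"f_in": 0.9, "s_bad": 0.1},
       ("f_out", "go"): {"a": 1.0}}
z_star = {("a", "go"): 1.0, ("f_out", "go"): 1.0}
print(gamma_sufx(z_star, p_r))
```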

Remark 6.

A lower bound can be enforced on $\gamma_{\texttt{sufx}}$ as in (8). However, this bound is hard to estimate and a large bound can render the problem infeasible. In contrast, (12) always has a solution and $\gamma_{\texttt{sufx}}(d)$ is tunable by varying $d$ . $\blacksquare$

IV-C The Complete Policy

In this section, we present how to combine the stationary plan prefix and plan suffix of $\mathcal{P}$ into the complete finite-memory policy of the original MDP $\mathcal{M}$ . Furthermore, we show how to execute this finite-memory policy online.

IV-C1 Combining the Plan Prefix and Suffix

When AMECs of $\mathcal{P}$ exist, we can combine the plan prefix synthesis and the plan suffix synthesis for each AMEC into one Linear Program:

[TABLE]

where $\textup{{C}}_{\texttt{pre}}(S_{c})$ and $\textup{{C}}_{\texttt{suf}}(S_{c}^{\prime},U_{c}^{\prime})$ are defined in (8a) and (11a), respectively, the variables $\{y_{s,u}\}$ satisfy the constraints (8b)–(8d) and (11c), and the variables $\boldsymbol{z}_{s,u}\triangleq\{z_{s,u}(S_{c}^{\prime}),\forall(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}\}$ , where $z_{s,u}(S_{c}^{\prime})$ , satisfy the constraints (11c)–(11d) for the AMEC $(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}$ . The parameter $0\leq\beta\leq 1$ captures the importance of minimizing the expected total cost of reaching $S_{c}$ versus that of staying in $S_{c}$ . Note that the initial conditions $y_{0}$ in (11c) for each state in the suffix are expressed over the variables $\{y_{s,u}\}$ . In other words, the initial conditions of each AMEC are now optimized to solve the combined objective function (13). It can be solved via any Linear Programming solver, e.g., “Gurobi” [36] and “CPLEX”. Once the optimal solutions $\{y^{\star}_{s,u}\}$ and $\boldsymbol{z}^{\star}_{s,u}$ are obtained, the optimal plan prefix $\boldsymbol{\pi}^{\star}_{\texttt{pre}}$ can be constructed as described in Section IV-B1 and the plan suffix $\boldsymbol{\pi}^{\star}_{\texttt{suf}}$ as in Section IV-B2.

On the other hand, when no AECs of $\mathcal{P}$ exist, as discussed in Section IV-B3, we can combine the relaxed plan prefix and suffix synthesis for each ASCC into one Linear Program:

[TABLE]

where $\textup{{C}}_{\texttt{prex}}(S_{c})$ and $\textup{{C}}_{\texttt{sufx}}(S_{c}^{\prime},d)$ are defined in (8a) and (12a), respectively, the variables $\{y_{s,u}\}$ satisfy the constraints (8b)–(8d), and the variables $\boldsymbol{z}_{s,u}\triangleq\{z_{s,u}(S_{c}^{\prime}),\forall S_{c}^{\prime}\in\Omega_{acc}\}$ , where $z_{s,u}(S_{c}^{\prime})$ , satisfy the constraints (12b)–(12d) for the ASCC $S_{c}^{\prime}\in\Omega_{acc}$ . The parameter $0\leq\beta\leq 1$ captures the importance of minimizing the expected total cost of reaching $S_{c}$ versus that of staying in $S_{c}$ . Similar to the previous case, the initial conditions $y_{0}$ in (12b) for each state in the ASCCs are expressed over the variables $\{y_{s,u}\}$ . Thus the initial conditions are now optimized to solve the combined objective function (14). Again, it can be solved via any Linear Programming solver. Once the optimal solutions $\{y^{\star}_{s,u}\}$ and $\boldsymbol{z}^{\star}_{s,u}$ are obtained, the optimal relaxed plan prefix $\boldsymbol{\pi}^{\star}_{\texttt{prex}}$ and relaxed plan suffix $\boldsymbol{\pi}^{\star}_{\texttt{sufx}}$ can be constructed as described in Section IV-B3.

Note that the size of both Linear Programs in (13) and (14) is linear with respect to the number of transitions in $\mathcal{P}$ and both can be solved in polynomial time [37]. Note also that the multi-objective costs introduced in (13) and (14) provide a balance between optimizing the plan prefix and suffix. Compared to only optimizing the plan suffix, i.e., for $\beta=0$ as required to solve Problems 3 and 4, slightly increasing the value of $\beta$ can lead to a significant decrease in the total cost of the plan prefix, without sacrificing much optimality in the plan suffix.

Observe that the optimal policy derived above only includes the states within $S_{n}\cup S_{c}$ . Thus no policy is specified for the bad states in $S_{d}$ . Once the system reaches any bad state, it has violated the formula $\varphi$ and cannot satisfy it anymore. Thus, it is common practice to stop the system once that happens [21, 4]. We propose here a new method that allows the system to recover from the bad state in $S_{d}$ and continue performing the task, which could be useful for partially-feasible tasks with soft constraints, as discussed in [7].

Definition 7.

The projected distance of a bad state $s_{d}=\langle x,l,q\rangle\in S_{d}$ onto $S_{c}\cup S_{n}$ via $u\in U(s_{d})$ is defined as:

[TABLE]

where $\check{s}\triangleq\langle\check{x},\check{l},\check{q}\rangle$ and the function $\texttt{D}:2^{AP}\times 2^{2^{AP}}\rightarrow\mathbb{N}$ , which returns the distance between an element $l\in 2^{AP}$ and a set $\chi\subseteq 2^{AP}$ , was first introduced in [7] and is restated below. $\blacksquare$

Simply speaking, $\kappa(s_{d},\,u)$ evaluates how much the product automaton $\mathcal{P}$ is violated on the average if the bad state $s_{d}\in S_{d}$ is projected into the set of good states $S_{c}\cup S_{n}$ using action $u\in U(s_{d})$ . Function $\texttt{D}(\ell,\,\chi)=0$ if $\ell\in\chi$ and $\texttt{D}(\ell,\,\chi)=\min_{\ell^{\prime}\in\chi}\;|\{a\in AP\,|\,a\in\ell,a\notin\ell^{\prime}\}|$ , otherwise. Namely, it returns the minimal difference between $\ell$ and any element in $\chi$ . Given $\kappa(\cdot)$ , the policy at $s_{d}\in S_{d}$ is given by

[TABLE]

which chooses the single action that minimizes (15). Combining (13), (14) and (16) provides the complete policy for $\mathcal{P}$ . The above discussions are summarized in Algorithm 1.
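The distance function $\texttt{D}(\ell,\,\chi)$ used in the projected distance can be sketched in Python as follows, following the formula stated above: zero when $\ell\in\chi$ , and otherwise the smallest number of propositions in $\ell$ that are missing from some $\ell^{\prime}\in\chi$ . The function name and the use of Python sets for labels are illustrative assumptions, and $\chi$ is assumed to be nonempty.

```python
def label_distance(ell, chi):
    """D(ell, chi): 0 if ell is in chi, else the minimum over ell' in chi
    of the number of propositions that are in ell but not in ell'."""
    ell = frozenset(ell)
    chi = [frozenset(c) for c in chi]
    if ell in chi:
        return 0
    return min(len(ell - c) for c in chi)


# The label {b1, obs} differs from the admissible label {b1} by one proposition.
print(label_distance({"b1", "obs"}, [{"b1"}]))
```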

IV-C2 Mapping ${\pi}^{\star}$ to ${\mu}^{\star}$

Lastly, we need to map the optimal stationary policy $\boldsymbol{\pi}^{\star}$ of $\mathcal{P}$ above to the optimal finite-memory policy $\boldsymbol{\mu}^{\star}$ of $\mathcal{M}$ . Starting from stage $t=0$ , the initial state $s_{0}=\langle x_{0},l_{0},q_{0}\rangle\in S_{n}$ and the optimal action to take is given by the distribution $\boldsymbol{\pi}^{\star}(s_{0})$ . Assume that $u\in U(s_{0})$ is taken. Then at stage $t=1$ , the robot observes its resulting state $x_{1}$ and the label $l_{1}$ . Thus the subsequent state in $\mathcal{P}$ is $s_{1}=\langle x_{1},l_{1},q_{1}\rangle$ , where $q_{1}=\delta(q_{0},l_{0})$ is unique as $\mathcal{A}_{\varphi}$ is deterministic. The optimal action to take now is given by the distribution $\boldsymbol{\pi}^{\star}(s_{1})$ . This process repeats itself indefinitely. Denote by $s_{t}\in S$ the reachable state at stage $t\geq 0$ which is always unique given the robot’s past sequence of states $X_{t}=x_{0}x_{1}\cdots x_{t}$ and labels $L_{t}=l_{0}l_{1}\cdots l_{t}$ . Thus the optimal policy $\boldsymbol{\mu}^{\star}$ at stage $t\geq 0$ given $X_{t}$ and $L_{t}$ is

[TABLE]

i.e., the control policy at the reachable state $s_{t}$ in $\mathcal{P}$ is the best control policy in $\mathcal{M}$ at stage $t$ , $\forall t\geq 0$ . Last but not least, if the system reaches a bad state at stage $t-1$ , i.e., $s_{t-1}\in S_{d}$ , according to policy (16) the robot will take action $u^{\star}$ and more importantly the next reachable state is set to be $s_{t}\triangleq\langle x_{t},l_{t},q^{\prime}_{t}\rangle\in(S_{c}\cup S_{n})$ , where $x_{t}$ , $l_{t}$ are the observed robot location and label at stage $t$ and $q^{\prime}_{t}\triangleq\text{argmin}_{\check{q}\in Post(q_{t-1})}\texttt{D}(l_{t-1},\chi(q_{t-1},\check{q}))$ .

Theorem 6.

Algorithm 1 solves Problem 1 if AECs of $\mathcal{P}$ exist and $\beta=0$ . Otherwise, if no AECs of $\mathcal{P}$ exist, then Problem 1 has no solution. In this case, Algorithm 1 provides a relaxed policy that minimizes the relaxed suffix cost $\text{C}_{\texttt{sufx}}(S_{c}^{\prime},d)$ defined in (12). Moreover, given any finite run $S_{T}=s_{0}s_{1}\cdots s_{T}$ of $\mathcal{P}$ under the optimal policy $\boldsymbol{\pi}^{\star}$ , the probability that $S_{T}$ does not intersect with the set of bad states $S_{d}$ for all time $t\in[0,\,T]$ is bounded as

[TABLE]

where $N_{s}\geq 0$ is the number of accepting cyclic paths contained in $S_{T}$ that depends on $T$ .

Proof.

To show the first part of this theorem, similar to Lemma 1, the constraints (8b)–(8d) ensure that the total probability of reaching the union of all AMECs is lower-bounded by $1-\gamma$ . Moreover, the first part of Lemma 3 shows that any infinite run $\tau_{\mathcal{P}}$ of $\mathcal{P}$ would satisfy $\varphi$ once it enters any AMEC $(S^{\prime}_{c},U^{\prime}_{c})\in\Xi_{acc}$ , by following the plan suffix. The fact that $\boldsymbol{\pi}^{\star}$ also minimizes the mean total cost in (4) when $\beta=0$ in (13) can be shown as follows: as discussed in [33, 24, 34], the mean payoff objective depends on how the system suffix behaves within the AMECs. The second part of Lemma 3 guarantees that the derived plan suffix $\boldsymbol{\pi}_{\texttt{suf}}^{\star}$ minimizes the mean total cost of staying within any of the AMECs, while satisfying the accepting condition.

To show the second part of the theorem, no solution to Problem 1 exists regardless of the choice of $\gamma$ , as the probability of satisfying the task is zero. Instead, when $\beta=0$ , the optimal policy $\boldsymbol{\pi}^{\star}$ obtained by Algorithm 1 minimizes the relaxed suffix cost $\text{C}_{\texttt{sufx}}(S_{c}^{\prime},d)$ . At the same time, due to the constraints in (8) that are also present in (13), the plan prefix $\boldsymbol{\pi}^{\star}_{\texttt{prex}}$ ensures that all runs stay within $S_{n}$ with at least probability $(1-\gamma_{\texttt{prex}})$ before entering any ASCC $S_{c}^{\prime}\in\Omega_{acc}$ , while the relaxed plan suffix $\boldsymbol{\pi}_{\texttt{sufx}}^{\star}$ ensures that the runs stay within $S_{c}^{\prime}$ with at least probability $(1-\gamma_{\texttt{sufx}}(d))$ for one execution of any accepting cyclic path. Consequently, if the finite run contains $N_{s}$ accepting cyclic paths, the probability of avoiding $S_{d}$ , is lower bounded by $(1-\gamma_{\texttt{prex}})\cdot(1-\gamma_{\texttt{sufx}}(d))^{N_{s}}$ . Even though this probability approaches zero as $N_{s}$ approaches infinity, this result still ensures that the frequency of visiting bad states over finite intervals is minimized. ∎
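As a numerical illustration of the bound $(1-\gamma_{\texttt{prex}})\cdot(1-\gamma_{\texttt{sufx}}(d))^{N_{s}}$ , the Python sketch below evaluates it for a few values of $N_{s}$ . The values $\gamma_{\texttt{prex}}=0.1$ and $\gamma_{\texttt{sufx}}=0.01$ are hypothetical and chosen only to show the geometric decay.

```python
def safe_run_lower_bound(gamma_prex, gamma_sufx, n_cycles):
    """Lower bound on the probability of avoiding bad states over a finite run
    that contains n_cycles accepting cyclic paths."""
    return (1.0 - gamma_prex) * (1.0 - gamma_sufx) ** n_cycles


for n_s in (0, 10, 100):
    print(n_s, safe_run_lower_bound(0.1, 0.01, n_s))
```

With these values the bound starts at $0.9$ for $N_{s}=0$ and decreases by a factor of $0.99$ with each additional accepting cyclic path.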

IV-C3 Policy Execution

Clearly, the optimal policy $\boldsymbol{\mu}^{\star}$ from (17) requires only a finite memory to save the current reachable state $s_{t}$ and the optimal policy $\boldsymbol{\pi}^{\star}$ . It is synthesized off-line once via Algorithm 1 and its online execution involves observing the current state $x_{t}$ and label $l_{t}$ , updating the reachable state $s_{t}$ , and applying the action according to $\boldsymbol{\pi}^{\star}(s_{t})$ . Details are given in Algorithm 2.
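The online execution loop described above can be sketched in Python as follows. This is not Algorithm 2 itself but a minimal illustration under stated assumptions: `policy`, `delta`, `observe` and `apply_action` are hypothetical interfaces, labels are encoded as frozensets, and the recovery step for bad states from (16) is omitted.

```python
import random


def sample_action(dist, rng=random):
    """Draw an action from a dict {action: probability}."""
    r, acc = rng.random(), 0.0
    for u, p in dist.items():
        acc += p
        if r <= acc:
            return u
    return u  # guard against rounding when probabilities sum slightly below 1


def execute_policy(policy, delta, q0, observe, apply_action, steps, rng=random):
    """Run the finite-memory policy for a fixed number of stages.

    policy: dict (x, l, q) -> {action: probability}, stationary policy on the product
    delta:  dict (q, l) -> next DRA state, with l a frozenset of propositions
    observe(): returns the current (x, l); apply_action(u): actuates the robot
    """
    q, trace = q0, []
    for _ in range(steps):
        x, l = observe()
        l = frozenset(l)
        s = (x, l, q)                 # current reachable state of the product
        u = sample_action(policy[s], rng)
        apply_action(u)
        trace.append((s, u))
        q = delta[(q, l)]             # unique successor since the DRA is deterministic
    return trace
```

Only the current DRA state `q` and the lookup table `policy` are stored between stages, which reflects the finite-memory property discussed above.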

V Simulation Results

In this section, we present simulation results to validate the scheme. All algorithms are implemented in Python 2.7 and available online [32]. All simulations are carried out on a laptop (3.06GHz Duo CPU and 8GB of RAM).

V-A Model Description

We consider a partitioned $10m\times 10m$ workspace as shown in Figure 8, where each cell is a $2m\times 2m$ area. The properties of interest are $\{\texttt{Obs},\texttt{b1},\texttt{b2},\texttt{b3},\texttt{Spl}\}$ . The properties satisfied at each cell are probabilistic: three cells at the corners satisfy b1, b2 and b3, respectively with probability one. Four cells at $(1m,5m),(5m,3m),(9m,5m),(5m,9m)$ satisfy Spl with probabilities ranging from $0.2$ to $0.8$ , modeling the likelihood that a supply appears at that particular cell. One cell at $(5m,1m)$ satisfies Obs with probability $0.7$ . Other obstacles will be described later upon different task scenarios.

The robot motion follows the unicycle model, i.e., $\dot{x}=v\cos(\theta)$ , $\dot{y}=v\sin(\theta)$ , $\dot{\theta}=\omega$ , where $p(t)=(x(t),\,y(t))\in\mathbb{R}^{2}$ , $\theta(t)\in(-\pi,\,\pi]$ are the robot’s position and orientation at time $t\geq 0$ . The control input is $u(t)=(v(t),\,\omega(t))$ and contains the linear and angular velocities. Due to actuation noise and drifting, the robot’s motion is subject to uncertainty. The action primitives and the associated uncertainties are shown in Figure 1 and described below: action “FR” means driving forward for $2m$ by setting $v(t)=v_{0}$ and $\omega(t)=0$ , $\forall t\in[0,\,2/v_{0}]$ . This action has probability $0.8$ of reaching $2m$ forward and probability $0.1$ of drifting to the left or right by $2m$ , respectively; action “BK” can be defined analogously to “FR”; action “TR” means turning right by an angle of $\pi/2$ by setting $v(t)=0$ and $\omega(t)=-\omega_{0}$ , $\forall t\in[0,\,\pi/(2\omega_{0})]$ . This action has probability $0.9$ of turning to the right by $\pi/2$ , probability $0.05$ of turning less than $\pi/4$ due to undershoot and probability $0.05$ of turning more than $3\pi/4$ due to overshoot; action “TL” can be defined analogously to “TR”; lastly, action “ST” means staying still by setting $v(t)=\omega(t)=0$ , $\forall t\in[0,T_{0}]$ where $T_{0}$ is the chosen waiting time. It has probability $1.0$ of staying where it is. The cost of each action is given by $[2,4,3,3,1]$ , respectively, where the cost of “ST” is set to $1$ as it consumes time to wait at one cell.

With the above model, we can abstract the robot state by the cell coordinate in which it belongs, namely, $(x_{c},y_{c})\in\{1,3,\cdots,9\}^{2}$ and its four possible orientations ( $N,E,S,W$ ). The transition relation and probability can be built following the description above. The resulting probabilistically-labeled MDP has $100$ states and $816$ edges.
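A small Python sketch of this abstraction is given below. It enumerates the $5\times 5\times 4=100$ abstract states and encodes the turning primitives. Mapping an undershoot to "heading unchanged" and an overshoot to "half turn" is an assumption made here to fit the four discrete orientations, and all names are hypothetical.

```python
from itertools import product

CELLS = (1, 3, 5, 7, 9)            # cell-centre coordinates in metres
HEADINGS = ("N", "E", "S", "W")    # listed clockwise
STATES = list(product(CELLS, CELLS, HEADINGS))

# Action costs in the order FR, BK, TR, TL, ST, as listed in the text.
ACTION_COST = {"FR": 2, "BK": 4, "TR": 3, "TL": 3, "ST": 1}


def turn_outcomes(heading, direction):
    """Heading distribution after TR (direction=+1) or TL (direction=-1)."""
    i = HEADINGS.index(heading)
    return {
        HEADINGS[(i + direction) % 4]: 0.9,        # nominal quarter turn
        heading: 0.05,                             # undershoot: heading unchanged (assumed)
        HEADINGS[(i + 2 * direction) % 4]: 0.05,   # overshoot: half turn (assumed)
    }


print(len(STATES), turn_outcomes("N", +1))
```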

In the sequel, we consider three different task formulas in the order of increasing complexity. We used “Gurobi” [36] to solve the Linear Programs in (13) and (14). When comparing the performance in the plan suffix, we also use the total cost in (9) as an indicator, especially when the difference in the mean total cost in (10) is too small to measure.

V-B Ordered Reachability

In this case, we show the trade-off between reducing the expected total cost and decreasing the risk factor in the plan prefix synthesis using (8). In particular, the robot needs to reach b1, b2, b3 (in this order) from the initial cell while avoiding obstacles for all time. Afterwards it should stay at b3. The LTL formula for this task is

[TABLE]

The associated DRA derived using [31] has $7$ states, $24$ transitions and $1$ accepting pair. An additional obstacle is added which has probability $0.7$ of appearing in the cell $(5m,9m)$ .

It took $10.9s$ to construct the product automaton, which has $840$ states and $7280$ transitions. Since one AMEC exists, we synthesize the optimal policy using Algorithm 1 via solving (13) under $\beta=0.5$ and different risk factors $\gamma$ chosen from $\{0,0.1,\cdots,0.4\}$ , which took on average $0.1s$ . Then, we perform $1000$ Monte Carlo simulations of $500$ time steps each, where we evaluate the total cost in (7) and whether the task is satisfied. As shown in Table I, the total cost increases when the allowed risk factor $\gamma$ is decreased. The percentage of simulated runs that avoid all obstacles and satisfy the task is approximately $(1-\gamma)$ , which is consistent with the risk constraint in Lemma 1.

V-C Surveillance

In this case, we compare the efficiency of the optimal plan suffix from Algorithm 1 and the Round-Robin policy. Particularly, the robot should visit b1, b2 and b3 infinitely often for surveillance and avoid all obstacles:

[TABLE]

The associated DRA has $8$ states, $30$ transitions, and $1$ accepting pair. It took $5.8s$ to construct the product $\mathcal{P}$ which has $700$ states, $5712$ transitions and $1$ accepting pair. Since one AMEC exists in the product, we synthesize the optimal policy using Algorithm 1 via solving (13) under $\gamma=0$ and $\beta=0.1$ , which took $0.2s$ . We conducted $1000$ Monte Carlo simulations and Figure 7 shows that the total cost from (9) of accepting cyclic paths in the plan suffix under the optimal policy is much lower than under the Round-Robin policy ( $50$ versus $400$ ). Moreover, Figure 9 shows that the average number of times each base station is visited by the robot under the optimal policy is much higher than under the Round-Robin policy.

V-D Ordered Supply-delivery

In this case, we demonstrate the reactiveness of the derived optimal policy. The robot needs to collect supplies from the cells that are marked by Spl, where supplies appear probabilistically. Then it needs to transport these supplies to each base station. Furthermore, the robot should not visit two base stations consecutively without collecting a supply first. It should always avoid obstacles. The LTL task formula is

[TABLE]

where $\varphi_{\texttt{all\_base}}=(\square\Diamond\texttt{b1})\wedge(\square\Diamond\texttt{b2})\wedge(\square\Diamond\texttt{b3})$ means that all base stations should be visited infinitely often and $\varphi_{\texttt{order}}=\square(\varphi_{\texttt{one}}\rightarrow\bigcirc((\neg\varphi_{\texttt{one}})\,\mathsf{U}\,\texttt{Spl}))$ , with $\varphi_{\texttt{one}}=(\texttt{b1}\vee\texttt{b2}\vee\texttt{b3})$ means that when one base station is visited, then no base can be visited until a supply has been collected. The associated DRA is derived using [31, 32] in $0.05s$ , which has $32$ states, $298$ transitions and $1$ accepting pair.

It took around $16s$ to construct the product automaton that has $4224$ states, $41344$ transitions and $1$ accepting pair. Since two AMECs exist in the product, we synthesize the optimal policy using Algorithm 1 via solving (13) under $\gamma=0$ and $\beta=0.1$ , which took around $0.2s$ given the complexity of task (20). Notice that the optimal plan sometimes requires the robot to wait at a cell marked by Spl by taking action “ST”, since the expected cost of traveling to another cell with supply might be higher than waiting there for the supply to appear. Figure 8 compares the simulated trajectories under the optimal policy and the Round-Robin policy. Based on $1000$ Monte Carlo simulations, the total cost of accepting cyclic paths is much lower under the optimal policy than under the Round-Robin policy ( $70$ versus $550$ ). Furthermore, Figure 9 shows the average number of supplies received at each base under these two policies. It can be seen that many more supplies are received at each base station under the optimal policy. Simulation videos of both cases can be found in [38]. Lastly, to show how the choice of $\beta$ in (13) affects the optimal prefix and suffix cost, we repeat the above procedure for different $\beta$ and the results are summarized in Table III. In the table, the prefix cost equals $\textup{{C}}_{\texttt{pre}}(S_{c})$ and the mean suffix cost equals $\sum_{(S_{c}^{\prime},U_{c}^{\prime})\in\Xi_{acc}}\textup{{C}}_{\texttt{suf}}(S_{c}^{\prime},U_{c}^{\prime})$ from (13). The total suffix cost is computed based on (9) in order to magnify the changes in the suffix cost. It can be noticed that for small non-zero values of $\beta$ , less than $0.2$ , the optimal prefix cost is reduced dramatically (from $180.7$ to $62.4$ ), without increasing the optimal suffix cost by much (from $66.1$ to $67.1$ ).

In order to demonstrate scalability and computational complexity of the proposed algorithm, we repeat the policy synthesis under the same task (20) but for workspaces of various sizes. Particularly, we increase the number of cells from $5^{2}$ to $9^{2}$ , $15^{2}$ , $19^{2}$ , $25^{2}$ , $29^{2}$ . The size of resulting $\mathcal{M}$ , $\mathcal{P}$ , $\Xi_{acc}$ and the time taken to compute them are shown in Table II, where we also list the complexity of the LP (13), which consists of (8) and (11), and the time taken to solve (13). It can be seen from Table II that solving (13) requires a small fraction of total time, compared to the construction of $\mathcal{M}$ , $\mathcal{P}$ and $\Xi_{acc}$ .

V-E Surveillance with Clustered Obstacles

In this case, we demonstrate how the relaxed plan prefix and suffix can be synthesized in scenarios where no AECs can be found. In particular, we consider the surveillance task in (19), but with more obstacles placed in the workspace, as shown in Figure 10. The center cell $(5m,\,5m)$ has probability $0.9$ of being occupied by an obstacle, and the four cells above and on the left each have probability $0.01$ of being occupied by an obstacle. Thus, b1 is surrounded by cells that may contain obstacles, even though the probability is very low.

The resulting product automaton has $1184$ states, $13888$ transitions, and $1$ accepting pair. It can be verified that no AECs exist in $\mathcal{P}$ and thus the second case of Algorithm 1 is activated, where the optimal solution is derived by solving (14). We synthesize the relaxed optimal policy under different $\gamma_{\texttt{prex}}$ and $d$, as shown in Table IV. It took on average $37s$ to synthesize the complete policy for $\beta=0.1$ and any chosen $\gamma_{\texttt{prex}}$ and $d$ in this case. Recall that $d$ is a large positive penalty for entering the set of bad states in (12). In particular, we first choose $\gamma_{\texttt{prex}}=0.1$ and $d=300$. Two simulated trajectories under the derived policy are shown in Figure 10. Furthermore, we perform $1000$ Monte Carlo simulations under the $\gamma_{\texttt{prex}}$ and $d$ listed in Table IV, where we compare the number of times that the robot fails the task by colliding with obstacles (the failure), the number of times that the robot successfully reaches the set of ASCC $S_{c}$ (the prefix success), and the number of times that the robot successfully executes one accepting cyclic path associated with $S_{c}^{\prime}$ and $I_{c}^{\prime}$ of one ASCC (the suffix success). It can be seen that $(1-(1-\gamma_{\texttt{prex}})(1-\gamma_{\texttt{sufx}}))$, $(1-\gamma_{\texttt{prex}})$ and $(1-\gamma_{\texttt{sufx}})$ match very well the probability of failure, the prefix success, and the suffix success, respectively, as discussed in Theorem 6. Also, the system can recover from the bad states and continue executing the task if the recovery policy proposed in (16) is activated. Finally, increasing $\gamma_{\texttt{prex}}$ leads to a lower prefix success rate and decreasing $d$ leads to a lower suffix success rate.
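The three quantities compared against the Monte Carlo counts follow directly from the two risk levels. The Python sketch below evaluates them; the function name `predicted_rates` is an assumption, and the value used for $\gamma_{\texttt{sufx}}$ is a placeholder chosen only for illustration, since it is not stated in this paragraph.

```python
def predicted_rates(gamma_pre, gamma_suf):
    """Rates predicted from the prefix and suffix risk levels.

    Implements the three expressions quoted in the text:
    prefix success 1 - gamma_pre, suffix success 1 - gamma_suf,
    and failure 1 - (1 - gamma_pre) * (1 - gamma_suf).
    """
    for g in (gamma_pre, gamma_suf):
        if not 0.0 <= g <= 1.0:
            raise ValueError("risk levels must lie in [0, 1]")
    prefix_success = 1.0 - gamma_pre
    suffix_success = 1.0 - gamma_suf
    failure = 1.0 - prefix_success * suffix_success
    return {
        "failure": failure,
        "prefix_success": prefix_success,
        "suffix_success": suffix_success,
    }


if __name__ == "__main__":
    # gamma_pre = 0.1 is the value chosen in the text;
    # gamma_suf = 0.05 is a hypothetical placeholder.
    rates = predicted_rates(gamma_pre=0.1, gamma_suf=0.05)
    for name, value in rates.items():
        print(f"{name}: {value:.3f}")
```

Multiplying each rate by the number of runs ($1000$ here) gives the count to compare with the corresponding column of Table IV.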

To demonstrate scalability and computational complexity of the proposed algorithm when AMECs do not exist, we repeat the policy synthesis under the same task (19) but for workspaces of various sizes, as in Section V-D. We set $\gamma=0.3$, $d=300$ and $\beta=0.1$. The size of the resulting $\mathcal{M}$, $\mathcal{P}$, $\Omega_{acc}$ and the time taken to compute them are shown in Table V, where we also list the complexity of the LP (14), which consists of (8) and (12), and the time taken to solve (14). It can be seen from Table V that solving (14) now requires a larger fraction of the total time, compared to the construction of $\mathcal{M}$, $\mathcal{P}$ and $\Omega_{acc}$. However, it requires much less time to compute the set of ASCCs $\Omega_{acc}$ than the set of AMECs $\Xi_{acc}$. For instance, in the case of $29^{2}$ cells in the workspace, it took around $23.1$ seconds to construct $\mathcal{P}$ (which has approximately $2.8\times 10^{4}$ states and $2.9\times 10^{5}$ transitions) and $19.6$ seconds to construct its ASCCs (compared with $160$ minutes in Table II). Once (14) is constructed, it took around $2.5$ minutes to solve it.

V-F Comparison with PRISM

In this section we compare the proposed algorithm to the widely-used model-checking tool PRISM [13]. The following results were obtained using PRISM 4.3.1, where Linear Programming is chosen as the solution method. First, since PRISM does not take the probabilistically-labeled MDP in (1) as input, we translate the product automaton in (5) into the PRISM language and verify its Rabin accepting condition directly. Implementation details can be found in [32]. For tasks (18), (19) and (20), PRISM verifies that the probability of satisfying each of them is $1.0$, within time $0.46s$, $0.38s$ and $6.4s$, respectively. The difference in computation time is likely due to the difference in the LP solvers. Second, in order to test different values of $\gamma$, we use the “multi-objective property” to find the minimal cumulative reward while ensuring that the risk of violating the task is bounded by $\gamma$. Note that the associated model has to be the modified product model $\mathcal{Z}_{\texttt{pre}}$ defined in Section IV-B1, as PRISM does not currently support multi-objective properties with the “F target” operator (i.e., $\Diamond S_{c}$). The computation time is approximately the same as in the previous cases. Last, the current PRISM version does not support the mean-payoff optimization in the AMECs, nor does it generate the relaxed control policy for the case where no AMECs exist in the product automaton. In fact, PRISM will simply return that the maximal probability of satisfying the task is $0$. The MultiGain tool recently proposed in [34] can handle multiple mean-payoff constraints but does not allow the tuning of the satisfaction probability $(1-\gamma)$.

VI Experimental Study

In this section, we present an experimental study. We use a differential-drive “iRobot” whose position we track in real time via an Optitrack motion capture system. The communication among the planning module, the robot actuation module, and the Optitrack is handled by the Robot Operating System (ROS). The software implementation for this experiment is available in [39]. The experiment videos are online [40].

VI-A Model Description

Consider the $2.5m\times 1.5m$ experiment workspace as shown in Figure 11, with three base stations located at the corners and one obstacle region. It consists of $5\times 3$ square cells of dimension $0.5m\times 0.5m$ each. The robot’s motion within the workspace is abstracted similarly as in Section V-A. The resulting MDP has $60$ states and $456$ edges.
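The grid abstraction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the names `cell_center` and `cell_index` are hypothetical, and the four headings per cell are an assumption that happens to reproduce the reported $60$ states ($15$ cells times $4$ headings). The edge count of $456$ depends on the action model of Section V-A and is not reproduced here.

```python
from itertools import product

CELL = 0.5                            # cell side length in metres (from the text)
NX, NY = 5, 3                         # number of cells along x and y (from the text)
ORIENTATIONS = ("N", "E", "S", "W")   # assumption: four headings per cell


def cell_center(i, j):
    """Centre coordinates (in metres) of the cell with grid index (i, j)."""
    return (CELL * (i + 0.5), CELL * (j + 0.5))


def cell_index(x, y):
    """Grid index of the cell containing the point (x, y), in metres."""
    return (int(x // CELL), int(y // CELL))


centers = [cell_center(i, j) for i, j in product(range(NX), range(NY))]
states = [(x, y, o) for (x, y) in centers for o in ORIENTATIONS]

if __name__ == "__main__":
    print(len(centers), "cells,", len(states), "states")
    print("cell containing (1.25, 0.75):", cell_index(1.25, 0.75))
```

The coordinates used later in this section, such as $(1.25m,1.25m)$ and $(2.25m,0.25m)$, are cell centres under this indexing.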

VI-B Experimental Results

We consider two different tasks: first the sequential visiting task (18) and then the surveillance task (19).

VI-B1 Sequential Visiting Task

The LTL task formula is given in (18) and the associated DRA is constructed in Section V-B. The obstacle has probability $0.1$ of appearing in the cell $(1.25m,1.25m)$. The resulting product automaton in this case has $532$ states, $4228$ edges and $1$ accepting pair. For $\gamma=0$ and $\beta=0.1$, it took $3.16s$ to synthesize the complete policy using Algorithm 1, resulting in an average prefix cost of $47.72$ and a suffix cost of $1.0$. Then the robot was controlled in real time using Algorithm 2. The robot state was retrieved using the motion capture system and the observed label was generated randomly. The complete video is online [40] and the resulting trajectory is shown in Figure 12. Notice that the robot completely avoids collision with the obstacle.

VI-B2 Surveillance Task

The LTL task formula is given in (19) and the associated DRA is constructed in Section V-B. The obstacle has probability $0.1$ of appearing in the cell $(1.25m,0.75m)$. The resulting product automaton in this case has $608$ states, $4992$ edges, and $1$ accepting pair.

In the first experiment, we choose $\gamma=0$ and $\beta=0.1$ so that there is no risk allowed in the plan prefix. It took $5.2s$ to synthesize the complete plan offline using Algorithm 1. The real-time execution of the system followed Algorithm 2. The resulting trajectory is shown in Figure 12. In the second experiment, we selected $\gamma=0.1$ and $\beta=0.1$ to allow risk in the plan prefix. It took $4.9s$ to synthesize the complete policy. Compared to the case where $\gamma=0$ , the optimal policy instructs the robot to move forward, straight to the base station at $(2.25m,0.25m)$ , even though there is a risk of colliding with the obstacle at $(1.25m,0.75m)$ due to the uncertainty in its forward action. Both experiment videos are online [40].

Lastly, to demonstrate the proposed scheme for much larger workspaces and more complex tasks, particularly when no AMECs can be found in the product automaton, we create a virtual experiment platform based on V-REP [41], which is available in [32]. A snapshot is shown in Figure 11. The user can easily change the configuration of the workspace and the robot task specification. Once the control policy is synthesized via Algorithm 1 and saved, the user can perform any number of test runs in this environment. Demonstration videos are online [40] where we replicate the surveillance task with clustered obstacles from Section V-E. It can be seen that the relaxed control policy can ensure high probability of avoiding bad states over long time intervals.

VII Conclusion and Future Work

In this paper, we propose a plan synthesis algorithm for probabilistic motion planning, subject to high-level LTL task formulas and risk constraints. Uncertainties in both the robot motion and the workspace properties are considered. We obtain optimal policies that optimize the total cost both in the prefix and suffix of the system trajectory. We also address the case where no AECs exist in the product automaton, in which case the probability of satisfying the task is zero. The proposed solution provides provable guarantees on the probabilistic satisfiability and the mean total-cost optimality, and is verified via both numerical simulations and experimental studies. Future work involves extensions to multi-robot systems.

Bibliography

[1] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2005.
[2] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[3] G. E. Fainekos, A. Girard, H. Kress-Gazit, and G. J. Pappas, “Temporal logic motion planning for dynamic robots,” Automatica, vol. 45, no. 2, pp. 343–352, 2009.
[4] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, Cambridge, 2008.
[5] C. Belta, A. Bicchi, M. Egerstedt, E. Frazzoli, E. Klavins, and G. J. Pappas, “Symbolic planning and control of robot motion,” Robotics & Automation Magazine, IEEE, vol. 14, no. 1, pp. 61–70, 2007.
[6] M. Guo, M. Egerstedt, and D. V. Dimarogonas, “Hybrid control of multi-robot systems using embedded graph grammars,” in Robotics and Automation (ICRA), IEEE International Conference on, 2016.
[7] M. Guo and D. V. Dimarogonas, “Multi-agent plan reconfiguration under local LTL specifications,” The International Journal of Robotics Research, vol. 34, no. 2, pp. 218–235, 2015.
[8] E. M. Wolff, U. Topcu, and R. M. Murray, “Robust control of uncertain Markov decision processes with temporal logic specifications,” in Decision and Control (CDC), Conference on. IEEE, 2012, pp. 3372–3379.
