Theoretical and Experimental Analysis of the Canadian Traveler Problem

Doron Zarchy

arXiv:1702.07001·cs.AI·February 24, 2017

TL;DR

This paper provides a comprehensive theoretical and experimental analysis of the Canadian Traveler Problem, introducing a new variant with dependencies, and proposing an optimal algorithm that outperforms existing methods.

Contribution

It introduces Dep-CTP with dependencies, proves its intractability, and develops Gen-PAO, an optimal algorithm for multiple CTP variants with improved performance.

Findings

01 Dep-CTP is intractable.

02 Gen-PAO solves multiple CTP variants optimally.

03 Pruning methods improve search efficiency.

Equations

$$\sum_{t=0}^{\infty}R(s_{t},a,s_{t+1}),\quad\text{where }a=\pi(s_{t})$$

$$\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a,s_{t+1}),\quad\text{where }a=\pi(s_{t})$$

$V_{\pi,0}(s)$

$V_{\pi,n}(s)$

$$V_{\pi}(s)=\sum_{s^{\prime}\in S}T(s,\pi(s),s^{\prime})\cdot\left(R(s,\pi(s),s^{\prime})+\gamma\cdot V_{\pi}(s^{\prime})\right)$$

$$V^{*}(s)=\max_{\pi(s)\in A}\sum_{s^{\prime}\in S}T(s,\pi(s),s^{\prime})\cdot\left(R(s,\pi(s),s^{\prime})+\gamma\cdot V^{*}(s^{\prime})\right)$$

$$V_{i+1}^{\pi_{k}}(s)\leftarrow\sum_{s^{\prime}}T(s,\pi_{k}(s),s^{\prime})\left[R(s,\pi_{k}(s),s^{\prime})+\gamma V_{i}^{\pi_{k}}(s^{\prime})\right]$$

$$\pi_{k+1}(s)=\arg\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma V^{\pi_{k}}(s^{\prime})\right]$$

$$P(b^{\prime}\mid a,b)=\sum_{o}P(b^{\prime},o\mid a,b)=\sum_{o}P(b^{\prime}\mid o,a,b)P(o\mid a,b)$$

$P(o\mid a,b)$

$P(o\mid s^{\prime},a,b)$

$P(s^{\prime}\mid a,b)$

$$P(b^{\prime}\mid a,b)=\sum_{o}P(b^{\prime}\mid o,a,b)\sum_{s^{\prime}}\Omega(s^{\prime},a,o)\sum_{s}P(s^{\prime}\mid s,a)b(s)$$

$$P(b^{\prime}\mid a,b)=L\cdot\sum_{s^{\prime}}\Omega(s^{\prime},a,o)\sum_{s}P(s^{\prime}\mid s,a)b(s)$$

$$P(s^{\prime}\mid a,s)=\int_{s\in S}P(s^{\prime}\mid s,a)P(s)\,ds$$

$$V^{*}(b)=\max_{a\in A}\left[r(b,a)+\gamma\sum_{b^{\prime}\in B}\tau(b^{\prime},a,b)V(b_{o}^{a})\right]$$

$$v(b)=\max_{0\leq i\leq k}\sum_{s\in S}b(s)v^{i}(s)$$

$$C_{H}(n)=\begin{cases}c(n)&n\in T\\ \min\limits_{\left\langle n,n^{\prime}\right\rangle\in E_{H}}c(\left\langle n,n^{\prime}\right\rangle)+C_{H}(n^{\prime})&n\in N_{OR}\\ \sum\limits_{\left\langle n,n^{\prime}\right\rangle\in E_{H}}p(\left\langle n,n^{\prime}\right\rangle)\left(c(\left\langle n,n^{\prime}\right\rangle)+C_{H}(n^{\prime})\right)&n\in N_{AND}\end{cases}$$

$$h(n)=\min_{i}\,c(\left\langle n,n_{i}\right\rangle)+C_{H}(n_{i})$$

$$\sum_{i}p(\left\langle n,n_{i}\right\rangle)\left(c(\left\langle n,n_{i}\right\rangle)+C_{H}(n_{i})\right)$$

$$R(s,a,s^{\prime})=\begin{cases}w(e)&\text{if }Tr(s,a,s^{\prime})=1\\ 0&\text{otherwise}\end{cases}$$

$$R(s,a,s^{\prime})=\begin{cases}SC(e)&\text{if }Tr(s,a,s^{\prime})=1\\ 0&\text{otherwise}\end{cases}$$

$$O(s,a,z)=\begin{cases}1&\text{if }z=\{st_{e}(s)\mid e\in Inc(v_{j})\}\\ 0&\text{otherwise}\end{cases}$$

$$O(s,a,z)=\begin{cases}1&\text{if }z=st_{e}(s)\\ 0&\text{otherwise}\end{cases}$$

$$stb(e,b)=\begin{cases}Open&\text{if edge }e\text{ is known to be Open in }b\\ Blocked&\text{if edge }e\text{ is known to be Blocked in }b\\ Unknown&\text{otherwise (if the status of edge }e\text{ is Unknown in }b\text{)}\end{cases}$$

$$b(s)=\begin{cases}0&\text{if }G(e,s,b)=0\\ \prod\limits_{\{e\in Unknown(b)\mid st(e,s)=Blocked\}}p(e)\prod\limits_{\{e\in Unknown(b)\mid st(e,s)=Open\}}(1-p(e))&\text{otherwise}\end{cases}$$

$$P_{b}(e,b)=\begin{cases}0&\text{if }stb(e,b)=Open\\ 1&\text{if }stb(e,b)=Blocked\\ P(e)&\text{if }stb(e,b)=Unknown\end{cases}$$

$$Tr(b^{\prime},a,b)=P(b^{\prime}\mid a,b)=\sum_{z}P(b^{\prime},z\mid a,b)=\sum_{z}P(b^{\prime}\mid z,a,b)P(z\mid a,b)$$

$$\hat{E}=\{e\in E\mid stb(e,b)=Unknown,\ stb(e,b^{\prime})\neq Unknown,\ e\in Inc(v_{j}),\ b\in B,\ b^{\prime}\in B\}$$

$$Tr(b^{\prime},a,b)=\prod_{e\in\hat{E},\,stb(e,b^{\prime})=Blocked}p(e)\prod_{e\in\hat{E},\,stb(e,b^{\prime})=Open}(1-p(e))$$

$$Tr(b^{\prime},a,b)=\begin{cases}p(e)&\text{if }stb(e,b)=Blocked\\ 1-p(e)&\text{otherwise}\end{cases}$$

$$R(b,a,b^{\prime})=\begin{cases}C(a)&\text{if and only if }Tr(b,a,b^{\prime})>0\\ 0&\text{otherwise}\end{cases}$$

Taxonomy

Topics: Transportation Planning and Optimization · Energy, Environment, and Transportation Policies · Transportation and Mobility Innovations

Methods: Pruning

Full text

Thesis for the degree Master of Science

Theoretical and Experimental Analysis of the Canadian Traveler Problem

Submitted by: Doron Zarchy

Advisor: Prof. Eyal Shimony

Department of Computer Science

Faculty of Science

Ben Gurion University of the Negev

Abstract

Devising an optimal strategy for navigation in a partially observable environment is one of the key objectives in AI. One of the problems in this context is the Canadian Traveler Problem (CTP). CTP is a navigation problem in which an agent is tasked to travel from a source to a target in a partially observable weighted graph whose edges may be blocked with a certain probability; the agent observes such a blockage only upon reaching one of the edge's endpoints. The goal is to find a strategy that minimizes the expected travel cost. The problem is known to be #P-hard. In this work we study the CTP theoretically and empirically. First, we study Dep-CTP, a CTP variant we introduce that assumes dependencies between the statuses of the edges. We show that Dep-CTP is intractable, and we further analyze two of its subclasses on disjoint-path graphs. Second, we develop a general algorithm that optimally solves the CTP, called General Propagating AO* (Gen-PAO). Gen-PAO is also capable of solving two other types of CTP, called Sensing-CTP and Expensive-Edges CTP. Since the CTP is intractable, Gen-PAO uses pruning methods to reduce the search space for the optimal solution. We also define some variants of Gen-PAO, compare their performance, and show some benefits of Gen-PAO over existing work.

1 Introduction
2 Background
2.1 Markov Decision Process
2.1.1 Policy Iteration
2.2 Partially Observable Markov Decision Process
2.2.1 Value Iteration
2.3 The Canadian Traveler Problem
2.3.1 CTP with Dependencies in Disjoint Path Graphs
2.4 AND/OR Graphs
2.4.1 AO*
2.4.2 CTP and AND/OR graphs
2.5 Models for the Canadian Traveler Problem
2.5.1 POMDP for CTP
2.5.2 Belief State for Representing the Environment of CTP
2.5.3 Belief MDP for CTP
2.6 Related Work
2.6.1 Different Variation of CTP
2.6.2 Disjoint Path Graphs
2.6.3 CTP with Sensing
2.6.4 Propagating AO*
3 Theoretical Analysis of CTP
3.1 CTP with Dependencies
3.2 CTP-Forward-Arcs
3.3 CTP-PATH-DEP
3.4 Theoretical Properties of Belief-MDP for CTP
4 Generalizing PAO*
4.1 General Propagation AO*
4.1.1 Gen-PAO Heuristics
4.1.2 Eliminating Duplicate Nodes
5 Empirical Results
5.1 Varying the Uncertainty of the Graph
5.2 Gen-PAO Heuristic Estimate
5.2.1 Experimental Setting
5.2.2 Varying the Sensing Cost
5.2.3 Varying the Open Probability
5.3 Value of Clairvoyance
6 Summary
6.1 Contributions
6.2 Future work

Chapter 1 Introduction

Planning under uncertainty is one of the most investigated problems in AI. In the real world, efficient navigation requires operation in a partially unknown or dynamically changing environment. Consider a situation where a taxi driver wants to reach his destination in the city in the shortest possible time. The experienced driver knows the road map, and length of each road. Still, the driver does not necessarily have a complete knowledge of the roads’ current status. Some of the roads may be blocked due to traffic jams or police blockades. The driver needs to devise a strategy to reach the destination in the shortest expected time.

A formal model for this kind of problem is the Canadian Traveler Problem (CTP). CTP is a stochastic navigation problem, introduced by [6], in which an agent aims to travel in a weighted graph $G=(V,E)$ from a source vertex $s\in V$ to a target vertex $t\in V$. Each time the agent traverses an edge it pays a travel cost defined by the edge weight. The agent has complete knowledge of the graph structure and the edge costs. However, some of the edges may be blocked with a known probability. The agent observes such a blockage only when it physically arrives at a vertex incident to that edge. The goal is to find a strategy for traveling from $s$ to $t$ that minimizes the expected cost.

[6] showed that finding the optimal solution for the Canadian Traveler Problem is #P-hard. However, some special classes of CTP, such as CTP on disjoint-path graphs and CTP on directed acyclic graphs, are solvable in polynomial time [5, 3].

In this work, we explore certain variations of the CTP. The first variation introduced is CTP with Dependencies (Dep-CTP). In the original problem, the edge statuses are distributed independently. Dep-CTP is a generalization of CTP in which the status of a particular edge may depend on the statuses of other edges. Specifically, we are given a Bayesian network that defines the dependencies between the edges. The second variant is CTP with remote sensing (CTP with sensing), introduced by [3]. In CTP with sensing, an agent may perform sensing on any edge, at a given sensing cost, in order to reveal its status. The third variant is Expensive-Edge CTP, a variant of CTP in which edges cannot be blocked but may be expensive, incurring a high travel cost when traversed.

This work takes two different approaches to studying the CTP: theoretical analysis and experimental analysis. On the theoretical side, we attempt to classify certain classes of Dep-CTP by their computational complexity, using probabilistic models such as the belief-MDP and AND/OR graphs, and we show some general properties of CTP with sensing. On the empirical side, we introduce the Gen-PAO algorithm, a generalization of PAO* [1] that optimally solves the CTP, CTP with sensing, and Exp-CTP. Gen-PAO uses several pruning methods to reduce the size of the searched state space and the running time. In addition, we explore the value of clairvoyance, which represents the value of having full knowledge of the graph.

The remainder of this work is organized as follows. Chapter 2 contains formal definitions of the Canadian Traveler Problem and its variants. In addition, it provides background on decision and probabilistic models and reviews a number of related algorithms. Chapter 3 presents proofs concerning the hardness of Dep-CTP and of two of its subclasses. In addition, some theoretical properties of CTP with sensing are shown. Chapter 4 introduces the Gen-PAO algorithm and some of the pruning methods it uses. Chapter 5 provides empirical results, comparing the performance of Gen-PAO and some of its variants. In addition, results concerning the value of clairvoyance are presented. Chapter 6 summarizes this work and discusses possible directions for future research. Appendix A presents some of the instances used in the empirical analysis.

Chapter 2 Background

2.1 Markov Decision Process

A Markov Decision Process (MDP) is a framework for sequential stochastic decision problems with a fully observable environment. Formally, an MDP is defined by the tuple $\langle S,A,T,R\rangle$, where

•

S is a finite set of states where $s\in S$ describes the environment at a specific time step.

•

A is a set of actions.

•

$T:(S\times A\times S)\rightarrow[0,1]$ is the transition function, where $T(s,a,s^{\prime})$ specifies the probability of entering a state $s^{\prime}\in S$ given the previous state $s\in S$ and the action $a\in A$.

•

$R:(S\times A\times S)\rightarrow\mathbf{R}$ is the reward function, which specifies the reward $R(s,a,s^{\prime})$ received by transitioning from state $s\in S$ to state $s^{\prime}\in S$ after performing action $a\in A$ in $s$.

At each time step, the agent in state $s\in S$ chooses an action $a\in A$, reaches state $s^{\prime}\in S$ with probability $T(s,a,s^{\prime})$, and obtains reward $R(s,a,s^{\prime})$. A deterministic variant of the MDP has deterministic actions, where each pair of action $a$ and state $s$ determines the resulting state $s^{\prime}$, i.e., there exists a state $s^{\prime}\in S$ for which $T(s,a,s^{\prime})=1$, and $T(s,a,\hat{s})=0$ for every $\hat{s}\neq s^{\prime}$. The model assumes that the transitions are Markovian, in the sense that the probability of reaching a state depends only on the previous state and the action rather than on the history of earlier states. The solution of an MDP is a policy. A policy $\pi:S\rightarrow A$ is a mapping from the set of states to the set of actions. At each time step the policy is executed, starting from an initial state $s_{0}$. With a complete policy, the agent always knows what to do next. However, because the environment is stochastic, executing the same policy can lead to different environment histories. The decision-making problem may have a finite horizon or an infinite horizon. A finite horizon bounds the number of time steps for which the agent acts (or, equivalently, treats the rewards after time $N$ as zero). In this case the utility function is usually an additive reward function:

$$\sum_{t=0}^{\infty}R(s_{t},a,s_{t+1}),\quad\text{where }a=\pi(s_{t})$$

However, in infinite horizon the time sequence is unbounded. In infinite horizon the utility function is computed with discount rate $0<\gamma<1$ :

$$\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a,s_{t+1}),\quad\text{where }a=\pi(s_{t})$$

This utility is called discounted reward. Usually, the performance of an agent is the utility of sequence of states, which is measured by the sum of rewards for the states visited. The utility of a policy $\pi(s)$ in state s is the expected cost over all possible state sequences, starting from s until the MDP terminates. The utility of a policy $\pi(s)$ in finite horizon is computed using dynamic programming:

[TABLE]

Where the utility of a policy $\pi(s)$ in finite horizon is given by,

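$$V_{\pi}(s)=\sum_{s^{\prime}\in S}T(s,\pi(s),s^{\prime})\cdot\left(R(s,\pi(s),s^{\prime})+\gamma\cdot V_{\pi}(s^{\prime})\right)$$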

In an infinite horizon we usually have a terminal state. The optimal policy is a policy that yields the highest expected utility (or the lowest, depending on the specification of the problem). State values can be computed by an algorithm called value iteration. The value iteration algorithm computes the value of every state $s$ using a reasoning process that goes backwards in time, from the end, in order to determine the optimal sequence of actions. Once the last action is chosen, we can determine the best second-to-last action, and so on. This process continues until a best action has been obtained for every state. We compute the value of each state $s$ under the optimal policy $\pi^{*}$ using the Bellman equations:

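$$V^{*}(s)=\max_{\pi(s)\in A}\sum_{s^{\prime}\in S}T(s,\pi(s),s^{\prime})\cdot\left(R(s,\pi(s),s^{\prime})+\gamma\cdot V^{*}(s^{\prime})\right)$$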

This process is iterated until it reaches equilibrium, which indicates the convergence of the algorithm.
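As an illustration of the Bellman backup described above, here is a minimal Python sketch of value iteration on a toy MDP. The dictionary layout of `T`, the two-state example, and the parameter names `gamma` and `tol` are assumptions made for this sketch; they do not come from the thesis.

```python
def value_iteration(T, R, gamma=0.9, tol=1e-9):
    """Repeat Bellman backups until the largest change falls below tol.

    T[s][a] is a list of (next_state, probability) pairs; a state with no
    actions is treated as terminal. R(s, a, s2) returns the reward.
    (This data layout is an assumption of the sketch.)
    """
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s, acts in T.items():
            if not acts:  # terminal state keeps value 0
                continue
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in outcomes)
                for a, outcomes in acts.items()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Hypothetical example: from A, action "go" reaches the goal G with
# probability 0.8 and stays in A otherwise; every step has reward -1.
T = {"A": {"go": [("G", 0.8), ("A", 0.2)]}, "G": {}}
V = value_iteration(T, lambda s, a, s2: -1.0)
```

In this example the fixed point satisfies $V(A)=-1+0.9\cdot 0.2\cdot V(A)$, so the loop should settle near $-1/0.82$.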

2.1.1 Policy Iteration

Another approach for solving an MDP is policy iteration. Policy iteration obtains a feedback strategy by iterative search in the space of policies. The algorithm is based on two steps. The first step is evaluation, in which the algorithm evaluates the values of the states given a fixed action for each state:

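$$V_{i+1}^{\pi_{k}}(s)\leftarrow\sum_{s^{\prime}}T(s,\pi_{k}(s),s^{\prime})\left[R(s,\pi_{k}(s),s^{\prime})+\gamma V_{i}^{\pi_{k}}(s^{\prime})\right]$$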

This can be done by solving a set of linear equations. After the values are computed for the given actions, the algorithm performs the second step, improvement. The algorithm considers whether it can improve the policy by choosing a new action for a state. If such an action exists, the policy adopts the new action:

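$$\pi_{k+1}(s)=\arg\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma V^{\pi_{k}}(s^{\prime})\right]$$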

The algorithm guarantees that each iteration strictly improves the value of the policy. Therefore, the algorithm stops when no available action improves the policy value. The number of possible policies is at most $|A|^{|S|}$, where $|S|$ is the number of states and $|A|$ is the number of actions. Since the policy improves at each iteration, no policy is visited twice, and the algorithm finds the optimal policy within at most $|A|^{|S|}$ iterations.
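The two steps of policy iteration can be sketched in Python as follows. This is an illustrative sketch only: evaluation is done by repeated sweeps rather than by solving the linear system, and the data layout and the "safe"/"risky" toy example are assumptions, not taken from the thesis.

```python
def q_value(T, R, V, s, a, gamma):
    """Expected one-step return of taking action a in state s."""
    return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[s][a])


def evaluate(T, R, policy, gamma, tol=1e-10):
    """Evaluation step: iterate the fixed-policy backup until it converges."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s, a in policy.items():
            new = q_value(T, R, V, s, a, gamma)
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < tol:
            return V


def policy_iteration(T, R, gamma=0.9):
    # Start from the first listed action in every non-terminal state.
    policy = {s: next(iter(acts)) for s, acts in T.items() if acts}
    while True:
        V = evaluate(T, R, policy, gamma)
        stable = True
        for s in policy:
            best = max(T[s], key=lambda a: q_value(T, R, V, s, a, gamma))
            # Improvement step: switch only on a strict improvement.
            if q_value(T, R, V, s, best, gamma) > q_value(T, R, V, s, policy[s], gamma) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical example: "safe" reaches the goal G for reward -3; "risky"
# costs 1 per attempt and succeeds with probability 0.5.
T = {"A": {"safe": [("G", 1.0)], "risky": [("G", 0.5), ("A", 0.5)]}, "G": {}}
R = lambda s, a, s2: -3.0 if a == "safe" else -1.0
policy, V = policy_iteration(T, R)
```

Starting from "safe", one improvement step switches to "risky", whose value solves $V(A)=-1+0.9\cdot 0.5\cdot V(A)$; the next round finds no strictly better action and stops.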

2.2 Partially Observable Markov Decision Process

A partially observable Markov decision process (POMDP) is a generalization of the standard MDP in which the environment is not fully observable, allowing for imperfect information about the current state of the environment. In the real world the input may not always be precise, and data may be received with noise. In robot navigation, for instance, the robot receives its input through sensors that do not describe the environment precisely. Sonar or voice sensors are usually somewhat noisy, and digital video loses information by using a discrete representation to describe a continuous environment. The POMDP is used as a framework for theoretical decision making and reasoning under uncertainty. Such problems arise in a wide range of application domains, including assistive technologies, mobile robotics, and preference elicitation. Many real POMDP problems are naturally modeled with continuous states and observations. For instance, in a robot navigation task, the state corresponds to coordinates in space and the observations may correspond to distances measured by sonar. A common approach to a continuous model requires discretizing and approximating the continuous components on a grid. This usually leads to an important tradeoff between complexity and accuracy as the coarseness of the discretization changes. In a discrete-time POMDP, at each time period the agent is in some state $s$, chooses an action $a$, and receives a reward with some expected value. After performing the action, the agent makes a transition to a new state according to a state distribution and receives an observation of the environment with a probability that depends on the new state.

Formally, a POMDP is an extension of the MDP defined by the tuple $\langle S,A,T,\Omega,R\rangle$ where: $S$ is a finite set of states that represent the current situation in the environment. $A$ is the set of actions from which the agent chooses in each state. $T$ (the transition function) maps $S\times A$ into a distribution over the states $S$; $T(s^{\prime}|s,a)$ is the probability of reaching $s^{\prime}$ when the agent is in state $s$ and performs action $a$. $R$ is the reward function, which maps each triple in $S\times A\times S$ to a number that represents the reward or the penalty. The observation function $\Omega(s^{\prime},a,o)$ describes the probability of observation $o$ given that action $a$ was performed and state $s^{\prime}$ was reached.

Generally, in a POMDP we do not know the current state. The only information available about the environment is the observations. Therefore, the POMDP defines a vector of probabilities $b(s)$ whose size is that of the state set, called the belief state, which specifies for each state $s$ the probability that the environment is in $s$.

Similarly to the MDP, the goal of the POMDP is to construct a policy $\pi^{*}$ that maximizes the expected reward $E[\sum^{T}_{t=0}{\gamma^{t}r(s_{t},a_{t},s^{\prime}_{t})}]$, where $T$ is the number of time steps left in a finite horizon, or $T=\infty$ in an infinite horizon. Since the agent does not know the exact state of the environment, the reward function is defined over the belief state, i.e. $R(b,a)=\sum_{s\in S}b(s)R(s,a)$; in the case of a continuous belief space the sum becomes an integral. The belief state of the environment is based on the previous belief state. Thus, after being in belief state $b$, choosing action $a$, and receiving an observation $o$, the agent updates its belief to $b^{\prime}$ in the following way:

$$P(b^{\prime}\mid a,b)=\sum_{o}P(b^{\prime},o\mid a,b)=\sum_{o}P(b^{\prime}\mid o,a,b)P(o\mid a,b)$$

using the product rule we get

$$P(o\mid a,b)=\sum_{s^{\prime}}P(o\mid s^{\prime},a,b)P(s^{\prime}\mid a,b)$$

When we put it all together we get:

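$$P(b^{\prime}\mid a,b)=\sum_{o}P(b^{\prime}\mid o,a,b)\sum_{s^{\prime}}\Omega(s^{\prime},a,o)\sum_{s}P(s^{\prime}\mid s,a)b(s)$$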

$P(b^{\prime}|o,a,b)=1$ where $b^{\prime}=forward(o,a,b)$.

Let $L$ be the number of times that $P(b^{\prime}|o,a,b)=1$.

Therefore,

$$P(b^{\prime}\mid a,b)=L\cdot\sum_{s^{\prime}}\Omega(s^{\prime},a,o)\sum_{s}P(s^{\prime}\mid s,a)b(s)$$
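To make the role of the terms $\Omega(s^{\prime},a,o)$ and $\sum_{s}P(s^{\prime}|s,a)b(s)$ concrete, the following Python sketch computes the standard normalized belief update $b^{\prime}(s^{\prime})\propto\Omega(s^{\prime},a,o)\sum_{s}P(s^{\prime}|s,a)b(s)$ for a discrete POMDP. The function signature and the two-state "listen" example are assumptions made for illustration and are not taken from the thesis.

```python
def belief_update(b, a, o, P, Omega):
    """Return the updated belief and the normalizer P(o | a, b).

    b:      dict mapping state -> probability
    P:      function P(s2, s, a) = probability of reaching s2 from s under a
    Omega:  function Omega(s2, a, o) = probability of observing o in s2 after a
    """
    unnormalized = {
        s2: Omega(s2, a, o) * sum(P(s2, s, a) * b[s] for s in b)
        for s2 in b
    }
    evidence = sum(unnormalized.values())  # this sum equals P(o | a, b)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: v / evidence for s2, v in unnormalized.items()}, evidence


# Hypothetical example: the hidden state is "L" or "R", the action "listen"
# leaves it unchanged, and the observation is correct with probability 0.85.
P = lambda s2, s, a: 1.0 if s2 == s else 0.0
Omega = lambda s2, a, o: 0.85 if o == "hear_" + s2 else 0.15
b = {"L": 0.5, "R": 0.5}
b2, evidence = belief_update(b, "listen", "hear_L", P, Omega)
```

With a uniform prior, a single "hear_L" observation moves the belief in "L" to about 0.85, and the normalizer is about 0.5.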

A generalization of the discrete POMDP is the case where the belief-state space is continuous. In this case we still assume that the actions and observations are discrete, and the propagation is defined by the integral

$$P(s^{\prime}\mid a,s)=\int_{s\in S}P(s^{\prime}\mid s,a)P(s)\,ds$$

2.2.1 Value Iteration

Having defined the probability update and the reward function for belief states, we can transform the POMDP into a belief-state MDP by casting the POMDP problem as a fully observable MDP, in which the belief states of the POMDP become the simple states of the MDP. The MDP here is continuous, over an $|S|$-dimensional state space. The transformation allows a value function to be defined for each belief state according to the Bellman equation:

$$V^{*}(b)=\max_{a\in A}\left[r(b,a)+\gamma\sum_{b^{\prime}\in B}\tau(b^{\prime},a,b)V(b_{o}^{a})\right]$$

This means that the value of belief state $b$ is the reward of taking the best action in $b$ plus the discounted expected value of the resulting belief state $V(b^{a}_{o})$, where $b^{a}_{o}$ is the unique belief state computed from $b$, $a$, and $o$ by the belief update above. Solving the value iteration by dynamic programming yields the optimal solution in the limit; however, the space of belief states that have to be backed up is enormous. Because exact value iteration is intractable, a lot of work has focused on approximate algorithms. One of the most promising approaches for finding an approximate solution is point-based value iteration (PBVI). In PBVI, instead of optimizing the value function over the entire belief space, only specific reachable beliefs are considered. The belief points are selected heuristically and the values are computed only for these points. The heuristic simulates trajectories in order to find reachable beliefs. The success of PBVI depends on the selection of the belief points. In particular, the belief points should cover the space as evenly as possible. The set of belief points is expanded over time in order to cover more of the reachable belief space. Adding more points increases the accuracy of the value function.

The key to a practical implementation of a dynamic programming algorithm is a piecewise-linear and convex representation of the value function. The reward function $r(b,a)$ as defined above is linear. The exact solution of a POMDP is based on the proof of Smallwood and Sondik (1973), which takes advantage of the fact that the exact solution is a piecewise-linear convex function and can be represented by hyperplanes in the space of beliefs. Each hyperplane is a vector of $|S|$ real numbers, and the value function is represented by the set $V=\{v^{1},v^{2},\ldots,v^{k}\}$, where the value of each belief state is defined as follows:

$$v(b)=\max_{0\leq i\leq k}\sum_{s\in S}b(s)v^{i}(s)$$

Each hyperplane corresponds to a single action, and the value iteration updates can be performed directly on these hyperplanes.
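The piecewise-linear convex representation can be illustrated with a short Python sketch that evaluates a belief against a set of hyperplanes and reports which one attains the maximum. The three hyperplanes and their numeric values are hypothetical and chosen only for illustration.

```python
def belief_value(b, hyperplanes):
    """Return (value, index) of the maximizing hyperplane for belief b.

    b and each hyperplane are dicts keyed by state; the value of b is the
    maximum over hyperplanes of the inner product sum_s b(s) * v(s).
    """
    scores = [sum(b[s] * v[s] for s in b) for v in hyperplanes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# Hypothetical hyperplanes over two states, one per action.
hyperplanes = [
    {"L": 10.0, "R": -100.0},   # commit to L
    {"L": -100.0, "R": 10.0},   # commit to R
    {"L": -1.0, "R": -1.0},     # gather more information
]
uncertain_value, uncertain_action = belief_value({"L": 0.5, "R": 0.5}, hyperplanes)
confident_value, confident_action = belief_value({"L": 0.95, "R": 0.05}, hyperplanes)
```

Under the uniform belief the information-gathering hyperplane dominates; once the belief is concentrated on "L", the first hyperplane takes over. The maximum over linear functions is what makes the value function convex in the belief.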

2.3 The Canadian Traveler Problem

In the Canadian Traveler Problem (CTP) [Papadimitriou and Yannakakis, 1991] a traveling agent is given a tuple $(G,P,w,s,t)$ as input, where $G=(V,E)$ is a connected weighted graph with a source vertex $s\in V$ and a target vertex $t\in V$. The input graph $G$ may undergo changes that are not known to the agent before the agent begins to act, but it remains fixed subsequently. In particular, some of the edges in $E$ may become blocked and thus untraversable. Each edge $e$ in $G$ has a weight, or cost, $w(e)$, and is blocked with probability $P(e)$, where $P(e)$ is known to the agent. (Note that it is sufficient to deal only with blocking of edges, since a blocked vertex would have all of its incident edges blocked.) The agent can perform move actions along an unblocked edge, each of which incurs a travel cost $w(e)$. Traditionally, the CTP was defined such that the status of an edge can only be revealed upon arriving at a node incident to that edge, i.e., only local sensing is allowed. In this paper we call this variant the basic CTP variant. The task of the agent is to travel from $s$ to $t$ while aiming to minimize the total travel cost $C_{travel}$. As the exact travel cost is uncertain until the end, the task is to devise a traveling strategy that yields a small (ideally optimal) expected travel cost.
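For very small instances, the optimal expected travel cost of the basic CTP can be computed by brute force. The Python sketch below is an illustrative exhaustive search, not the Gen-PAO algorithm of this work. It assumes independent edges, and it groups moves into "macro-moves": the agent walks along known open edges to a vertex where new information is revealed, or to $t$. The function name `solve_ctp`, the edge encoding, and the example graph are all hypothetical.

```python
import heapq
from functools import lru_cache
from itertools import product


def solve_ctp(edges, s, t):
    """Optimal expected cost; edges maps (u, v) -> (weight, blocking probability)."""
    E = list(edges)
    inc = {}
    for i, (u, v) in enumerate(E):
        inc.setdefault(u, []).append(i)
        inc.setdefault(v, []).append(i)

    def has_unknown(v, status):
        return any(status[i] == "U" for i in inc[v])

    def reveal(v, status):
        """Yield (probability, new status) for every outcome of observing at v."""
        unknown = [i for i in inc[v] if status[i] == "U"]
        for combo in product([True, False], repeat=len(unknown)):
            prob, new = 1.0, list(status)
            for i, blocked in zip(unknown, combo):
                p = edges[E[i]][1]
                prob *= p if blocked else 1.0 - p
                new[i] = "B" if blocked else "O"
            if prob > 0.0:
                yield prob, tuple(new)

    def macro_moves(v, status):
        """Shortest known-open distances from v to t and to informative vertices."""
        dist, heap = {v: 0.0}, [(0.0, v)]
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist[x] or (x != v and (x == t or has_unknown(x, status))):
                continue  # stale entry, or arriving at x ends the macro-move
            for i in inc[x]:
                if status[i] == "O":
                    y = E[i][1] if E[i][0] == x else E[i][0]
                    nd = d + edges[E[i]][0]
                    if nd < dist.get(y, float("inf")):
                        dist[y] = nd
                        heapq.heappush(heap, (nd, y))
        return {u: d for u, d in dist.items()
                if u != v and (u == t or has_unknown(u, status))}

    @lru_cache(maxsize=None)
    def value(v, status):
        if v == t:
            return 0.0
        best = float("inf")  # stays infinite if t cannot be reached
        for u, d in macro_moves(v, status).items():
            cost = d
            if u != t:
                cost += sum(p * value(u, ns) for p, ns in reveal(u, status))
            best = min(best, cost)
        return best

    start = tuple("U" for _ in E)
    return sum(p * value(s, ns) for p, ns in reveal(s, start))


# Hypothetical instance: a cheap two-edge route whose second edge is blocked
# with probability 0.5, and an expensive direct edge that is always open.
example = {("s", "a"): (1.0, 0.0), ("a", "t"): (1.0, 0.5), ("s", "t"): (10.0, 0.0)}
expected_cost = solve_ctp(example, "s", "t")
```

In this instance, trying the cheap route costs $1+0.5\cdot 1+0.5\cdot 11=7$ in expectation, which beats the direct edge of cost 10. The number of belief states grows exponentially with the number of edges, so this approach is only practical for toy graphs.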

A somewhat more general version of CTP is CTP with sensing. CTP with sensing is a tuple $(G,P,SC,w,s,t)$ , where in this variant, in addition to move actions (and local sensing), an agent situated at a vertex $v$ can also perform a Sense action and query the status of any edge $e\in G$ . This action is denoted $sense(v,e)$ , and incurs a cost $SC(v,e)$ , or just $SC(e)$ when the cost does not depend on $v$ . The cost function is domain-dependent, as discussed below. The task of the agent is to travel to the goal while minimizing a total cost $C_{total}=C_{travel}+C_{sensing}$ .

We further generalize CTP to allow dependencies between edges, and non-binary edge weight distributions. In this general form, CTP-Gen is a 5-tuple $(G,W,SC,s,t)$ where $G=(V,E)$ is a graph, $W$ is a distribution over weights of the edges $E$ , $SC:V\times E\rightarrow{\mathcal{R}}^{+}$ is a sensing cost function, $s,t\in V$ are the start and goal vertices, respectively. The distribution model $W$ is over random variables indexed by the edges in $E$ , abusing notation we will use the edges in place of the respective random variables. The domain of these random variables are arbitrary weights or cost sets. $W$ is usually specified as a structured distribution model over the random variables $e\in E$ . Henceforth we assume that $W$ is specified as a Bayes network $(E,A,P)$ over these random variables, where $E$ is the set of random variables, $A$ is a set of directed arcs so that $(E,A)$ is a directed acyclic graph, and $P$ are the conditional probability tables, one for each $e\in E$ .

We mostly limit ourselves to the binary case where the edges can be blocked (“infinite weight”) or open (some known weight, possibly different for each edge). In these cases, and to simplify the representation of the distribution, we use a uniform binary domain $\{Blocked,Unblocked\}$ for the edges, and describe the weight of the (unblocked) edges separately, by a weight function $w:E\rightarrow{\mathcal{R}}^{+}$. In the degenerate binary case where $W$ is a Bayes network with no arcs ($A=\emptyset$), i.e. all random variables are independent, the problem reduces back to the basic CTP with sensing. In this case we usually specify the distribution as a function $p:E\rightarrow[0,1]$, the probability that each edge $e\in E$ is blocked.
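In the independent binary case, the probability of a complete assignment of edge statuses is simply a product over the edges. The short Python sketch below enumerates all such assignments for a small hypothetical graph; the function name `realizations` and the example probabilities are assumptions made for illustration.

```python
from itertools import product


def realizations(p):
    """Yield (assignment, probability) for every blocked/open assignment.

    p maps each edge to its blocking probability; edges are independent,
    so the probability of an assignment is a product over the edges.
    """
    edges = list(p)
    for combo in product([True, False], repeat=len(edges)):  # True = blocked
        prob = 1.0
        for e, blocked in zip(edges, combo):
            prob *= p[e] if blocked else 1.0 - p[e]
        yield dict(zip(edges, combo)), prob


# Hypothetical blocking probabilities for a three-edge graph.
p = {("s", "a"): 0.2, ("a", "t"): 0.5, ("s", "t"): 0.0}
table = list(realizations(p))
total = sum(prob for _, prob in table)
all_open = next(prob for assignment, prob in table if not any(assignment.values()))
```

There are $2^{|E|}$ assignments, which is why explicit enumeration is feasible only for small graphs; with dependencies, the product would be replaced by the joint probability given by the Bayes network.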

2.3.1 CTP with Dependencies in Disjoint Path Graphs

As CTP-Gen is extremely complicated, we focus on some special cases w.r.t. the topology of the graph $G$ . Specifically, we examine the basic CTP with no remote sensing where $G$ is a disjoint-path graph (w.r.t. $s,t$ ). As this case is known to be solvable in closed form in polynomial time, we generalize it to the case where edges are dependent, and edge weights are binary (blocked/unblocked) random variables. Thus we consider CTP-DEP, defined by the 5-tuple $(G,W,w,s,t)$ where $G$ is an undirected CTP graph, $W$ is a Bayes network representing the edge blocking distribution model, $w$ is a function denoting the edge weights (for unblocked edges), and $s,t$ are the start and goal vertices respectively, as usual.

As we will show, finding an optimal solution for CTP-DEP is intractable even for special cases, and we will thus consider cases where $W$ has dependencies only between edges on the same path. Thus the Bayes network $W$ representing the distribution model has one (or more) unconnected component for each set of edges composing a path. We call this simplified variant CTP-PATH-DEP.

In a disjoint-path graph, we index the edges such that each edge has two indices, where the first index indicates the path and the second index indicates the serial location of the edge in the path. For instance, $e_{i,1},e_{i,2},\ldots,e_{i,k_{i}}$ are the edges composing the $i$'th path. Similarly, we index the vertices such that the first index indicates the path and the second index indicates the serial location of the vertex in the path; $s,v_{i,1},v_{i,2},\ldots,v_{i,m_{i}},t$ are the vertices composing the $i$'th path. Note that each edge $e_{i,j}$ can be represented by $(v_{i,j-1},v_{i,j})$.

2.4 AND/OR Graphs

Many problems in artificial intelligence can be formulated as state space search problems. The AND/OR graph is a directed graph that represents a problem solving process. The solution of an AND/OR graph is a subgraph, called the solution graph, that is a derivation of the optimal solution of the original problem. In this work, we use the AND/OR graph for finding optimal solutions to probabilistic reasoning problems, and CTP in particular. With a slight abuse of notation, we use the same term, graph, to indicate both the CTP graph and the AND/OR graph.

Formally, an AND/OR graph is a tuple $G_{AO}=(N_{AND}\cup N_{OR},E,T,c,p)$ defined as follows:

•

$N=N_{AND}\cup N_{OR}$ where $N_{AND},N_{OR}$ are finite sets of nodes.

•

$T\subset N_{OR}$ is the set of terminal leaf nodes.

•

$E\subset(N_{AND}\times N_{OR})\bigcup(N_{OR}\times T\backslash N_{AND})$ is a set of directed edges between the nodes.

•

$c:E\cup T\rightarrow\mathcal{R}$ is a cost function over the edges and terminal cost.

•

The graph associates probabilities $p\left\langle n,n^{\prime}\right\rangle$ with the edges $\left\langle n,n^{\prime}\right\rangle$ such that, for every $n\in N_{AND}$ , $\sum_{\left\langle n,n^{\prime}\right\rangle\in E}{p\left\langle n,n^{\prime}\right\rangle}=1$ .

The root node of $G_{AO}$ is denoted by $n_{0}$ . A policy graph of the AND/OR graph is a subgraph $H=(N_{H},E_{H})$ of $G_{AO}$ such that

•

$n_{0}\in N_{H}$

•

If $n\in N_{H}\backslash T$ is AND node, all its children are in H.

•

If $n\in N_{H}$ is OR node then only one of its children is in H.

•

Every leaf node (node with no children) in $G_{AO}$ is terminal.

and,

•

If $n\in N_{H}\backslash T$ is AND node, all outgoing edges are in $E_{H}$ .

•

If $n\in N_{H}$ is OR node, and $n^{\prime}\in H$ is a child of n, then $\left\langle n,n^{\prime}\right\rangle\in E_{H}$ .

Define $subpolicy$ of node $n$ denoted by $SH_{n}$ to be a subgraph of $G_{AO}$ that satisfies all properties of $H$ except that the root of $SH_{n}$ is an arbitrary node $n$ instead of $n_{0}$ .

The value of each node $n\in N_{H}$ is defined as follows:

[TABLE]

The child of an OR node with the minimal value is called the preferred son. The cost of the policy graph is defined as $C_{H}(n_{0})$ . The policy graph is optimal if there is no other policy graph with a lower cost.
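As a minimal illustration of how the cost $C_{H}(n_{0})$ of a policy graph could be evaluated, the following Python sketch assumes the usual recursive definition: a terminal node costs its terminal cost, an OR node costs the edge cost to its single selected child plus that child's value, and an AND node costs the probability-weighted sum over all of its children. The dictionary encoding and the names `policy_cost`, `terminal`, `or_choice` and `and_children` are hypothetical and are not taken from the text.

```python
def policy_cost(node, terminal, or_choice, and_children):
    """Recursively evaluate the cost of the policy graph rooted at `node`.

    terminal:     {terminal node: terminal cost}
    or_choice:    {OR node: (selected child, edge cost)}
    and_children: {AND node: [(child, edge cost, probability), ...]}
    Assumed value definition (not copied from the paper's table).
    """
    if node in terminal:
        return terminal[node]
    if node in or_choice:
        child, cost = or_choice[node]
        return cost + policy_cost(child, terminal, or_choice, and_children)
    # AND node: every child belongs to the policy graph
    return sum(p * (cost + policy_cost(child, terminal, or_choice, and_children))
               for child, cost, p in and_children[node])


# Hypothetical toy AND/OR graph: root "r" chooses between actions "a" and "b".
terminal = {"t1": 0.0, "t2": 0.0, "t3": 0.0}
and_children = {
    "a": [("t1", 0.0, 0.5), ("x", 0.0, 0.5)],
    "c": [("t2", 0.0, 1.0)],
    "b": [("t3", 0.0, 1.0)],
}
policy_a = {"r": ("a", 1.0), "x": ("c", 10.0)}   # policy graph that selects "a"
policy_b = {"r": ("b", 2.0)}                     # policy graph that selects "b"

cost_a = policy_cost("r", terminal, policy_a, and_children)
cost_b = policy_cost("r", terminal, policy_b, and_children)
```

In this toy instance the policy graph through "b" is the cheaper of the two, so "b" would be the preferred son of the root.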

We define a policy subgraph to be a subgraph of $N_{H}$ such that

•

$n_{0}\in N_{H}$

•

If $n\in N_{H}\backslash T$ is AND node, all its children are in H.

•

If $n\in N_{H}$ is OR node then only one of its children is in H.

We define a policy subgraph $H_{s}$ to be a subgraph of $G_{AO}$ that satisfies all properties of a policy graph except that the leaves are not necessarily terminal. The cost of the policy subgraph is defined as $C_{H_{s}}(n_{0})$ . The best policy subgraph is a policy subgraph with the minimal cost in the AND/OR graph.

2.4.1 AO*

AO* is a heuristic search algorithm that searches the AND/OR graph for an optimal policy graph. It gradually builds up a partial policy graph, assigns heuristic values to the leaves, and propagates the heuristic values up to the root. The heuristics, used to estimate the real cost of the nodes in the AND/OR graph, are admissible, and therefore finding the optimal solution is guaranteed. AO* is beneficial when solving problems with a large state space. The algorithm does not assume that the AND/OR graph representing the problem is given; instead, it constructs the AND/OR graph by expanding it in each iteration, thereby developing the best partial policy graph in each iteration. The process ends when all the leaf nodes of the partial policy graph are terminal.

AO* takes advantage of the fact that once a node is known to be in the optimal policy graph, it does not require any further expansion. Thus, the algorithm maintains a boolean flag called “SOLVED” for each node in the AND/OR graph, which signals to the algorithm whether the node is part of the optimal policy; i.e. a node $n$ is set SOLVED, by the operation MarkSolved(n), if $n$ is known to be in the optimal policy subgraph. Once a node is SOLVED, it remains SOLVED. A node $n$ is SOLVED if and only if all the nodes in the subpolicy $SH_{n}$ spanned from $n$ are solved. Hence, when a node $n$ is set SOLVED, the subpolicy $SH_{n}$ spanned from this node does not require any further update or expansion. To implement the “solving” process, AO* performs MarkSolved(n) if node $n$ satisfies one of the following:

•

$n$ is a terminal node

•

$n$ is an AND node and all its children are set SOLVED.

•

$n$ is an OR node and its preferred son is set SOLVED.

Basically, each iteration of the AO* algorithm has two phases: expansion and propagation, described as follows:

•

Expansion phase:

1. Trace down the marked edges (directed edges) from $n_{0}$ until reaching a non-terminal leaf node $n$ , and expand it. (Finding the expansion nodes requires a recursive exploration of the AND/OR graph, since the partial policy graph changes in each iteration.)

2. For each child $n_{i}\in\{n_{1},n_{2},...,n_{k}\}$ of $n$ , if $n_{i}$ has not been generated, then add it to the policy graph and assign it an admissible heuristic value. If $n_{i}$ is a terminal node, then assign 0 to its heuristic value and perform $MarkSolved(n_{i})$ .

•

Propagation phase: In the propagation phase, the heuristic values and marked edges of the expansion nodes are propagated from the leaves up to the root. The propagation proceeds as follows:

1. If $n$ is an OR node, then its heuristic value is updated by,

[TABLE]

The marked edge is directed from $n$ to the child $n_{i}$ which achieves the minimum in equation 2.3, and $n$ is set SOLVED if and only if $n_{i}$ is set SOLVED.

2. If $n$ is an AND node, then its heuristic value is updated by,

[TABLE]

The node n is set SOLVED if and only if all its children are set SOLVED.

The procedure of updating the heuristic values and marking the edges is repeated for all ancestors of $n$ .

Properties of the AO*:

•

The heuristic values are optimistic estimates (lower bounds) of the real value of the state, where each update raises the heuristic value and reduces its imprecision relative to the real value.

•

AO* is beneficial when applied to a large state space. One reason for this is that AO* considers only states that are reachable from the initial state. Secondly, an informative heuristic function directs the focus to states that are in the course of a good policy graph (partial policy graph). As a result, AO* may find an optimal solution by exploring a small fraction of the entire state space.
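The expansion and propagation phases described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: it treats the AND/OR graph as a tree (no shared nodes), assumes that every non-terminal node has at least one successor, and assumes the update rules $h(n)=\min_{i}[c(n,n_{i})+h(n_{i})]$ for OR nodes and $h(n)=\sum_{i}p(n,n_{i})[c(n,n_{i})+h(n_{i})]$ for AND nodes. The names `ao_star`, `kind_of`, `is_terminal`, `successors` and `heuristic` are hypothetical.

```python
class Node:
    def __init__(self, state, kind, h, terminal):
        self.state, self.kind, self.terminal = state, kind, terminal
        self.h = 0.0 if terminal else h   # terminal nodes get heuristic value 0
        self.solved = terminal            # terminal nodes are SOLVED on creation
        self.children = None              # [(child, cost, prob)] once expanded
        self.best = None                  # marked child of an OR node


def ao_star(root_state, kind_of, is_terminal, successors, heuristic):
    def make(state):
        return Node(state, kind_of(state), heuristic(state), is_terminal(state))

    def update(n):
        if n.kind == "OR":
            child, cost, _ = min(n.children, key=lambda e: e[1] + e[0].h)
            n.h, n.best, n.solved = cost + child.h, child, child.solved
        else:  # AND node: expectation over all children
            n.h = sum(p * (cost + c.h) for c, cost, p in n.children)
            n.solved = all(c.solved for c, _, _ in n.children)

    root = make(root_state)
    while not root.solved:
        # Expansion phase: follow marked edges to an unexpanded non-terminal leaf.
        path, n = [root], root
        while n.children is not None:
            if n.kind == "OR":
                n = n.best
            else:
                n = next(c for c, _, _ in n.children if not c.solved)
            path.append(n)
        n.children = [(make(s), cost, p) for s, cost, p in successors(n.state)]
        # Propagation phase: update the expanded node and all its ancestors.
        for m in reversed(path):
            update(m)
    return root


# Hypothetical toy problem; successors are (state, edge cost, probability).
GRAPH = {
    "r": [("a", 1.0, 1.0), ("b", 2.0, 1.0)],
    "a": [("t1", 0.0, 0.5), ("x", 0.0, 0.5)],
    "x": [("c", 10.0, 1.0)],
    "c": [("t2", 0.0, 1.0)],
    "b": [("t3", 0.0, 1.0)],
}
KIND = {"r": "OR", "x": "OR", "t1": "OR", "t2": "OR", "t3": "OR",
        "a": "AND", "b": "AND", "c": "AND"}
root = ao_star("r", KIND.get, lambda s: s.startswith("t"), GRAPH.get,
               lambda s: 0.0)   # h = 0 is trivially admissible
```

In this toy run the node "c" is generated but never expanded: once expanding "x" raises the estimate of action "a" above that of "b", the marked edge at the root switches to "b", which illustrates how AO* can avoid exploring part of the state space.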

2.4.2 CTP and AND/OR graphs

The AND/OR graph is a natural structure for representing the state space of CTP, where a policy of CTP is represented by a policy graph. The problem solving process is a search for an optimal policy graph in the space of policy graphs. Here, the OR nodes represent the agent’s decision in the current state among all its available actions. In basic CTP, the available actions are all the moves available from a certain vertex, while in Sensing-CTP, the available actions are all the available remote sensing actions in addition to all available moves from a certain vertex (a remote sensing action is available if it is performed on an unknown edge). The AND nodes represent the actions. Since CTP is a stochastic problem, each action may result in several possible states, which are represented by the AND node’s children. The states of the environment in CTP are represented by the OR nodes. The states are the belief states of the agent at the current time step, where each belief state is represented by its $form$ (i.e. every belief state $b$ is represented by the tuple $\left\langle b\right\rangle=\left\langle Loc(b),stb(e_{1},b),...,stb(e_{n},b)\right\rangle$ ). Henceforth, all functions, predicates and lemmas presented in section 3.4 can be applied to the states in the AND/OR graph. We call the set of states that appear in the AND/OR graph the expanded states and denote it by $Z$ . Although the AND nodes do not represent states (they are called “semi states”), they maintain heuristic values, as described in the AO* algorithm, which are used for propagation. Since the environment is static, once an agent observes an edge, its status remains unchanged. A terminal state $b$ is a state whose location variable is the target ( $Loc(b)=t$ ). A node is a terminal leaf node if the state with which it is associated is terminal.

Definition 2.4.1.

A belief state $b$ is an expanded belief state if there is an OR node in the AND/OR graph that is associated with $b$ .

2.5 Models for the Canadian Traveler Problem

2.5.1 POMDP for CTP

In this section we show that CTP can be modeled by a POMDP. Let $I=(G,P,w,\hat{s},\hat{t})$ be an instance of basic CTP, and $I^{\prime}=(G,P,w,SC,\hat{s},\hat{t})$ be an instance of CTP-with-sensing, where $G=(V,E)$ . Given a POMDP $M=(S,A,Tr,R,Z,O,s_{0})$ , we show how I and I’ can be modeled by M as follows:

•

The state space S. The state space S of I (or I’) represents the possible states of the world. Each state $s$ indicates the location of the agent and the status of all edges in E. The location of the agent in state s is denoted by $Loc_{S}(s)$ , where $v=Loc_{S}(s)$ is the vertex at which the agent is situated. Each edge $e\in E$ is associated with a status variable $st_{e}\in\{O_{e},B_{e}\}$ where $st_{e}(s)=O_{e}$ indicates that $e$ is open in s, and $st_{e}(s)=B_{e}$ indicates that $e$ is blocked in s. Thus, we define the state space S to be $V\times\prod_{e\in E}{\{O_{e},B_{e}\}}$ .

•

The action set A. In the basic CTP, the set of actions A includes only one type of action, $a=Move(e)$ , in which the agent moves along an edge $e\in E$ if $e$ is open. In the CTP-with-sensing, the set of actions A includes, in addition to the Move actions, the sensing actions $a=Sense(e)$ , in which the agent senses an edge $e\in E$ . A sensing action can be performed from any vertex $v\in V$ .

•

The transition function Tr. Given $s,s^{\prime}\in S$ , and $a=move(v_{i},v_{j})$ , we define Tr by the following: $Tr(s,a,s^{\prime})=1$ if it satisfies the following:

–

For all edges $e\in E$ , the status of the edge in s is equal to its status in s’, i.e. $st_{e}(s)=st_{e}(s^{\prime})$ .

–

$v_{i}=Loc_{S}(s)$ and $v_{j}=Loc_{S}(s^{\prime})$ .

–

The edge e=( $v_{i},v_{j}$ ) is open in s, i.e. $st_{e}(s)=O_{e}$ .

Otherwise Tr(s,a,s’)=0.

If $a=Sense(e)$ where $e\in E$ we get $Tr(s,a,s^{\prime})=1$ if and only if $s=s^{\prime}$ since the Sense action does not change the state of the environment.

•

The reward(cost) function R. Given $s,s^{\prime}\in S$ , we define R as follows: In case that $a=move(e)$ then,

[TABLE]

In case that $a=Sense(e)$ then,

[TABLE]

Notation 2.5.1.

Let X be a set. Denote the power set of X by $\mathcal{P}(X)$ .

•

The observation set Z. Let $Z^{\prime}=\{O_{e},B_{e}\}^{E}$ . We define Z to be the power set of Z’, namely $Z=\mathcal{P}(Z^{\prime})$ .

•

The observation function O. Given $s\in S$ , $a\in A$ and $z\in Z$ we define O as follows: In case that $a=move(e)$ where $e=(v_{i},v_{j})$ , the only observations received are the statuses of the edges incident to vertex $v_{j}$ ; then,

[TABLE]

In case that $a=sense(e)$ , the only observation received is the status of the sensed edge e; then,

[TABLE]

•

$s_{0}$ is the initial state.
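To make the deterministic transition function of this model concrete, the following Python sketch encodes a state as a pair of the agent's location and the set of blocked edges, and returns the unique successor state for which $Tr(s,a,s^{\prime})=1$ . The encoding (tuples, `frozenset` edges, the action tags `"move"` and `"sense"`) and the function names `transition` and `tr` are hypothetical choices made for illustration only.

```python
def transition(state, action):
    """Return the unique s' with Tr(s, a, s') = 1, or None if there is none."""
    location, blocked = state          # blocked: frozenset of blocked edges
    kind, edge = action
    if kind == "sense":
        return state                   # sensing never changes the world state
    v_i, v_j = edge                    # kind == "move": traverse (v_i, v_j)
    if location == v_i and frozenset(edge) not in blocked:
        return (v_j, blocked)          # edge statuses stay unchanged
    return None                        # agent not at v_i, or the edge is blocked


def tr(state, action, next_state):
    """Tr(s, a, s') as a 0/1 value."""
    return 1 if transition(state, action) == next_state else 0


# Hypothetical instance: edge (a, t) is blocked, everything else is open.
blocked = frozenset({frozenset(("a", "t"))})
s0 = ("s", blocked)
s1 = transition(s0, ("move", ("s", "a")))
```

Because the edge statuses are part of the state and never change, all stochasticity of CTP in this model lies in the agent's uncertainty about which state it is in, not in the transitions themselves.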

Notation 2.5.2.

The optimal policy of $M_{S}$ is denoted by $\pi^{*}$ .


2.5.2 Belief State for Representing the Environment of CTP

A belief state, which is defined as a distribution over all possible states, is a representation of the agent’s knowledge about the environment. In CTP, the belief states can be represented by the location of the agent and the status of each edge in the graph.

Definition 2.5.4.

We say that status of edge $e$ is:

•

“known to be blocked” if $e$ has been already sensed and found to be blocked.

•

“known to be open” if $e$ has been already sensed and found to be open.

•

“unknown” if $e$ has not been sensed.

Definition 2.5.5.

Define $stb:E\times B\rightarrow\{Open,Blocked,Unknown\}$ as follows:

$stb(e,b)$ is the edge status of $e$ in belief state $b$ , where

[TABLE]

Definition 2.5.6.

Define $Loc:B\rightarrow V$ as the location of an agent in a belief state, where $Loc(b)$ outputs the physical location of an agent that is in belief state $b$ ,i.e. $Loc(b)=Loc_{S}(s)$ where $s$ is an arbitrary state $s\in S$ which satisfies $b(s)>0$ .

Note that definition 2.5.6 assumes that there cannot be two states $s_{1},s_{2}$ which satisfy $b(s_{1})>0$ , $b(s_{2})>0$ such that $Loc_{S}(s_{1})\neq Loc_{S}(s_{2})$ , since by definition the agent always knows its own location, and thus, for every belief state $b$ , if there exists $s\in S$ for which $Loc(b)\neq Loc_{S}(s)$ then $b(s)=0$ .

Thus, we can define an alternative way for representing a belief state,

Definition 2.5.7.

Let $n=|E|$ . The form of $b$ , denoted by $\left\langle b\right\rangle$ , is defined to be the tuple $\left\langle b\right\rangle=\left\langle Loc(b),stb(e_{1},b),stb(e_{2},b),...,stb(e_{n},b)\right\rangle$ .

Definition 2.5.8.

Let $b$ be a belief state, we define the following sets:

1. $Unknown(b)$ is the set of all edges $e\in E$ for which $stb(e,b)=Unknown$ .

2. $Blocked(b)$ is the set of all edges $e\in E$ for which $stb(e,b)=Blocked$ .

3. $Open(b)$ is the set of all edges $e\in E$ for which $stb(e,b)=Open$ .

Let $b$ be a belief state. Then, there is a mapping from $\left\langle b\right\rangle$ to $b$ . Namely, for every $s\in S$ there is a mapping from $\left\langle b\right\rangle=\left\langle Loc(b),stb(e_{1},b),stb(e_{2},b),...,stb(e_{n},b)\right\rangle$ to $b(s)$ as follows:

[TABLE]

Where $G:E\times S\times B\rightarrow\{0,1\}$ is defined as follows:

$G(e,s,b)=0$ if one of the following is satisfied:

1. $st(e,s)=Open,stb(e,b)=Blocked$

2. $st(e,s)=Blocked,stb(e,b)=Open$

3. $Loc(b)\neq Loc_{S}(s)$

otherwise $G(e,s,b)=1$ .

Corollary 2.5.9.

Since there is a mapping from $\left\langle b\right\rangle$ to $b$ , we can use the form $\left\langle b\right\rangle$ instead of the belief state $b$ itself, for representing the belief state of an agent.

Definition 2.5.10.

Let $P_{b}(e,b)$ be the probability that edge $e$ is blocked given that the agent is in belief state $b$ . Namely $P_{b}(e,b)=\sum_{i}{b(s_{i})}$ such that $st(e,s_{i})=Blocked$ .

In the basic variant of CTP, the probabilities associated with the edges are independent, and hence, as long as $stb(e,b)=Unknown$ , we have $P_{b}(e,b)=P(e)$ .

Corollary 2.5.11.

From definition 2.5.10 we get that the probability that edge $e$ is blocked given the agent is in belief state $b$ is given by:

[TABLE]
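The relation between the form $\left\langle b\right\rangle$ , the distribution $b$ and $P_{b}(e,b)$ can be checked numerically. The Python sketch below is an illustration under the independence assumption of the basic CTP: it expands a form into an explicit distribution over states by giving zero weight to states that contradict a known edge status or the known location, and the prior product weight otherwise, and then evaluates Definition 2.5.10 by summation. The function names `belief_from_form` and `prob_blocked`, the tuple encoding of edges and the example probabilities are hypothetical.

```python
from itertools import product


def belief_from_form(loc, status, p_blocked, vertices):
    """Expand a belief form into an explicit distribution {(v, world): b(s)}.

    status:    {edge: "Open" | "Blocked" | "Unknown"}
    p_blocked: {edge: prior probability that the edge is blocked}
    Assumes independent edges (basic CTP).
    """
    edges = sorted(status)
    belief = {}
    for v in vertices:
        for world in product(("Open", "Blocked"), repeat=len(edges)):
            weight = 1.0 if v == loc else 0.0      # the location is always known
            for e, st in zip(edges, world):
                if status[e] == "Unknown":
                    weight *= p_blocked[e] if st == "Blocked" else 1.0 - p_blocked[e]
                elif status[e] != st:
                    weight = 0.0                   # contradicts a known status
            if weight > 0.0:
                belief[(v, world)] = weight
    return edges, belief


def prob_blocked(edge, edges, belief):
    """P_b(e, b): total belief mass of the states in which `edge` is blocked."""
    i = edges.index(edge)
    return sum(w for (_, world), w in belief.items() if world[i] == "Blocked")


# Hypothetical example with one open, one blocked and one unknown edge.
status = {("s", "a"): "Open", ("a", "c"): "Blocked", ("a", "t"): "Unknown"}
prior = {("s", "a"): 0.3, ("a", "c"): 0.9, ("a", "t"): 0.25}
edges, belief = belief_from_form("s", status, prior, ["s", "a", "c", "t"])
```

For this example the sum gives 0 for the edge known to be open, 1 for the edge known to be blocked, and the prior 0.25 for the unknown edge, which is consistent with the statement that $P_{b}(e,b)=P(e)$ while $e$ is unknown.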

2.5.3 Belief MDP for CTP

Let $M=(S,A,Tr,R,Z,O,s_{0})$ be a POMDP of an instance $I=(G,P,w,\hat{s},\hat{t})$ of CTP (or $I^{\prime}=(G,P,w,SC,\hat{s},\hat{t})$ of CTP-with-sensing), and let $B$ be the belief state space of M. We define a belief MDP $M_{S}=(B,A,Tr,R,b_{0})$ of I, based on M, where the belief state space B is over the state space S. CTP is a special case of POMDP (called Det-POMDP), in which the transition function Tr and the reward function R can be simplified as follows:

•

The transition function Tr. In general belief MDP, given $b,b^{\prime}\in B$ , $a\in A$ , Tr is given by:

[TABLE]

Given $a=move(v_{i},v_{j})$ , we define $\hat{E}$ to be the set of all edges incident to $v_{j}$ that are unknown in $b$ and known in $b^{\prime}$ (the edges that are revealed by the local sensing), i.e.

[TABLE]

Then $Tr(b,a,b^{\prime})>0$ if and only if,

–

For all $e\in E\backslash\hat{E}$ , $stb(e,b)=stb(e,b^{\prime})$ . The status of the edges does not change, nor does the information about any unrevealed edge that is not sensed.

–

$v_{i}=Loc(b)$ and $v_{j}=Loc(b^{\prime})$ .

–

The edge e=( $v_{i},v_{j}$ ) is open in b, i.e. $stb(e,b)=Open$ . The edge has to be open in order to traverse it.

In this case,

[TABLE]

Given $a=Sense(e^{\prime})$ , $Tr(b,a,b^{\prime})>0$ if and only if,

–

$stb(e^{\prime},b^{\prime})\neq Unknown$ . The edge e’ is known after performing Sense(e’).

–

For all $e\in E\backslash\{e^{\prime}\}$ , $stb(e,b)=stb(e,b^{\prime})$ . The state is not affected by the Sense action, and the only information received is the status of e’.

–

$Loc(b)=Loc(b^{\prime})$ . The location is not affected by the Sense action.

In this case,

[TABLE]

•

The reward function R. In general, the reward function is defined by $R(b,a,b^{\prime})=\sum_{s^{\prime}\in S}{b^{\prime}(s^{\prime})}\sum_{s\in S}{b(s)R(s,a,s^{\prime})}$ . Denote the action cost of $a$ by $C(a)$ , where $C(a)=w(e)$ if $a=Move(e)$ and $C(a)=SC(e)$ if $a=Sense(e)$ . Hence,

[TABLE]

We define $R(b,a)=\sum_{b^{\prime}\in B}{R(b,a,b^{\prime})Tr(b,a,b^{\prime})}$ . Therefore, $R(b,a)=C(a)$ if there exists ${b^{\prime}\in B}$ such that $Tr(b,a,b^{\prime})>0$ . Otherwise $R(b,a)=0$ . Note that in case that $a=Sense(e)$ , there always exists $b^{\prime}$ reachable from $b$ for which $Tr(b,a,b^{\prime})>0$ , thus $R(b,a)=SC(e)$ always holds.

•

$b_{0}$ is the initial belief state.

Definition 2.5.12.

We say that action $a$ can be performed in belief state $b$ if there is a belief state $b^{\prime}$ such that $Tr(b,a,b^{\prime})>0$
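The Move transition of the belief MDP can be illustrated by enumerating the successor belief states of a move. The Python sketch below is a minimal sketch assuming independent edges (basic CTP), so that the probability of each successor is the product of the prior probabilities of the revealed edge statuses; the encoding of a belief state as a `(location, status dictionary)` pair and the names `move_successors` and `p_blocked` are hypothetical. An empty result corresponds to an action that cannot be performed in the sense of Definition 2.5.12.

```python
from itertools import product


def move_successors(belief, move, p_blocked):
    """Return [(b', Tr(b, a, b'))] for a = move(v_i, v_j) in belief form.

    belief:    (location, {frozenset({u, v}): "Open" | "Blocked" | "Unknown"})
    p_blocked: {edge: prior probability that the edge is blocked}
    Assumes independent edges; labels and encoding are illustrative only.
    """
    loc, status = belief
    v_i, v_j = move
    if loc != v_i or status.get(frozenset(move)) != "Open":
        return []                      # the action cannot be performed in b
    # Edges incident to v_j that are still unknown are revealed on arrival.
    revealed = sorted((e for e in status if v_j in e and status[e] == "Unknown"),
                      key=sorted)
    successors = []
    for outcome in product(("Open", "Blocked"), repeat=len(revealed)):
        new_status, prob = dict(status), 1.0
        for e, st in zip(revealed, outcome):
            new_status[e] = st
            prob *= p_blocked[e] if st == "Blocked" else 1.0 - p_blocked[e]
        successors.append(((v_j, new_status), prob))
    return successors


# Hypothetical instance: moving from s to a reveals the edges (a, t) and (a, c).
sa, at, ac = frozenset("sa"), frozenset("at"), frozenset("ac")
b0 = ("s", {sa: "Open", at: "Unknown", ac: "Unknown"})
prior = {sa: 0.0, at: 0.25, ac: 0.5}
succ = move_successors(b0, ("s", "a"), prior)
```

Each revealed edge doubles the number of successor belief states, which gives a feel for why the belief state space grows exponentially with the number of unknown edges.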

2.6 Related Work

2.6.1 Different Variation of CTP

The Canadian traveler problem is known to be #P-hard [6]. In the absence of approximation algorithms, special classes of graphs have been suggested in which the exact solution can be found in polynomial time. [2] investigated the Recoverable CTP, where each vertex is associated with a specific recovery time to reopen any blocked edge that is incident to it. When an agent finds a blocked edge $e$ , it can either traverse another edge or wait a period of time and check whether $e$ has been opened. The basic CTP is a special case of the Recoverable CTP where all the recovery times are infinitely large. There are two variations of the Recoverable CTP, deterministic and stochastic. In the deterministic variation, the assumption is that the number of edges that may be blocked is bounded. In the stochastic variation, each edge is associated with a probability of being blocked, and the recovery time is assumed to be not long relative to the travel time. The two cases were proved to be polynomial in the number of edges and vertices and in the maximal number of blocked edges. [5] investigated a CTP variant where the environment is dynamic, in the sense that the status of each edge $e\in E$ is generated randomly with a given probability $P(e)$ whenever the agent reaches a vertex incident to $e$ . This variant can be modeled by an MDP, where the states represent only the current location of the agent. Since an MDP is solvable in polynomial time, this variant is solvable in polynomial time as well. Notice that basic CTP is much harder to solve, since the edge statuses remain fixed and thus the state space is exponentially larger (in the number of edges). Nikolova et al. have shown that CTP on a directed acyclic graph (DAG) can be solved in polynomial time by using dynamic programming.

2.6.2 Disjoint Path Graphs

A disjoint path graph is an undirected graph $G=(V,E)$ with source $s\in V$ and destination $t\in V$ such that all paths $p_{1},...,p_{k}$ in G are between $s$ and $t$ , and these paths are pairwise disjoint. [3] have shown that CTP on a disjoint path graph is solvable in polynomial time. The proof is based on the property that the optimal policy is committing. This guarantees that whenever an agent follows a path, the optimal action is to continue along the path until reaching the target, unless it hits a blocked edge. The optimal policy of CTP on a disjoint path graph is to follow the paths in the order of their $D_{i}$ values ( $D_{i}$ is a parameter associated with each path in G). That is, the optimal policy is to travel the path with the minimal $D_{i}$ until reaching the target, unless the path is blocked. If the path is blocked, then return to $s$ and travel the path with the second minimal $D_{i}$ , and so on. $D_{i}$ is defined as,

[TABLE]

Where $BC_{i}$ denotes the backtracking cost of path i, which is the cost of traversing path i until hitting a blocked edge and then returning back to $s$ when the path is not traversable, or 0 when the path is traversable. The expected value of $BC_{i}$ is

[TABLE]

Where $W_{i,j}=\sum_{m=1}^{j}{w(e_{i,m})}$ .
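The behaviour of a committing policy can be illustrated by computing its expected cost directly from the description above, without relying on the index $D_{i}$ . The Python sketch below assumes independent edges and assumes that a blocked edge $e_{i,j}$ is observed on arrival at $v_{i,j-1}$ , so that the backtracking cost is $2W_{i,j-1}$ ; probability mass on the event that every path is blocked is ignored. The brute-force search over path orders is only for illustration on tiny instances, and the names `path_stats`, `expected_cost` and `best_order` are hypothetical.

```python
from itertools import permutations


def path_stats(path):
    """path: [(weight, p_blocked), ...] ordered from s to t.

    Returns (q, w, ebc): the probability that the whole path is open, the total
    path weight, and the expected backtracking cost E[BC_i].
    """
    prefix_open = 1.0   # probability that all earlier edges on the path are open
    walked = 0.0        # W_{i,j-1}: weight walked before observing edge j
    ebc = 0.0
    for weight, p_blocked in path:
        ebc += prefix_open * p_blocked * 2.0 * walked   # go there and come back
        prefix_open *= 1.0 - p_blocked
        walked += weight
    return prefix_open, walked, ebc


def expected_cost(paths, order):
    """Expected cost of trying the paths in `order`, committing to each one."""
    reach, total = 1.0, 0.0     # reach: probability that all earlier paths failed
    for i in order:
        q, w, ebc = path_stats(paths[i])
        total += reach * (ebc + q * w)
        reach *= 1.0 - q
    return total


def best_order(paths):
    return min(permutations(range(len(paths))),
               key=lambda order: expected_cost(paths, order))


# Hypothetical instance: a short risky path and a long path that is always open.
paths = [[(1.0, 0.5)], [(10.0, 0.0)]]
order = best_order(paths)
```

In this instance, trying the short risky path first is cheaper in expectation than committing to the safe long path immediately, because its first edge is observed at $s$ and a failure costs nothing.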

Another variation of CTP on disjoint path graphs is when the edges cannot be blocked but instead have two possible finite costs: a cheap one and an expensive one [5]. A simple case of this variation is when the edge cost is binary, i.e., 0 or 1. In this case the optimal policy is to explore all the edges with cost 0 on each path until reaching an edge with cost 1 on that path, and then return to the path with the fewest unexplored edges and follow it until reaching the target. A more general case of this variation is when the edges are associated with the cost 1 or $K$ . In this case the optimal policy has the property that once an edge with cost $K$ has been crossed, it is optimal to continue along the same path until reaching the target. Taking advantage of the special structure of the policy induced by this property allows one to define an MDP with a concise representation that decides in what order to explore the paths, and how many, before committing to a path. These two cases were proved to be solvable in polynomial time.

2.6.3 CTP with Sensing

The CTP with sensing is at least as hard as the basic CTP, since a simple reduction can be constructed from any instance of basic CTP: the graph of the basic CTP is used as the graph of the CTP with sensing, and the sensing costs of all edges are set large enough that sensing an edge is never worthwhile. As such, the expected costs of the two optimal policies are equal.

Heuristic search algorithms

In order to facilitate the search for a solution of CTP with sensing, some heuristic based algorithms have been suggested. These algorithms do not provide an optimal solution; however, they may be much simpler. [3] have suggested the FSSN algorithm, which is based on the free space assumption heuristic. The free space assumption [4] assumes that edges are traversable unless specifically known otherwise. FSSN plans a path $p$ from some vertex $v\in V$ to $t$ that is the shortest path under the free space assumption. The agent can either attempt to traverse $p$ without sensing, or may decide to interleave sensing actions into the movement actions, according to a sensing policy that is embedded in the algorithm.

A number of sensing policies for FSSN have been suggested:

Never Sense is a brute force policy that never senses any remote edge. This policy never incurs any sensing cost, but it may lead to an increased travel cost.

Always Sense is a brute force policy that senses all the unknown edges in the path before it moves along it.

Value of information is a policy that decides which edges to sense according to their value of information.
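The planning step under the free space assumption can be sketched as a standard shortest-path computation in which unknown edges are treated as open and only edges known to be blocked are removed. The Python sketch below uses Dijkstra's algorithm for this purpose; it illustrates only the path-planning step and not the sensing policies, and the function name `free_space_path` and the dictionary encoding are hypothetical.

```python
import heapq


def free_space_path(weights, status, source, target):
    """Shortest path assuming every edge not known to be blocked is open.

    weights: {(u, v): weight} for undirected edges
    status:  {(u, v): "Open" | "Blocked" | "Unknown"}; missing means "Unknown"
    Returns (cost, path); cost is inf and path is [] if target is unreachable.
    """
    adjacency = {}
    for (u, v), w in weights.items():
        if status.get((u, v), "Unknown") == "Blocked":
            continue                       # only known-blocked edges are removed
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))

    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue                       # stale queue entry
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))

    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical instance with a cheap two-edge route and an expensive direct edge.
weights = {("s", "a"): 1.0, ("a", "t"): 1.0, ("s", "t"): 5.0}
optimistic = free_space_path(weights, {}, "s", "t")
replanned = free_space_path(weights, {("a", "t"): "Blocked"}, "s", "t")
```

Calling the function again after an edge has been found blocked, as in the second call, corresponds to replanning from the current vertex with the updated knowledge.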

2.6.4 Propagating AO*

AO* harnesses the benefits of heuristic search to avoid searching states that are undesirable. However, in many situations AO* examines far more states than necessary. Propagating AO* (PAO*) [1] is an extension of AO* that goes one step further in facilitating the search. PAO* propagates the heuristic values on a larger scale, which minimizes the expansion of non-terminal nodes. PAO* is based on a specific variation of the AO* algorithm; Ferguson et al. constructed an algorithm that solves a variation of the CTP where most of the graph (edges) is observable, such that only a single unknown edge (called a pinch point) can be incident to a vertex (in the original paper the pinch points are called “faces”). As such, any chance node (AND node) has at most two children, which represent a traversable edge and a blocked edge. PAO* is described as follows: The expansion phase proceeds exactly as in AO*, where PAO* grows the best partial policy graph by expanding the non-terminal leaf nodes and assigning heuristic values to their children. Similarly to AO*, PAO* propagates the heuristic values up to the root. However, PAO* also propagates the heuristic values sideways and downwards to the children. Furthermore, the algorithm takes advantage of the fact that the AND node has only two children (traversable and blocked), such that the heuristic value of the parent node should never be less than the value of the traversable child. Thus, PAO* propagates the heuristic value of the parent to the traversable child if the heuristic value of the traversable child is higher. Similarly, the heuristic value of the parent should never be greater than the value of the blocked child. Therefore, PAO* propagates the value to the non-traversable child in case that the heuristic value of the parent is higher.

Chapter 3 Theoretical Analysis of CTP

3.1 CTP with Dependencies

Theorem 3.1.1.

CTP with dependencies is at least as hard as CTP with sensing.

Proof outline: By reduction from CTP-with-sensing to CTP-with-dependencies.

Proof. Let I=(G,W,C,SC,s,t) be an instance of CTP-with-sensing. We construct an equivalent instance I’=(G’,W’,C’,s’,t’) of CTP-with-dependencies and show that there is a one-to-one equivalence between I and I’. The construction of I’ is as follows: G’ contains G entirely, and in addition, each vertex in G is attached to two-edge dead-end paths that simulate the sensing operations of I, one path for each possible sensing operation in I.

Formally, the construction of I’ is as follows:

First, we construct $\hat{G}(\hat{V},\hat{E})$ by copying the graph G(V,E) using the following functions:

•

$g_{v}:V\rightarrow\hat{V}$ is a bijection function that copies V into $\hat{V}$ such that for each $v_{i}\in V$ , $g_{v}(v_{i})$ is the copied vertex of $v_{i}$ .

•

$g_{e}:E\rightarrow\hat{E}$ is a bijection function that copies E into $\hat{E}$ such that for each $e_{j}\in E$ , $g_{e}(e_{j})$ is the copied edge of $e_{j}$ .

Let $\hat{V}$ be the set of all the vertices that were copied from V, meaning $\hat{V}=\bigcup_{1\leq i\leq n}{g_{v}(v_{i})}$ . Let $\hat{E}$ be the set of all the edges that were copied from E, meaning $\hat{E}=\bigcup_{1\leq j\leq m}{g_{e}(e_{j})}$ .

Notation 3.1.2.

$\hat{v}_{i}\in\hat{V}$ denotes the copied vertex $g_{v}(v_{i})$ .

Notation 3.1.3.

$\hat{e}_{i}\in\hat{E}$ denotes the copied edge $g_{e}(e_{i})$ .

We construct a new graph G’(V’,E’) by extending $\hat{V}$ and $\hat{E}$ using the following functions:

•

$f_{v1}:\hat{V}\times\hat{E}\rightarrow V^{\prime}$ and $f_{v2}:\hat{V}\times\hat{E}\rightarrow V^{\prime}$ are one-to-one functions that generate a vertex for each element in $\hat{V}\times\hat{E}$ . Meaning, given $v_{i}\in\hat{V}$ and $e_{j}\in\hat{E}$ , $f_{v1}(v_{i},e_{j})=v_{ij1},f_{v2}(v_{i},e_{j})=v_{ij2}$ .

•

$f_{e1}:\hat{V}\times\hat{E}\rightarrow E^{\prime}$ and $f_{e2}:\hat{V}\times\hat{E}\rightarrow E^{\prime}$ are one-to-one functions that generate an edge for each element in $\hat{V}\times\hat{E}$ such that, given $v_{i}\in\hat{V}$ and $e_{j}\in\hat{E}$ , $f_{e1}(v_{i},e_{j})=e_{ij1},f_{e2}(v_{i},e_{j})=e_{ij2}$ and in addition, $e_{ij1}=(v_{i},v_{ij1})$ and $e_{ij2}=(v_{ij1},v_{ij2})$ .

Let graph $G^{\prime}(V^{\prime},E^{\prime})$ defined as follows:

[TABLE]

where,

[TABLE]

Notation 3.1.4.

Given $e_{ij1}\in E_{ij1},e_{ij2}\in E_{ij2}$ , we define a two-edge dead-end path $p_{ij}=\left\langle e_{ij1},e_{ij2}\right\rangle$ .

Note that $G^{\prime}(V^{\prime},E^{\prime})$ can be viewed as an “attachment” of the paths $\bigcup_{1\leq j\leq|E|}{p_{ij}}$ to each $g_{v}(v_{i})\in\hat{V}$ .

$W(X,Y)$ is the Bayesian network that is associated with edges in G where X is the set of nodes and Y is the set of arcs. Similarly $W^{\prime}(X^{\prime},Y^{\prime})$ is the Bayesian network that is associated with edges in G’ where X’ is the set of nodes and Y’ is the set of arcs. Let x be a node in X, and x’ be a node in X’. We define W’ by W as follows:

•

For each $x\in X$ , $x^{\prime}_{g_{e}(e_{i})}=0\Leftrightarrow x_{e_{i}}=0$ ( $st_{e_{i}}=st_{g_{e}(e_{i})}$ ).

•

For each $e^{\prime}_{ij1}\in E_{ij1}$ , $P(x^{\prime}_{e^{\prime}_{ij1}}=0)=1$ , i.e. all edges in $E_{ij1}$ are open.

•

For each j, $x^{\prime}_{e^{\prime}_{1j2}}=0,x^{\prime}_{e^{\prime}_{2j2}}=0,...,x^{\prime}_{e^{\prime}_{nj2}}=0\Leftrightarrow x^{\prime}_{g_{e}(e_{j})}=0$ , i.e. for each j, all edges $e^{\prime}_{ij2}\in E^{\prime}_{ij2}$ are open if and only if the copied edge $g_{e}(e_{j})$ is open.

The weight function $C^{\prime}$ is defined by:

1. $\forall e_{i}\in E,C^{\prime}(g_{e}(e_{i}))=C(e_{i})$ .

2. $\forall e_{ij1}\in E^{\prime}_{ij1},C^{\prime}(e_{ij1})=\frac{SC(e_{j})}{2}$ where $SC(e_{j})$ denotes the sensing cost of $e_{j}$ .

3. $\forall e_{ij2}\in E^{\prime}_{ij2},C^{\prime}(e_{ij2})=\infty$ .

The reduction can be generated in polynomial time, since $|E^{\prime}|=|E^{\prime}_{ij1}|+|E^{\prime}_{ij2}|+|\hat{E}|$ , where $|E^{\prime}_{ij1}|=|E^{\prime}_{ij2}|=|E|\times|V|$ and $|\hat{E}|=|E|$ , therefore $|E^{\prime}|=|E|+2|E|\cdot|V|=|E|(1+2|V|)$ . Furthermore, $|X^{\prime}|=|E^{\prime}|$ since each node in X’ is associated with an edge in E’, and $|Y^{\prime}|=|V||E|$ since each node in X’ that is associated with an edge in $\hat{E}$ is connected to $|V|$ nodes.
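The graph part of this construction can be sketched in Python to check the size bound. The sketch attaches, for every vertex $v_{i}$ and every edge $e_{j}$ , a dead-end path of two edges: the first is always open and costs $SC(e_{j})/2$ , so that walking to $v_{ij1}$ and back costs exactly $SC(e_{j})$ , and the second has infinite cost and mirrors the status of $e_{j}$ , so that it is observed at $v_{ij1}$ but never traversed. The function name `build_reduction`, the `("aux", i, j, k)` vertex labels and the returned dictionaries are hypothetical; the Bayes network $W^{\prime}$ is represented only by the `mirrors` map.

```python
def build_reduction(vertices, edges, cost, sensing_cost):
    """Build the graph of I' from an instance of CTP-with-sensing.

    Returns (new_vertices, new_cost, always_open, mirrors) where
    new_cost:    {edge: weight in G'}
    always_open: set of edges e_ij1 that are open with probability 1
    mirrors:     {e_ij2: e_j} meaning e_ij2 is open iff the copy of e_j is open
    """
    new_vertices = list(vertices)
    new_cost = {e: cost[e] for e in edges}        # copied edges keep their weight
    always_open, mirrors = set(), {}
    for i, v in enumerate(vertices):
        for j, e in enumerate(edges):
            v1, v2 = ("aux", i, j, 1), ("aux", i, j, 2)
            new_vertices += [v1, v2]
            e1, e2 = (v, v1), (v1, v2)
            new_cost[e1] = sensing_cost[e] / 2.0  # there and back costs SC(e_j)
            new_cost[e2] = float("inf")           # observed, never worth crossing
            always_open.add(e1)
            mirrors[e2] = e
    return new_vertices, new_cost, always_open, mirrors


# Hypothetical instance with |V| = 3 and |E| = 2.
V = ["s", "a", "t"]
E = [("s", "a"), ("a", "t")]
C = {("s", "a"): 1.0, ("a", "t"): 2.0}
SC = {("s", "a"): 0.4, ("a", "t"): 0.6}
V2, C2, open_edges, mirrors = build_reduction(V, E, C, SC)
```

For this instance the construction yields $|E|(1+2|V|)=14$ edges, in agreement with the count derived above.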

Let $M_{1}=(S_{1},A_{1},Z_{1},Tr_{1},O_{1},R_{1})$ be a POMDP that models I, where $S_{1},A_{1},Z_{1},Tr_{1},O_{1},R_{1}$ are finite sets of states, actions, observations, transition functions, observation functions and reward functions respectively. Similarly, let $M_{2}=(S_{2},A_{2},Z_{2},Tr_{2},O_{2},R_{2})$ be a POMDP that models I’, where $Z_{2}$ is the set of observations in I’, $S_{2}$ is a special subset of the state set in I’, and $A_{2}$ is a special meta-action set in I’ (a set of sequences of actions in I’) which will be defined later. $Tr_{2},O_{2},R_{2}$ are the transition functions, observation functions and reward functions in I’. Let $\pi_{1}^{*}$ be the optimal policy of I and $\pi_{2}^{*}$ be the optimal policy of I’. In order to prove theorem 3.1.1, it suffices to show that $Exp(\pi_{1}^{*})=Exp(\pi_{2}^{*})$ . In the remainder of this proof we prove this property by showing that $M_{1}$ is equivalent to $M_{2}$ and that $M_{2}$ actually models I’.

We want to define the subset $S_{2}\subset S^{\prime}$ that contains all states in which the agent is located in a “copied” vertex. Formally,

Definition 3.1.5.

Given $S^{\prime}$ , we define $S_{2}$ to be the subset of S’ such that $s_{i}\in S_{2}$ if and only if $s_{i}\in S^{\prime}$ and $Loc(s_{i})\in\hat{V}$ . Meaning $S_{2}=\{s_{i}|s_{i}\in S^{\prime},Loc(s_{i})\in\hat{V}\}$ .

Lemma 3.1.6.

Let $\tilde{V}$ be the location space of $S_{2}$ , (i.e. $\tilde{V}=\{v|v=Loc(s_{2}),s_{2}\in S_{2}\}$ ) then $\tilde{V}=\hat{V}$ .

Proof.

$\Rightarrow$ $\tilde{V}\subseteq\hat{V}$ . By the definition of $S_{2}$ , every $s_{i}\in S_{2}$ satisfies $Loc(s_{i})\in\hat{V}$ , hence every element of $\tilde{V}$ is in $\hat{V}$ .

$\Leftarrow$ $\hat{V}\subseteq\tilde{V}$ . V’ contains all possible locations of the agent in G’, where $\tilde{V}$ is the subset of V’ which contains all possible locations in $\hat{G}$ . Thus every element $v_{i}\in\hat{V}$ is in $\tilde{V}$ ∎

Corollary 3.1.7.

$\hat{V}$ is the location space of $S_{2}$ .

Now, we want to show that $S_{2}$ and $S_{1}$ are equivalent in the sense that there exists a one-to-one correspondence between $S_{2}$ and $S_{1}$ . In order to show that, we need the following definitions and statements,

Definition 3.1.8.

We define $EstatusSet(S)$ to be the set of $Estatus(s_{i})$ of all elements $s_{i}\in S$ , meaning $EstatusSet(S)=\{Estatus(s_{i})|s_{i}\in S\}$ .

In fact, $EstatusSet(S)$ is the set of all the possible status vectors of E, as such $EstatusSet(S)=\{Open,Blocked\}^{|E|}$ .

Lemma 3.1.9.

There exists a one-to-one correspondence between V and $\hat{V}$ .

Proof.

Since $g_{v}$ is a bijection, there exists a one-to-one correspondence between $v_{i}\in V$ and $g_{v}(v_{i})\in\hat{V}$ ∎

Lemma 3.1.10.

There exists a one-to-one correspondence between $EstatusSet(S_{1})$ and $EstatusSet(S_{2})$ .

Proof.

There exists a one-to-one correspondence between $EstatusSet(S_{1})$ and $EstatusSet(S^{\prime})$ since for all $s_{1}\in S_{1},s^{\prime}\in S^{\prime}$ , each element $Estatus(s_{1})\in EstatusSet(S_{1})$ can be mapped into a different element $Estatus(s^{\prime})\in EstatusSet(S^{\prime})$ . This is due to the following facts:

1. (injective) According to the definition of W’, $\forall{i}\;x^{\prime}_{g_{e}(e_{i})}=0\Leftrightarrow x_{e_{i}}=0$ ; in other words, every edge $e_{i}\in E$ has the same status as its copied edge $g_{e}(e_{i})$ , and thus there exists a one-to-one correspondence between each set of edge statuses $Estatus(s_{1})$ and each set of edge statuses $ste(\hat{E},s_{1})$ (in fact, for each $s_{1}\in S_{1}$ , $Estatus(s_{1})=ste(\hat{E},s_{1})$ ). Since for every $s^{\prime}\in S^{\prime}$ , $ste(\hat{E},s^{\prime})$ is a subset of $Estatus(s^{\prime})$ , there exists a one-to-one mapping between $EstatusSet(S_{1})$ and $EstatusSet(S^{\prime})$ .

2. (surjective) The status of the edges $ste(E^{\prime}-\hat{E},s^{\prime})$ is completely determined and unique, given the status of the edges $ste(\hat{E},s^{\prime})$ , i.e. $ste(\hat{E},s^{\prime})=\{st(g_{e}(e_{1}),s^{\prime}),st(g_{e}(e_{2}),s^{\prime}),..,st(g_{e}(e_{n}),s^{\prime})\}$ . In particular, there exists exactly one element $Estatus(s^{\prime})\in EstatusSet(S^{\prime})$ with a given edge status $ste(\hat{E},s^{\prime})$ , since each variable associated with edges in $E^{\prime}_{ij2}$ depends completely on variables associated with edges in $\hat{E}$ ( $\forall{i,j}\;x^{\prime}_{e^{\prime}_{ij2}}=0\Leftrightarrow x^{\prime}_{g_{e}(e_{j})}=0$ ) and the statuses of all edges in $E^{\prime}_{ij1}$ are predetermined to be open ( $\forall{i,j}\;P(x^{\prime}_{e^{\prime}_{ij1}}=0)=1$ ).

We are left to show that there exists a one-to-one correspondence between $EstatusSet(S^{\prime})$ and $EstatusSet(S_{2})$ . Since the location of the agent is independent of the edge statuses, we can represent $S_{2}$ as a Cartesian product $S_{2}=\hat{V}\times EstatusSet(S_{2})$ (according to corollary 3.1.7, $\hat{V}$ is the location space of $S_{2}$ ), but according to definition 3.1.5 we can also represent $S_{2}$ as $S_{2}=\hat{V}\times EstatusSet(S^{\prime})$ ; hence $EstatusSet(S^{\prime})=EstatusSet(S_{2})$ . Therefore, there exists a one-to-one correspondence between $EstatusSet(S_{1})$ and $EstatusSet(S_{2})$ . ∎

Lemma 3.1.11.

There exists one-to-one correspondence between $S_{1}$ and $S_{2}$ .

Proof.

According to lemma 3.1.9, there exists a one-to-one correspondence between V and $\hat{V}$ . According to lemma 3.1.10, there exists a one-to-one correspondence between $EstatusSet(S_{1})$ and $EstatusSet(S_{2})$ . Since $S_{1}=V\times EstatusSet(S_{1})$ and $S_{2}=\tilde{V}\times EstatusSet(S_{2})$ , where $\tilde{V}=\hat{V}$ by lemma 3.1.6, there exists a one-to-one correspondence between $S_{1}$ and $S_{2}$ . ∎

Definition 3.1.12.

*Let $v_{i},v_{j}\in V$ , $v^{\prime}_{ij1}\in V_{ij1}$ , $\hat{v}_{i},\hat{v}_{j}\in\hat{V}$ and $e_{j}\in E$ . We define $a_{2}$ in I’ to be the equivalent meta-action to action $a_{1}\in A_{1}$ ( $a_{1}\sim a_{2}$ ) if and only if:

$a_{2}=\left\{\begin{array}[]{ll}move(\hat{v}_{i},\hat{v}_{j})&\quad if\;a_{1}=move(v_{i},v_{j})\\ move(\hat{v}_{i},v^{\prime}_{ij1}),move(v^{\prime}_{ij1},\hat{v}_{i})&\quad if\;a_{1}=sense(v_{i},e_{j})\end{array}\right.$ *

Definition 3.1.13.

We define $A_{2}$ to be the set of all equivalent actions of actions in $A_{1}$ . Meaning $A_{2}=\{a_{2}|a_{2}\sim a_{1},a_{1}\in A_{1}\}$ .

Definition 3.1.14.

*We define the set $\tilde{st}_{e_{i}}$ to be the following:

$\tilde{st}_{e_{i}}=\left\{\begin{array}[]{ll}\{O_{g_{e}({e}_{i})},O_{e_{1i2}},O_{e_{2i2}},...,O_{e_{ni2}}\}&\quad if\;st_{e_{i}}=O_{e_{i}}\\ \{B_{g_{e}({e}_{i})},B_{e_{1i2}},B_{e_{2i2}},...,B_{e_{ni2}}\}&\quad if\;st_{e_{i}}=B_{e_{i}}\\ \end{array}\right.$ *

Definition 3.1.15.

*Let $Z_{2}$ be a set of observations in I’ and $Z_{1}$ be a set of observations in I. We define $z_{2}\in Z_{2}$ is the equivalent observation of $z_{1}\in Z_{1}$ (denoted by $z_{1}\sim z_{2}$ ) if and only if:

[TABLE]

Lemma 3.1.16.

The cost of action in $A_{1}$ is equal to the cost of the equivalent meta-action in $A_{2}$ .

Proof.

1. $C(move(v_{i},v_{j}))=C(move(\hat{v}_{i},\hat{v}_{j}))$ (by definition of the weight function).

2. $C(sense(v_{i},e_{j}))=SC(e_{j})$ , and $C(move(\hat{v}_{i},v^{\prime}_{ij1}))=C(move(v^{\prime}_{ij1},\hat{v}_{i}))=\frac{SC(e^{\prime}_{j})}{2}$ . Therefore, $C(sense(v_{i},e_{j}))=C(move(\hat{v}_{i},v^{\prime}_{ij1}))+C(move(v^{\prime}_{ij1},\hat{v}_{i}))$ .
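The correspondence between actions of I and meta-actions of I’ (definition 3.1.12), together with the cost identity of lemma 3.1.16, can be sketched in Python. This is only an illustration under assumptions: the tuple encoding of actions, the `hat_` and `aux_` vertex labels, and the dictionaries `w` and `SC` are hypothetical names chosen here and are not part of the construction.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: ("move", v_i, v_j) or ("sense", v_i, e_j).
Action = Tuple[str, str, str]


def hat(v: str) -> str:
    """Label of the copied vertex g_v(v) in I' (naming is an assumption)."""
    return "hat_" + v


def aux(v: str, e: str) -> str:
    """Label of the auxiliary vertex v'_{ij1} reached from the copy of v to sense e."""
    return "aux_" + v + "_" + e


def meta_action(a1: Action) -> List[Action]:
    """Sequence of move actions in I' that is equivalent to action a1 of I."""
    kind, x, y = a1
    if kind == "move":
        return [("move", hat(x), hat(y))]
    if kind == "sense":
        # Go to the auxiliary vertex and come back to the copied vertex.
        return [("move", hat(x), aux(x, y)), ("move", aux(x, y), hat(x))]
    raise ValueError("unknown action kind: " + kind)


def cost_in_I(a1: Action, w: Dict[Tuple[str, str], float], SC: Dict[str, float]) -> float:
    """Cost of a1 in I: the edge weight for a move, the sensing cost for a sense."""
    kind, x, y = a1
    return w[(x, y)] if kind == "move" else SC[y]


def meta_action_cost(a1: Action, w: Dict[Tuple[str, str], float], SC: Dict[str, float]) -> float:
    """Cost of the equivalent meta-action in I'."""
    kind, x, y = a1
    if kind == "move":
        return w[(x, y)]  # the copied edge keeps the original weight
    # Each of the two moves of the sensing detour costs half the sensing cost.
    return sum(SC[y] / 2.0 for _ in meta_action(a1))
```

For any action encoded this way, `cost_in_I` and `meta_action_cost` agree, which is the statement of lemma 3.1.16.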

Lemma 3.1.17.

Given $s_{1}\in S_{1},a_{1}\in A_{1},z_{1}\in Z_{1}$ , and $s_{2}\in S_{2},a_{2}\in A_{2},z_{2}\in Z_{2}$ such that $s_{1}\sim s_{2},a_{1}\sim a_{2},z_{1}\sim z_{2}$ , then $O_{1}(s_{1},a_{1},z_{1})=O_{2}(s_{2},a_{2},z_{2})$ .

Proof.

1. In case that $a_{1}=move(v_{j},v_{i})$ . Let $e_{i_{1}},e_{i_{2}},...,e_{i_{n}}\in E_{v_{i}}$ (the edges incident to $v_{i}$ ) and let $\tilde{E}_{v_{i}}=\{\tilde{st}_{e_{i_{1}}}\cap\tilde{st}_{e_{i_{2}}}\cap...\cap\tilde{st}_{e_{i_{n}}}\}$ . By definition of CTP, if $a_{1}=move(v_{j},v_{i})$ then the agent observes $z_{1}=st_{e_{i_{1}}},st_{e_{i_{2}}},...,st_{e_{i_{n}}}$ (the statuses of the edges incident to $v_{i}$ , which are revealed by the action). Therefore $O_{1}(s_{1},a_{1},z_{1})=1$ if $z_{1}=ste(E_{v_{i}},s_{1})$ and $O_{1}(s_{1},a_{1},z_{1})=0$ otherwise. The equivalent action of $a_{1}$ is $a_{2}=move(\hat{v}_{j},\hat{v}_{i})$ ; hence, taking $a_{2}$ , the agent directly observes $st_{\hat{e}_{i_{1}}},st_{\hat{e}_{i_{2}}},...,st_{\hat{e}_{i_{n}}}$ , but in addition, according to definition 3.1, $x^{\prime}_{e^{\prime}_{1i2}}=0,x^{\prime}_{e^{\prime}_{2i2}}=0,...,x^{\prime}_{e^{\prime}_{ni2}}=0\Leftrightarrow x^{\prime}_{\hat{e_{i}}}=0$ , hence the agent also indirectly observes edges $e_{1i2},e_{2i2},...,e_{ni2}$ . Thus, the agent’s overall observation is $z_{2}=\tilde{E}_{v_{i}}$ . Since $z_{1}\sim z_{2}$ , by definition 3.1.15 if $z_{1}=ste(E_{v_{i}},s_{1})$ then $z_{2}=\tilde{E}_{v_{i}}$ . Hence $O_{2}(s_{2},a_{2},z_{2})=1$ if $z_{1}=ste(E_{v_{i}},s_{1})$ and $O_{2}(s_{2},a_{2},z_{2})=0$ otherwise. Thus, in this case $O_{1}(s_{1},a_{1},z_{1})=O_{2}(s_{2},a_{2},z_{2})$ .

2. In case that $a_{1}=sense(v_{i},e_{j})$ , the agent observes $z_{1}=st_{e_{j}}$ ; therefore $O_{1}(s_{1},a_{1},z_{1})=1$ if $z_{1}=st_{e_{j}}$ and $O_{1}(s_{1},a_{1},z_{1})=0$ otherwise. Since $a_{1}\sim a_{2}$ , $a_{2}=(move(\hat{v}_{i},v^{\prime}_{ij1}),move(v^{\prime}_{ij1},\hat{v}_{i}))$ , where the agent observes $st_{\hat{e}_{j}}$ directly and observes $st_{e_{1j2}},st_{e_{2j2}},...,st_{e_{nj2}}$ indirectly (for the same reason as in case 1). Thus, the agent’s overall observation is $z_{2}=\tilde{st}_{e_{j}}$ . Similarly to case 1, since $z_{1}\sim z_{2}$ , by definition 3.1.15 if $z_{1}=st_{e_{j}}$ then $z_{2}=\tilde{st}_{e_{j}}$ . Hence $O_{2}(s_{2},a_{2},z_{2})=1$ if $z_{1}=st_{e_{j}}$ and $O_{2}(s_{2},a_{2},z_{2})=0$ otherwise. Thus, in this case $O_{1}(s_{1},a_{1},z_{1})=O_{2}(s_{2},a_{2},z_{2})$ as well.

Lemma 3.1.18.

Given $s_{1a}\in S_{1},s_{1b}\in S_{1},a_{1}\in A_{1},z_{1}\in Z_{1}$ , and $s_{2a}\in S_{2},s_{2b}\in S_{2},a_{2}\in A_{2},z_{2}\in Z_{2}$ such that action $a_{1}$ is taken in $s_{1a}$ and meta-action $a_{2}$ is taken in $s_{2a}$ , then if $s_{1a}\sim s_{2a},s_{1b}\sim s_{2b},a_{1}\sim a_{2}$ then $Tr_{1}(s_{1a},a_{1},s_{1b})=Tr_{2}(s_{2a},a_{2},s_{2b})$ .

Proof.

WLOG, let $Loc(s_{1a})=v_{i}$ , and $Loc(s_{1b})=v_{j}$ . Since $s_{1a}\sim s_{2a},s_{1b}\sim s_{2b}$ we get $Loc(s_{2a})=\hat{v}_{i}$ and $Loc(s_{2b})=\hat{v}_{j}$ .

•

WLOG, in case that $a_{1}=move(v_{i},v_{j})$ . If $(v_{i},v_{j})\in E$ we get $Tr_{1}(s_{1a},a_{1},s_{1b})=1$ , otherwise $Tr_{1}(s_{1a},a_{1},s_{1b})=0$ . Furthermore, if $(v_{i},v_{j})\in E$ then $(g_{v}(v_{i}),g_{v}(v_{j}))\in\hat{E}$ , which gives $Tr_{2}(s_{2a},a_{2},s_{2b})=1$ , otherwise $Tr_{2}(s_{2a},a_{2},s_{2b})=0$ . Thus, in case that $a_{1}=move(v_{i},v_{j})$ we get $Tr_{1}(s_{1a},a_{1},s_{1b})=Tr_{2}(s_{2a},a_{2},s_{2b})$ .

•

WLOG, in case that $a_{1}=sense(v_{i},e_{j})$ . Since the sense action does not change the location of the agent we get $s_{1a}=s_{1b}$ . Since $a_{1}\sim a_{2}$ , $a_{2}=move(\hat{v}_{i},v^{\prime}_{ij1}),move(v^{\prime}_{ij1},\hat{v}_{i})$ . In this case $Loc(s_{2a})=Loc(s_{2b})$ since the agent returns to its original location $\hat{v}_{i}$ . This gives $s_{2a}=s_{2b}$ and thus $Tr_{1}(s_{1a},a_{1},s_{1b})=Tr_{2}(s_{2a},a_{2},s_{2b})=1$ .

∎

Lemma 3.1.19.

Given states $s_{1}\in S_{1},s_{2}\in S_{2},a_{1}\in A_{1},a_{2}\in A_{2}$ then if $s_{1}\sim s_{2}$ and $a_{1}\sim a_{2}$ then $R_{1}(s_{1},a_{1})=R_{2}(s_{2},a_{2})$ .

Proof.

WLOG, let $Loc(s_{1})=v_{i}$ . Given that $s_{1}\sim s_{2}$ , then $Loc(s_{2})=\hat{v}_{i}$ .

•

In case that $a_{1}=move(v_{i},v_{j})$ , $R_{1}(s_{1},a_{1})=C(move(v_{i},v_{j}))$ . If $a_{1}\sim a_{2}$ then $a_{2}=move(\hat{v}_{i},\hat{v}_{j})$ . Since $C(move(v_{i},v_{j}))=C(move(\hat{v}_{i},\hat{v}_{j}))$ we get $R_{1}(s_{1},a_{1})=R_{2}(s_{2},a_{2})$ .

•

In case that $a_{1}=sense(v_{i},e_{j})$ , $R_{1}(s_{1},a_{1})=C(sense(v_{i},e_{j}))$ . If $a_{1}\sim a_{2}$ then $a_{2}=move(\hat{v}_{i},v^{\prime}_{ij1}),move(v^{\prime}_{ij1},\hat{v}_{i})$ . Since $C(sense(v_{i},e_{j}))=C(move(\hat{v}_{i},v^{\prime}_{ij1}))+C(move(v^{\prime}_{ij1},\hat{v}_{i}))$ we get $R_{1}(s_{1},a_{1})=R_{2}(s_{2},a_{2})$ .

∎

Lemma 3.1.20.

$M_{1}$ * is equivalent to $M_{2}$ .*

Proof.

We have shown that there exists a one-to-one correspondence between $S_{1}$ and $S_{2}$ . By defining the set $A_{2}$ , which consists of the equivalents of the actions in $A_{1}$ , and the set $Z_{2}$ , which consists of the equivalents of the observations in $Z_{1}$ , we have shown that $Tr_{1}=Tr_{2}$ , $O_{1}=O_{2}$ , and $R_{1}=R_{2}$ when evaluated on equivalent states, observations and actions. ∎

Lemma 3.1.21.

$M_{2}$ * models the problem of I’.*

Proof.

Here we show that although $M_{2}$ models a subproblem of I’ ( $M_{2}$ is defined on subsets of the states and actions of I’), it actually models the exact problem of I’. For every state $s\in S^{\prime}-S_{2}$ , $Loc(s)=v_{ij1}$ where $v_{ij1}\in V_{ij1}$ . An agent located in $v_{ij1}$ can only move to $v_{i}$ . In addition, in order for the agent to be located in $v_{ij1}$ it has to have moved there from $v_{i}$ . Thus $A_{2}$ replaces the two move actions with one meta-action, and we can therefore reduce the state set S’ to the subset $S_{2}$ . Therefore, $M_{2}$ models I’. ∎

3.2 CTP-Forward-Arcs

Definition 3.2.1.

Let $G=(V,E)$ be a disjoint paths graph of CTP-PATH-DEP and $W=(X,Y)$ be its associated Bayesian network. Let $x_{e_{ij}},x_{e_{ik}}\in X$ be the associated node of edges $e_{ij}$ and $e_{ik}$ (note that the edges are in the same path $i$ in G). Then the arc $\left\langle x_{e_{i,j}},x_{e_{i,k}}\right\rangle\in Y$ is Forward-Arc if $j<k$ , i.e. if $e_{ij}$ is closer to s than $e_{ik}$ .

Definition 3.2.2.

CTP-Forward-Dependency(CTP-FOR-DEP) is a special case of CTP-PATH-DEP such that all the arcs in W are Forward-Arcs.

Theorem 3.2.3.

CTP-FOR-DEP is solvable in polynomial time.

Proof outline. CTP on a disjoint paths graph with an independent distribution over the edges (CTP-PATH-IND) is shown to be solvable in polynomial time [Bnaya, Felner and Shimony, 2009]. We show that we can transform CTP-FOR-DEP into an instance of CTP-PATH-IND with a new distribution over the edges such that the optimal policy of the new CTP-PATH-IND can be applied to CTP-FOR-DEP.

Proof.

Let $I=(G,W,w,s,t)$ be an instance of CTP-FOR-DEP. We construct a new instance $I^{\prime}=(G,W^{\prime},w,s,t)$ of CTP-PATH-IND by constructing a new Bayesian network W’(X’,Y’) of I’ such that

•

$Y^{\prime}=\left\{\right\}$ . In other words W’ is “arc free” where each node is an independent component in the BN.

•

$\forall x^{\prime}_{e_{ik}}\in X^{\prime}$ $P(x^{\prime}_{e_{ik}}=1)=P(x_{e_{ik}}=1|x_{e_{i1}}=0,x_{e_{i2}}=0,...,x_{e_{i(k-1)}}=0)$ .

Let $M=(B,A,Tr,R)$ be a belief state MDP of I, where B is the set of belief states, A is the set of actions, Tr is the set of transition probabilities, and R is the reward function. We construct a new belief state MDP $M^{\prime}=(B^{\prime},A,Tr^{\prime},R^{\prime})$ of I’, where B’ is the set of belief states, A is the set of actions, which is the same as the set of actions in I (since it refers to the same graph G), Tr’ is the set of transition probabilities, and R’ is the reward function.

Definition 3.2.4.

Let $f:B\rightarrow B^{\prime}$ be a function defined as follows: Let $b\in B$ and $b^{\prime}\in B^{\prime}$ such that $f(b)=b^{\prime}$ then $\left\langle b\right\rangle=\left\langle b^{\prime}\right\rangle$ .

Notice that $f(b)$ is well defined since there is a one to one mapping from $\left\langle b\right\rangle$ to $b$ and from $\left\langle b^{\prime}\right\rangle$ to $b^{\prime}$ .

Lemma 3.2.5.

Let $b,\hat{b}$ be reachable belief states in B and let $a\in A$ be an action. Then $Tr(\hat{b}|a,b)=Tr^{\prime}(f(\hat{b})|a,f(b))$ .

Proof.

Let $a=Move(e)$ where $e=\left\langle v_{i(k-1)},v_{i(k)}\right\rangle$ and let $e_{f}=For(e)$ .

1. In case that $Tr(\hat{b}|a,b)=0$ (i.e. action $a$ is not performable in $b$ ), we show that $Tr^{\prime}(f(\hat{b})|a,f(b))=0$ . If $Tr(\hat{b}|a,b)=0$ then one of the following cases must be satisfied:

•

Edge $e$ is not adjacent to the location of the agent in $b$ , i.e. $e\notin Inc(LocB(b))$ . If $e\notin Inc(LocB(b))$ then $e\notin Inc(LocB(f(b)))$ , since $LocB(b)=LocB(f(b))$ . Thus $Tr^{\prime}(f(\hat{b})|a,f(b))=0$ .

•

Edge $e$ is not adjacent to the location of the agent in $\hat{b}$ , i.e. $e\notin Inc(LocB(\hat{b}))$ . If $e\notin Inc(LocB(\hat{b}))$ then $e\notin Inc(LocB(f(\hat{b})))$ , since $LocB(\hat{b})=LocB(f(\hat{b}))$ . Thus $Tr^{\prime}(f(\hat{b})|a,f(b))=0$ .

•

Edge $e$ is blocked in belief state $b$ , i.e. $stb(b,e)=B_{e}$ . If $stb(b,e)=B_{e}$ then $stb(f(b),e)=B_{e}$ , since $stb(b,e)=stb(f(b),e)$ . Thus $Tr^{\prime}(f(\hat{b})|a,f(b))=0$ .

•

There exists an edge $e^{\prime}\neq For(e)$ such that $stb(b,e^{\prime})\neq stb(\hat{b},e^{\prime})$ . Since $stb(b,e^{\prime})=stb(f(b),e^{\prime})$ and $stb(\hat{b},e^{\prime})=stb(f(\hat{b}),e^{\prime})$ , we get $stb(f(b),e^{\prime})\neq stb(f(\hat{b}),e^{\prime})$ . Thus $Tr^{\prime}(f(\hat{b})|a,f(b))=0$ .

2. In case that $Tr(\hat{b}|a,b)>0$ (i.e. action $a$ is performable in $b$ ), edge $e$ has to be open and one of the following cases must be satisfied:

•

Edge $e_{f}$ is Open in b (i.e. $stb(b,e_{f})=O_{e_{f}}$ ). If $stb(b,e_{f})=O_{e_{f}}$ then the status of all edges in $b$ must be the same as in $\hat{b}$ , i.e. $\forall e\in E\;stb(e,b)=stb(e,\hat{b})$ , since the agent does not sense any unknown edge when performing $a$ , and hence $Tr(\hat{b}|a,b)=1$ . If $stb(b,e_{f})=O_{e_{f}}$ then $stb(f(b),e_{f})=O_{e_{f}}$ and the status of all edges in $f(b)$ must be the same as in $f(\hat{b})$ for the same reason as before. Hence $Tr^{\prime}(f(\hat{b})|a,f(b))=1$ , and we have $Tr(\hat{b}|a,b)=Tr^{\prime}(f(\hat{b})|a,f(b))$ .

•

Edge $e_{f}$ is Unknown in b (i.e. $stb(b,e_{f})=U_{e_{f}}$ ). Since W is the belief network of CTP-FOR-DEP and b is reachable from $b_{0}$ , the statuses of all edges $e_{i1},e_{i2},...,e_{ik}$ have to be Open (in order to reach $v_{ik}$ , all edges in path i from s to $v_{ik}$ must be traversable). Thus,

[TABLE]

In addition, $stb(f(b),e_{f})=U_{e_{f}}$ (since $stb(f(b),e_{f})=stb(b,e_{f})$ ). There are no dependencies in W’ (i.e. $Y^{\prime}=\{\}$ ). Therefore,

[TABLE]

However, by definition of X’ we have,

[TABLE]

Hence,

[TABLE]

∎

Lemma 3.2.6.

Let $b,\hat{b}$ be reachable belief states in B and let $a\in A$ be an action. Then $R(b,a,\hat{b})=R^{\prime}(f(b),a,f(\hat{b}))$ .

Proof.

From definition 2.5 it follows that:

$R(b,a,\hat{b})=R^{\prime}(f(b),a,f(\hat{b}))$ if and only if $Tr(\hat{b}|a,b)=Tr^{\prime}(f(\hat{b})|a,f(b))$ .

But we proved in lemma 3.2.5 that $Tr(\hat{b}|a,b)=Tr^{\prime}(f(\hat{b})|a,f(b))$

Thus, $R(b,a,\hat{b})=R^{\prime}(f(b),a,f(\hat{b}))$ . ∎

Definition 3.2.7.

We define the predicate $REACHABLE(b_{n},b_{0})$ to be true if and only if $b_{n}$ is reachable from $b_{0}$ in belief-MDP M. i.e. there exist $b_{1},...,b_{n-1}$ such that $\prod\limits_{0\leq i\leq n-1,a\in A}Tr(b_{i},a,b_{i+1})>0$

Lemma 3.2.8.

Let $b\in B$ . Then $REACHABLE(b,b_{0})$ is true if and only if $REACHABLE(f(b),f(b_{0}))$ is true.

Proof.

Follows from definition 3.2.7 and lemma 3.2.5. ∎

Definition 3.2.9.

Define the set $B_{reach}\subset B$ to be the set of all belief states $b\in B$ that satisfy $REACHABLE(b,b_{0})$ . Namely,

[TABLE]
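Definition 3.2.7 amounts to graph reachability over transitions with positive probability, so the set $B_{reach}$ can be collected by a breadth-first search. A minimal Python sketch, assuming a hypothetical representation in which `tr[(b, a)]` maps each successor belief state to its transition probability (the belief-state labels below are placeholders, not belief states of an actual instance):

```python
from collections import deque
from typing import Dict, Hashable, Set, Tuple

Belief = Hashable
# Assumed layout: tr[(b, a)][b_next] = Tr(b_next | a, b)
Transitions = Dict[Tuple[Belief, str], Dict[Belief, float]]


def reachable_beliefs(b0: Belief, tr: Transitions) -> Set[Belief]:
    """All belief states b with REACHABLE(b, b0), found by breadth-first search."""
    reach = {b0}
    queue = deque([b0])
    while queue:
        b = queue.popleft()
        for (src, _action), successors in tr.items():
            if src != b:
                continue
            for nxt, prob in successors.items():
                # Only transitions with positive probability count.
                if prob > 0 and nxt not in reach:
                    reach.add(nxt)
                    queue.append(nxt)
    return reach
```

Because lemma 3.2.5 states that $f$ preserves transition probabilities, running the same search on the image of the transition table under $f$ would visit exactly the images of the belief states visited here, which is the content of lemma 3.2.8.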

Next, we define an analogue set for B’,

Definition 3.2.10.

Define the set $B^{\prime}_{reach}\subset B^{\prime}$ to be the set of all belief states $f(b)\in B^{\prime}$ , for $b\in B$ , that satisfy $REACHABLE(f(b),f(b_{0}))$ . Namely,

[TABLE]

Let $M_{r}=(B_{reach},A,Tr,R)$ be a belief MDP over belief state $B_{reach}$ and let $M^{\prime}_{r}=(B^{\prime}_{reach},A,Tr^{\prime},R^{\prime})$ be a belief MDP over belief state $B^{\prime}_{reach}$

Lemma 3.2.11.

$M_{r}$ * and $M^{\prime}_{r}$ are isomorphic.*

Proof.

It follows from definition 3.2.7 that $f$ is a bijection between $B_{reach}$ and $B^{\prime}_{reach}$ . In addition, $f$ preserves the transition functions Tr, Tr’ (lemma 3.2.5) as well as the reward functions R, R’ (lemma 3.2.6). ∎

Corollary 3.2.12.

Let $\pi^{*}$ be the optimal policy of I, and $\pi^{\prime*}$ be the optimal policy of I’. Then for every reachable belief state $b\in B_{reach}$ we have $\pi^{*}(b)=\pi^{\prime*}(f(b))$ .

Proof.

Since $M_{r}$ and $M^{\prime}_{r}$ are isomorphic, the problems are equivalent and their optimal solutions are equivalent. ∎

Therefore, we can transform any instance of CTP-FOR-DEP into an instance of CTP-PATH-IND, apply the algorithm which solves CTP-PATH-IND in polynomial time, and obtain an equivalent optimal solution (corollary 3.2.12).

Now, we show that the probability $P(x_{e_{i,k}}=1|x_{e_{i,1}}=0,x_{e_{i,2}}=0,...,x_{e_{i,k-1}}=0)$ can be computed for all nodes in polynomial time. We use Bayes’ theorem to get:

[TABLE]

We use the chain rule to get:

[TABLE]

The variables in the Bayesian network are topologically ordered by their order in the path, and hence each probability $P(x_{e_{ik}})$ can be computed iteratively given that the values of its ancestors have already been determined (using equations 3.2,3.2). Therefore, inferring the probability of each edge takes linear time and inferring the probabilities of all edges takes $O(|E|^{2})$ . Thus computing the optimal policy takes polynomial time. ∎
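The computation of the new edge probabilities can be sketched in Python for a single path. The sketch assumes that 0 encodes an open edge and 1 a blocked edge, and it uses a hypothetical interface in which `cpts[k](prefix)` returns the probability that the edge at position `k` is blocked given the statuses `prefix` of the earlier edges on the same path; neither the interface nor the names come from the original construction.

```python
from typing import Callable, List, Tuple

Prefix = Tuple[int, ...]
# Assumed interface: cpts[k](prefix) = P(edge k blocked | statuses of edges 0..k-1),
# where 0 = open and 1 = blocked, and forward arcs mean only earlier edges matter.
CPT = Callable[[Prefix], float]


def prob_prefix_open(cpts: List[CPT], k: int) -> float:
    """Chain rule: probability that the first k edges of the path are all open."""
    p = 1.0
    for j in range(k):
        p *= 1.0 - cpts[j]((0,) * j)
    return p


def independent_block_probs(cpts: List[CPT]) -> List[float]:
    """For each edge k, P(edge k blocked | all earlier edges open), via Bayes' theorem."""
    probs = []
    for k in range(len(cpts)):
        denom = prob_prefix_open(cpts, k)      # P(all earlier edges open)
        if denom == 0.0:
            # The edge can never be reached; the value used here is an arbitrary convention.
            probs.append(1.0)
            continue
        numer = denom * cpts[k]((0,) * k)      # P(edge k blocked and all earlier edges open)
        probs.append(numer / denom)
    return probs
```

Each call to `prob_prefix_open` is linear in the position of the edge, so the loop over all edges of a path is quadratic in the path length, in line with the $O(|E|^{2})$ bound stated above.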

3.3 CTP-PATH-DEP

Definition 3.3.1.

CTP-PATH-DEP is a special case of CTP-DEP where the associated Bayesian network has dependencies only between edges on the same path.

Theorem 3.3.2.

CTP-PATH-DEP is NP-hard.

Proof outline. By reduction from 3-SAT to CTP-PATH-DEP.

Proof.

Let $L$ be a set of boolean variables $l_{1},...,l_{n}$ . Let the 3CNF formula $F$ be a conjunction of the clauses $C_{1},C_{2},...,C_{k}$ where each clause $C_{i}$ is a disjunction of three literals ${l^{\prime}_{i}}^{1},{l^{\prime}_{i}}^{2},{l^{\prime}_{i}}^{3}$ and for each literal ${l^{\prime}_{i}}^{j}$ it holds that ${l^{\prime}_{i}}^{j}\in L$ or $\neg{{l^{\prime}_{i}}^{j}}\in L$ . We construct the instance $I=(G,W,w,s,t)$ of CTP-PATH-DEP from F, such that F is satisfiable if and only if the expected cost of the optimal policy is greater than some given constant. I is defined as follows: $G=(V,E)$ is a graph consisting of two disjoint paths $p_{1},p_{2}$ , where

1. $p_{1}=\left\langle e_{Y},e_{d1},...,e_{d(k-1)},e_{c_{1}},...,e_{c_{k}},e_{l_{1}},...,e_{l_{n}},e_{R}\right\rangle$ (the edges are ordered from the edge incident to s to the edge incident to t). Edges $e_{c_{1}},...,e_{c_{k}}$ correspond to clauses $C_{1},...,C_{k}$ respectively, and edges $e_{l_{1}},...,e_{l_{n}}$ correspond to variables $l_{1},...,l_{n}$ respectively. The correspondence will be defined later in this proof.

2. $p_{2}$ consists of a single edge $e_{L}$ .

The weight function w over the edges is defined by:

•

$w(e_{Y})=1$

•

$w(e_{L})=(1+\frac{k}{2^{n+1}})$

•

$w(e)=0$ for all other edges.

$W=(X,Y)$ is a Bayesian network.

Definition 3.3.3.

For every edge $e\in E$ in path $p_{1}$ , we define the variable $x_{e}\in X$ to be the variable corresponding to edge $e$ such that $x_{e}=0$ if and only if $e$ is Open.

The set of nodes $X$ of $W$ is a union of the following sets:

•

$X_{Y}=\{x_{Y}\}$ . $X_{Y}$ is a set that contains the single variable $x_{Y}$ .

•

$X_{R}=\{x_{R}\}$ . $X_{R}$ is a set that contains the single variable $x_{R}$ .

•

$X_{l}=\{x_{l_{1}},x_{l_{2}},...,x_{l_{n}}\}$ . $X_{l}$ is a set that contains all nodes that correspond to variables $l_{1},...,l_{n}$ .

•

$X_{c}=\{x_{c_{1}},x_{c_{2}},...,x_{c_{k}}\}$ . $X_{c}$ is a set that contains all nodes that correspond to variables $c_{1},...,c_{k}$ .

•

$X_{d}=\{x_{d_{1}},x_{d_{2}},...,x_{d_{k-1}}\}$ . $X_{d}$ is a set that contains all nodes that correspond to variables $d_{1},...,d_{k-1}$ .

Namely, $X=X_{Y}\cup X_{R}\cup X_{l}\cup X_{c}\cup X_{d}$ .

The arcs in $Y$ are defined by the following sets:

•

$Y_{Rl_{i}}=\{\left\langle x_{R},x_{l_{i}}\right\rangle\}$ . An arc from node $x_{R}$ to node $x_{l_{i}}$ .

•

$Y_{Rc_{i}}=\{\left\langle x_{R},x_{c_{i}}\right\rangle\}$ . An arc from node $x_{R}$ to node $x_{c_{i}}$ .

•

$Y_{Rd_{i}}=\{\left\langle x_{R},x_{d_{i}}\right\rangle\}$ . An arc from node $x_{R}$ to node $x_{d_{i}}$ .

•

$Y_{li}=\{\left\langle x_{l_{i}^{1}},x_{c_{i}}\right\rangle,\left\langle x_{l_{i}^{2}},x_{c_{i}}\right\rangle,\left\langle x_{l_{i}^{3}},x_{c_{i}}\right\rangle\}$ . A set of three arcs from each variable node $x_{l_{i}^{j}}$ (for $1\leq j\leq 3$ ) to clause node $x_{c_{i}}$ such that $l_{i}^{j}$ is the variable corresponding to literal ${l^{\prime}_{i}}^{j}$ . For instance, if $C_{5}=\{{l^{\prime}_{5}}^{1}\vee\neg{l^{\prime}_{5}}^{2}\vee\neg{l^{\prime}_{5}}^{3}\}$ then $l_{5}^{1}={l^{\prime}_{5}}^{1}$ , $l_{5}^{2}=\neg{l^{\prime}_{5}}^{2}$ , and $l_{5}^{3}=\neg{l^{\prime}_{5}}^{3}$ .

•

$\forall_{1\leq i\leq k}$ $(x_{ci},x_{di})\in Y$ - Arc from each node $x_{ci}$ to a corresponding node $x_{di}$

•

$\forall_{1\leq i\leq k}$ $(x_{di},x_{Y})\in Y$ - Arc from each node $x_{di}$ to node $x_{Y}$

The conditional probabilities of W are as follows:

1. $P(x_{R}=0)=0.5$ ( $x_{R}$ is an independent variable).

2. For every variable node $x_{i}$ it holds that $P(x_{i}=0|x_{R}=0)=1$ , i.e. if $x_{R}=0$ then path $p_{1}$ is open with probability 1.

3. Given $x_{R}=1$ , W is specified as follows:

(a) $x_{ci}=0\Leftrightarrow\bigwedge_{j=1}^{3}{x_{l_{ij}}=0}$

(b) $x_{d1}=0\Leftrightarrow x_{c_{1}}=0$

(c) $\forall i>1\;x_{d(i+1)}=0\Leftrightarrow x_{c_{i}}=0,x_{di}=0$

(d) $x_{Y}=0\Leftrightarrow\bigwedge_{i}{x_{di}=0}$

The reduction maps each variable of F to a variable of W such that,

•

Each boolean SAT variable $l_{i}$ is mapped to a binary variable in the Bayes network $x_{li}$ , such that $x_{li}=0$ if and only if $l_{i}=T$

•

Each clause $C_{i}$ is mapped to binary variable $x_{ci}$ such that $x_{ci}=0$ if and only if $C_{i}=T$ .

Lemma 3.3.4.

Given $x_{R}=1$ , then F is satisfiable $\Leftrightarrow x_{Y}=0$ .

Proof.

If $x_{R}=1$ then F is satisfiable $\Leftrightarrow C_{1}=T,C_{2}=T,...,C_{k}=T\Leftrightarrow x_{c_{1}}=0,x_{c_{2}}=0,...,x_{c_{k}}=0$ . In addition, $x_{c_{1}}=0\Leftrightarrow x_{d1}=0$ and $\forall i>1\;x_{d(i+1)}=0\Leftrightarrow x_{c_{i}}=0,x_{di}=0$ . Thus, $x_{c_{1}}=0,x_{c_{2}}=0,...,x_{c_{k}}=0\Leftrightarrow x_{d_{1}}=0,x_{d_{2}}=0,...,x_{d_{k}}=0\Leftrightarrow x_{Y}=0$ ∎

For simplicity $w(e_{Y})$ is denoted by CY and $w(e_{L})$ is denoted by CL. Note that $CL>CY$ .

The construction of the reduction is computable in polynomial time since the graph G contains $O(n)$ vertices, $O(n)$ edges and the Bayes network W contains $O(n)$ nodes and $O(n)$ arcs. In addition, function $g$ , which maps each variable in $F$ to variable in $W$ , is computable in polynomial time as well.

The optimal policy is committing in the sense that after the agent chooses a path, it keeps following this path until reaching t, unless it hits a blocked edge. This is because if the agent chooses to traverse $p_{1}$ first, then after traversing the first edge $e_{Y}$ it is optimal to keep following $p_{1}$ toward $t$ , since the rest of the edges in $p_{1}$ have cost 0, and thus if $p_{1}$ is traversable, no extra travel cost is paid. On the other hand, if $p_{1}$ is not traversable then the agent pays an extra $CY$ regardless of how many edges it traversed in $p_{1}$ . Therefore the decision problem of the optimal policy here is simply whether to choose $p_{1}$ or $p_{2}$ as the first path to try.

Notation 3.3.5.

Let $\pi_{12}$ denote a committing policy that chooses $p_{1}$ as a first path to try, and $\pi_{21}$ denote a committing policy that chooses $p_{2}$ as a first path to try.

Lemma 3.3.6.

Let C be a constant, such that $1+\frac{k}{2^{n+2}}<C<1+\frac{k}{2^{n+1}}$ , where k is the number of models in F and n is the number of boolean-SAT-variables in F. Let $\pi^{*}$ be the optimal policy of I. F is satisfiable if and only if $Exp(\pi^{*})>C$

Proof.

$=>$ Suppose that F is satisfiable. The probability that $e_{Y}$ is open is

[TABLE]

by construction of W:

[TABLE]

The probability $P(F=True)=\frac{k}{2^{n}}$ since there are k sets of literals of F such that its instantiation gives F=true, and the domain size is the number of all possible instantiations to $l_{1},...,l_{n}$ , which equals $2^{n}$ . Thus,

[TABLE]
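The quantity $P(F=True)=\frac{k}{2^{n}}$ used above can be checked on small formulas by brute-force enumeration. The following Python sketch is only an illustration: the signed-integer clause encoding and the function names are assumptions made here, and the enumeration takes exponential time, so it is not part of the polynomial-time reduction itself.

```python
from itertools import product
from typing import List, Tuple

# Assumed encoding: literal +i stands for variable l_i, literal -i for its negation.
Clause = Tuple[int, ...]


def count_models(clauses: List[Clause], n: int) -> int:
    """Number of truth assignments to l_1..l_n that satisfy every clause."""
    models = 0
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in clauses):
            models += 1
    return models


def prob_formula_true(clauses: List[Clause], n: int) -> float:
    """P(F = True) when every variable is set independently and uniformly at random."""
    return count_models(clauses, n) / 2 ** n
```

For example, the formula $(l_{1}\vee l_{2}\vee l_{3})\wedge(\neg l_{1}\vee\neg l_{2}\vee\neg l_{3})$ excludes only the all-false and all-true assignments, so it has 6 models out of 8 and `prob_formula_true` returns 0.75.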

Let PY denote the probability $P(x_{Y}=0)$ .

Now, we want to calculate the probability that path $p_{1}$ is open given that $e_{Y}$ is open.

[TABLE]

According to W:

[TABLE]

if $x_{R}=1$ then $p_{1}$ is blocked

[TABLE]

Setting equations 3.8,3.10,3.11 in equation 3.3 gives:

[TABLE]

Denote by $w(p_{1})$ the total cost of all edges in $p_{1}$ and by $w(p_{2})$ the total cost of all edges in $p_{2}$ . The expected cost of the policy when choosing $p_{1}$ as the first path is

[TABLE]

Note that in case the agent traverses $e_{Y}$ and $p_{1}$ is blocked, the agent hits a blocked edge and is forced to pay an extra CY when it moves back to s. The expected cost of $\pi_{21}$ is simply CL. Since $CL<2\cdot PY\cdot CY+0.5(CL-CY)$ , the optimal policy is $\pi_{21}$ and $Exp(\pi^{*})=CL$ . It is given that $CL=1+\frac{k}{2^{n+1}}>C$ ; therefore if F is satisfiable then $Exp(\pi^{*})>C$ .

$<=$ Suppose that F is not satisfiable. Now, the calculation of the probability is easier because we know that the only case where $e_{Y}$ is open is when $x_{R}=0$ .

Therefore,

[TABLE]

According to equations 3.11,3.14,3.15,3.16

[TABLE]

Thus if $e_{Y}$ is open then $p_{1}$ is open. The expected cost of $\pi_{12}$ is

[TABLE]

Again the expected cost of $\pi_{21}$ is CL. Since $CL>CY\Rightarrow CL>0.5\cdot(CY+CL)$ , the optimal policy is $\pi_{12}$ and therefore $Exp(\pi^{*})=0.5\cdot(CY+CL)=0.5(1+1+\frac{k}{2^{n+1}})=1+\frac{k}{2^{n+2}}$ . It is given that $1+\frac{k}{2^{n+2}}<C$ , and thus if F is not satisfiable then $Exp(\pi^{*})<C$ ∎

∎

3.4 Theoretical Properties of Belief-MDP for CTP

In the following section, we are given an instance $I=(G,P,w,s,t)$ of CTP, where $G=(V,E)$ . We construct a belief state MDP $M_{S}=(B,A,Tr,R,b_{0})$ of I, where S is the state set of I.

Definition 3.4.1.

Policy $\pi$ is called finite if the AO-graph for $\pi$ is acyclic(DAG).

Notation 3.4.2.

Denote the expected cost of the optimal policy of $M_{S}$ in belief state $b$ as $C^{*}(b)$ ; namely, $C^{*}(b)\equiv C^{\pi^{*}}(b)$ .

If the AO-graph for policy $\pi$ is acyclic then $C^{\pi}$ is finite [Bonet, 2010]. By definition, there is a traversable edge $\left\langle s,t\right\rangle$ in G. Therefore, there is a policy $\pi$ with finite cost and hence $C^{\pi^{*}}$ is finite [Bonet, 2010]. It should be noted that all policies referred to in this section are finite.

Definition 3.4.3.

The predicate $MoreBlocked(b_{1},b_{2})$ , defined over $b_{1},b_{2}$ , is true if and only if the following properties are satisfied:

1. $Loc(b_{1})=Loc(b_{2})$ .

2. For all $e\in E$ ,

•

$stb(e,b_{1})=Open$ * if and only if $stb(e,b_{2})=Open$ .*

•

$stb(e,b_{1})=Blocked$ * if $stb(e,b_{2})=Blocked$ .*

•

$stb(e,b_{1})=Unknown$ * if $stb(e,b_{2})=Unknown$ or if $stb(e,b_{2})=Blocked$ .*

The predicate $MoreBlocked(b_{1},b_{2})$ indicates that “ $b_{2}$ is at least as blocked as $b_{1}$ ”, meaning that if the pair $b_{1},b_{2}$ satisfies $MoreBlocked(b_{1},b_{2})$ then $Blocked(b_{1})\subseteq Blocked(b_{2})$ .

Let $E=\{e_{1},e_{2},e_{3}\}$ . We demonstrate $MoreBlocked(b_{1},b_{2})$ by the following table:

[TABLE]
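The predicate of definition 3.4.3 can be sketched in Python. The sketch rests on an interpretation that should be treated as an assumption: each one-directional “if” in the definition is read as “only if”, which is the reading under which $Blocked(b_{1})\subseteq Blocked(b_{2})$ holds. The encoding of a belief state as a pair of a location and a dictionary from edges to the letters `"O"`, `"B"`, `"U"` is likewise hypothetical.

```python
from typing import Dict, Tuple

# Assumed encoding: (location, {edge: "O" | "B" | "U"}).
Belief = Tuple[str, Dict[str, str]]


def more_blocked(b1: Belief, b2: Belief) -> bool:
    """MoreBlocked(b1, b2) under the 'only if' reading of definition 3.4.3."""
    loc1, st1 = b1
    loc2, st2 = b2
    if loc1 != loc2 or st1.keys() != st2.keys():
        return False
    for e in st1:
        s1, s2 = st1[e], st2[e]
        # An edge is known open in b1 if and only if it is known open in b2.
        if (s1 == "O") != (s2 == "O"):
            return False
        # An edge blocked in b1 must also be blocked in b2.
        if s1 == "B" and s2 != "B":
            return False
        # An edge unknown in b1 may be unknown or blocked in b2; nothing more to check.
    return True


def blocked(b: Belief) -> set:
    """Set of edges known to be blocked in b."""
    return {e for e, s in b[1].items() if s == "B"}
```

With this reading, whenever `more_blocked(b1, b2)` returns `True` the set `blocked(b1)` is contained in `blocked(b2)`.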

Definition 3.4.4.

Let $b_{1},b_{2}\in B$ . Define the predicate $MoreOpen(b_{1},b_{2})$ to be true if and only if the following properties are satisfied:

1. $Loc(b_{1})=Loc(b_{2})$ .

2. For all $e\in E$ ,

•

$stb(e,b_{1})=Blocked$ * if and only if $stb(e,b_{2})=Blocked$ .*

•

$stb(e,b_{1})=Open$ * if $stb(e,b_{2})=Open$ .*

•

$stb(e,b_{1})=Unknown$ * if $stb(e,b_{2})=Unknown$ or if $stb(e,b_{2})=Open$ .*

Intuitively, $MoreOpen(b_{1},b_{2})$ means that “ $b_{2}$ is at least as open as $b_{1}$ ”, where the set of known open edges in $b_{1}$ is contained in the set of known open edges in $b_{2}$ .

We demonstrate $MoreOpen(b_{1},b_{2})$ by the following table:

[TABLE]

Definition 3.4.5.

We define the function $BlockEdges:\mathcal{P}(E)\times B\rightarrow B$ as follows: Let $b_{1},b_{2}\in B$ such that $b_{1}=BlockEdges(\hat{E},b_{2})$ ; then $\left\langle b_{1}\right\rangle$ is defined (by its elements) as follows:

1. $Loc(b_{1})=Loc(b_{2})$ .

2. For all $e\in\hat{E}$ , $stb(e,b_{1})=Blocked$ .

3. For all $e\notin\hat{E}$ , $stb(e,b_{1})=stb(e,b_{2})$ .

Note that by corollary 2.5.9, $b_{1}$ can be determined from $\left\langle b_{1}\right\rangle$ . The function is called $BlockEdges(\hat{E},b_{2})$ since it “blocks” all the edges in $\hat{E}$ (i.e. for every edge $e\in\hat{E}$ the function $BlockEdges(\hat{E},b_{2})$ “changes” the status of edge $e$ in $b_{2}$ to $stb(e,b_{1})=Blocked$ ), while all the other elements of $\left\langle b_{2}\right\rangle$ remain unchanged in $\left\langle b_{1}\right\rangle$ .

For example, suppose we are given a belief state $b$ such that $\left\langle b\right\rangle=\left\langle v_{1},B_{e_{1}},U_{e_{2}},O_{e_{3}},U_{e_{4}},U_{e_{5}}\right\rangle$ and $\hat{E}=\{e_{2},e_{4},e_{5}\}$ . Then, if $b^{\prime}=BlockEdges(\hat{E},b)$ , we have $\left\langle b^{\prime}\right\rangle=\left\langle v_{1},B_{e_{1}},B_{e_{2}},O_{e_{3}},B_{e_{4}},B_{e_{5}}\right\rangle$ .
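The $BlockEdges$ operation of definition 3.4.5 can be sketched in Python and checked against the example above. The encoding of a belief state as a pair of a location and a dictionary from edge names to the letters `"O"`, `"B"`, `"U"` is an assumption made for this illustration.

```python
from typing import Dict, Iterable, Tuple

# Assumed encoding: (location, {edge: "O" | "B" | "U"}).
Belief = Tuple[str, Dict[str, str]]


def block_edges(edges: Iterable[str], b: Belief) -> Belief:
    """Return a copy of b in which every edge in `edges` is marked blocked."""
    loc, status = b
    to_block = set(edges)
    # Location and all other edge statuses are carried over unchanged.
    new_status = {e: ("B" if e in to_block else s) for e, s in status.items()}
    return (loc, new_status)


# The example from the text: block e2, e4 and e5 in <v1, B, U, O, U, U>.
b = ("v1", {"e1": "B", "e2": "U", "e3": "O", "e4": "U", "e5": "U"})
b_prime = block_edges({"e2", "e4", "e5"}, b)
```

Here `b_prime` encodes $\left\langle v_{1},B_{e_{1}},B_{e_{2}},O_{e_{3}},B_{e_{4}},B_{e_{5}}\right\rangle$ , and `b` itself is left unmodified because a new dictionary is built.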

Property 3.4.6.

For every belief state $b$ and a set of edges $\hat{E}\subset E$ we have $MoreBlocked(b,BlockEdges(\hat{E},b))$ .

Proof.

Follows immediately from definition 3.4.5. ∎

Definition 3.4.7.

We define the function $OpenEdges:\mathcal{P}(E)\times B\rightarrow B$ as follows: Let $b_{1},b_{2}\in B$ such that $b_{1}=OpenEdges(\hat{E},b_{2})$ ; then $\left\langle b_{1}\right\rangle$ is defined (by its elements) as follows:

1. $Loc(b_{1})=Loc(b_{2})$ .

2. For all $e\in\hat{E}$ , $stb(e,b_{1})=Open$ .

3. For all $e\notin\hat{E}$ , $stb(e,b_{1})=stb(e,b_{2})$ .

The function is called $OpenEdges(\hat{E},b_{2})$ since it “opens” all the edges in $\hat{E}$ (i.e. for every edge $e\in\hat{E}$ the function $OpenEdges(\hat{E},b_{2})$ “changes” the status of edge $e$ in $b_{2}$ to $stb(e,b_{1})=Open$ ), while all the other elements of $\left\langle b_{2}\right\rangle$ remain unchanged in $\left\langle b_{1}\right\rangle$ .

For example, suppose we are given a belief state $b$ such that $\left\langle b\right\rangle=\left\langle v_{1},B_{e_{1}},U_{e_{2}},O_{e_{3}},U_{e_{4}},U_{e_{5}}\right\rangle$ and $\hat{E}=\{e_{2},e_{4},e_{5}\}$ . Then, if $b^{\prime}=OpenEdges(\hat{E},b)$ , we have $\left\langle b^{\prime}\right\rangle=\left\langle v_{1},B_{e_{1}},O_{e_{2}},O_{e_{3}},O_{e_{4}},O_{e_{5}}\right\rangle$ .

Property 3.4.8.

For every belief state $b$ and a set of edges $\hat{E}\subset E$ we have $MoreOpen(\hat{E},OpenEdges(\hat{E},b))$ .

Proof.

Follows immediately from definition 3.4.7. ∎

Definition 3.4.9.

We define the function $BlockEdges^{-1}:\mathcal{P}(E)\times B\rightarrow\mathcal{P}(B)$ as follows: $B_{2}=BlockEdges^{-1}(\hat{E},b_{1})$ is the set of all belief states $b_{2}\in B$ such that $b_{1}=BlockEdges(\hat{E},b_{2})$ . Meaning $BlockEdges^{-1}(\hat{E},b_{1})=\{b_{2}|b_{1}=BlockEdges(\hat{E},b_{2})\}$ .

Note that the function $BlockEdges^{-1}$ generalizes an inverse function in the following sense: for every $b\in B$ , $\hat{E}\in\mathcal{P}(E)$ , and every $b_{2}\in BlockEdges^{-1}(\hat{E},b)$ , we get $b=BlockEdges(\hat{E},b_{2})$ .

For instance, let $B^{\prime}=BlockEdges^{-1}(\hat{E},b)$ where $\hat{E}=\{e_{2}\}$ and $b\in B$ such that $\left\langle b\right\rangle=\left\langle O_{e1},B_{e2}\right\rangle$ . Then $B^{\prime}=\{b_{1},b_{2}\}$ such that,

•

$\left\langle b_{1}\right\rangle=\left\langle O_{e1},B_{e2}\right\rangle$

•

$\left\langle b_{2}\right\rangle=\left\langle O_{e1},U_{e2}\right\rangle$

Definition 3.4.10.

Let $\hat{E}\subseteq E$ . The equivalence relation $\sim_{\hat{E}}$ is defined as follows:

The belief states $b_{1},b_{2}$ satisfy $b_{1}\sim_{\hat{E}}b_{2}$ if and only if $MoreBlocked_{\hat{E}}(b_{1})=MoreBlocked_{\hat{E}}(b_{2})$ .

Definition 3.4.11.

Let $DiffEStatus:B\times B\rightarrow\mathcal{P}(E)$ be a function. $E^{\prime}=DiffEStatus(b,b^{\prime})$ is defined to be the set of all edges incident to $Loc(b^{\prime})$ which are unknown in $b$ and known in $b^{\prime}$ , i.e.

$E^{\prime}=\{e\in Inc(Loc(b^{\prime}))\,|\,stb(e,b)=Unknown,\ stb(e,b^{\prime})\neq Unknown\}$ .

Definition 3.4.12.

Let $\Pi$ be a set of finite policies over B. We define the function $SimBlocked:\Pi\times\mathcal{P}(E)\rightarrow\Pi$ as follows: For every pair of belief states $b,b^{\prime}\in B$ such that $b^{\prime}=BlockEdges(\hat{E},b)$ , the policy $\pi^{\prime}=SimBlocked(\pi,\hat{E})$ satisfies $\pi^{\prime}(b)=\pi(b^{\prime})$ .

Definition 3.4.13.

We define the function $Next:B\times A\rightarrow\mathcal{P}(B)$ as follows: $B^{\prime}=Next(b,a)$ is the set of all belief states that can be reached from belief state $b$ immediately after taking action $a$ . Meaning

$Next(b,a)=\{b^{\prime}\in B\,|\,Tr(b,a,b^{\prime})>0\}$ .

Definition 3.4.14.

Let $B_{1}\subset B$ and $\hat{E}\subset E$ . Define $Set-Blocked_{\hat{E}}(B_{1})=\{b_{2}|b_{2}=BlockEdges(\hat{E},b_{1}),b_{1}\in B_{1}\}$

Lemma 3.4.15.

Let $b_{1},b_{2}\in B$ such that $MoreBlocked(b_{1},b_{2})$ . Let $b_{1}^{\prime},b_{2}^{\prime}\in B,\hat{E}\subset E$ such that $b_{2}^{\prime}=BlockEdges(\hat{E},b_{1}^{\prime})$ . Let $a\in A$ such that $Tr(b_{1},a,b_{1}^{\prime})>0$ , then $Tr(b_{2},a,b_{2}^{\prime})>0$ .

Proof.

Let $e=\left\langle v_{i},v_{j}\right\rangle$ . If $a=Sense(e)$ then $a$ can be performed in any belief state. However, if $a=Move(e)$ then, for every $b\in B$ , $a$ can be performed in $b$ if and only if $Loc(b)=v_{i}$ and $stb(e,b)=Open$ . By definition 3.4.3 we have $Loc(b_{1})=Loc(b_{2})$ . All belief states in B are consistent with a given realization (all belief states in B describe knowledge about the same environment), and since $e$ is known in $b_{1}$ ( $e\in Inc(v_{i})$ ), we get $stb(e,b_{1})=Open$ if and only if $stb(e,b_{2})=Open$ . Thus an agent in $b_{1}$ can perform $a=Move(e)$ if and only if an agent in $b_{2}$ can perform $a=Move(e)$ . But we are given that $Tr(b_{1},a,b_{1}^{\prime})>0$ , hence $a$ can be performed in $b_{2}$ as well.

Let $\hat{E_{1}}=DiffEStatus(b_{1},b_{1}^{\prime})$ and let $\hat{E_{2}}=DiffEStatus(b_{2},b_{2}^{\prime})$ . It remains to show that the status of all edges in $E\backslash\hat{E_{2}}$ is equal in $b_{2}$ and in $b_{2}^{\prime}$ , i.e. for all $e\in E\backslash\hat{E_{2}}$ , $stb(e,b_{2})=stb(e,b_{2}^{\prime})$ . By definition 3.4.3 we have $UnknownEdges(b_{2}^{\prime})\subseteq UnknownEdges(b_{1}^{\prime})$ , and thus $\hat{E_{2}}\subseteq\hat{E_{1}}$ . Thus, by definition of DiffEStatus, the status of all edges in $E\backslash\hat{E_{2}}$ is equal in $b_{2}$ and in $b_{2}^{\prime}$ . This satisfies all conditions for having $Tr(b_{2},a,b_{2}^{\prime})>0$ . ∎

Lemma 3.4.16.

Let $b_{1},b_{2}\in B$ such that $MoreBlocked(b_{1},b_{2})$ , and let $a\in A$ . Let $B_{1}=Next(b_{1},a)$ and $B_{2}=Set-Blocked_{\hat{E}}(B_{1})$ . Let $b_{2i}\in B_{2}$ and $B_{1}^{\prime}\subset B_{1}$ such that $B_{1}^{\prime}=BlockEdges^{-1}(\hat{E},b_{2i})$ . Then, $Tr(b_{2},a,b_{2i})=\sum_{b_{1i}\in B_{1}^{\prime}}{Tr(b_{1},a,b_{1i})}$ .

Proof.

Let $b_{1}^{\prime}\in B_{1}$ . By definition of $B_{1}$ , $Tr(b_{1},a,b_{1}^{\prime})>0$ , hence, by lemma 3.4.15, $Tr(b_{2},a,b_{2}^{\prime})>0$ . Let $E^{\prime}=DiffEStatus(b_{2},b_{2}^{\prime})$ , we define the probability $P_{2E^{\prime}}$ by,

[TABLE]

By definition of the transition function, $Tr(b_{2},a,b_{2}^{\prime})=P_{2E^{\prime}}$ . For every $1\leq i\leq n$ , where $n=|B_{1}^{\prime}|$ , define $E_{i}^{\prime\prime}=DiffEStatus(b_{1},b_{1i})$ . For every $i$ , $UnknownEdges(b_{2}^{\prime})\subseteq UnknownEdges(b_{1i})$ , hence $E^{\prime}\subseteq E_{i}^{\prime\prime}$ . We define the probabilities $P_{1E^{\prime}},P_{1E_{i}^{\prime\prime}}$ as follows,

[TABLE]

By definition of transition function, $Tr(b_{1},\pi(b_{1}),b_{1i})=P_{1E^{\prime}}P_{1E_{i}^{\prime\prime}}$ .

Summing up $Tr(b_{1},a,b_{1i})$ over all $b_{1i}\in B_{1}^{\prime}$ gives,

[TABLE]

$P_{1E^{\prime}}$ is equal for all $b_{1i}\in B_{1}^{\prime}$ .

$\sum\limits_{b_{1i}\in B_{1}^{\prime}}P_{1E_{i}^{\prime\prime}}=1$ (the sum of all marginal probabilities equals 1).

For all $e\in E^{\prime}$ we have $stb(e,b_{1i})=stb(e,b_{2}^{\prime})$ , hence $P_{1E^{\prime}}=P_{2E^{\prime}}$ .

Thus,

[TABLE]

∎

Theorem 3.4.17.

Let $b_{1},b_{2}\in B$ such that $MoreBlocked(b_{1},b_{2})$ . Then $C^{*}(b_{2})\geq C^{*}(b_{1})$ .

Proof outline: We prove that for every finite policy $\pi$ there is a finite policy $\pi^{\prime}$ such that $C^{\pi}(b_{2})=C^{\pi^{\prime}}(b_{1})$ .

Proof.

By induction. Let $\hat{E}$ be the subset of all edges $e\in E$ such that $e$ is unknown in $b_{1}$ and blocked in $b_{2}$ , i.e.

$\hat{E}=\{e\in E\,|\,stb(e,b_{1})=Unknown,\ stb(e,b_{2})=Blocked\}$ .

Let $B_{1}=NEXT(b_{1})$ and $B_{2}=Set-Blocked_{\hat{E}}(B_{1})$ where $B_{2}=\{b_{21},b_{22},...,b_{2n}\}$ . Let $\pi$ be a finite policy and let $\pi^{\prime}$ be a policy defined as follows:

For every belief state $b\in B$ , $\pi^{\prime}(b)=\pi(BlockEdges(\hat{E},b))$ . Meaning, $\pi^{\prime}$ maps every belief state $b$ to an action $a$ by simulating $\pi$ on the belief state $b^{\prime}=BlockEdges(\hat{E},b)$ and outputting $a=\pi(b^{\prime})$ . Clearly, an agent acting according to $\pi^{\prime}$ will never traverse any edge in $\hat{E}$ . We show by induction that $C^{\pi}(b_{2})=C^{\pi^{\prime}}(b_{1})$ as follows,

•

Base case: If $b_{1},b_{2}$ are terminal states then $C^{\pi}(b_{2})=C^{\pi^{\prime}}(b_{1})\equiv 0$ (by definition of terminal states).

•

Assume by induction that for every $b_{1}^{\prime}\in B_{1}$ and $b_{2}^{\prime}\in B_{2}$ we have $C^{\pi}(b_{2}^{\prime})=C^{\pi^{\prime}}(b_{1}^{\prime})$ . By definition of $\pi^{\prime}$ we have $\pi^{\prime}(b_{1})=\pi(b_{2})$ . Let $a=\pi^{\prime}(b_{1})$ . Since $\pi^{\prime}(b_{1})=\pi(b_{2})$ we have $a=\pi(b_{2})$ . Hence, according to the Bellman equations,

[TABLE]

In order to show that $C^{\pi}(b_{2})=C^{\pi^{\prime}}(b_{1})$ we show the equivalence in the right sides of the equations above.

Given $a=Sense(e)$ then,

–

$R(b_{1},a)=R(b_{2},a)=SC(e)$ (action Sense is always performable).

–

$\sum_{b_{1}^{\prime}\in B_{1}}{Tr(b_{1},a,b_{1}^{\prime})C^{\pi^{\prime}}(b_{1}^{\prime})}=\sum_{b_{2}^{\prime}\in B_{2}}{Tr(b_{2},a,b_{2}^{\prime})C^{\pi}(b_{2}^{\prime})}$ . Let $b_{1B}\in B_{1},b_{2B}\in B_{2}$ be the belief states that are reached immediately after the agent has sensed $e$ in $b_{1},b_{2}$ respectively, and $e$ was found to be blocked. By definition of the transition function, $Tr(b_{1},a,b_{1B})=Tr(b_{2},a,b_{2B})=p(e)$ . By the induction hypothesis, $C^{\pi^{\prime}}(b_{1B})=C^{\pi}(b_{2B})$ . Hence, $Tr(b_{1},a,b_{1B})C^{\pi^{\prime}}(b_{1B})=Tr(b_{2},a,b_{2B})C^{\pi}(b_{2B})$ . Similarly, let $b_{1O}\in B_{1},b_{2O}\in B_{2}$ be the belief states that are reached immediately after the agent has sensed $e$ in $b_{1},b_{2}$ respectively, and $e$ was found to be open. Then $Tr(b_{1},a,b_{1O})=Tr(b_{2},a,b_{2O})=1-p(e)$ . By the induction hypothesis, $C^{\pi^{\prime}}(b_{1O})=C^{\pi}(b_{2O})$ . Hence, $Tr(b_{1},a,b_{1O})C^{\pi^{\prime}}(b_{1O})=Tr(b_{2},a,b_{2O})C^{\pi}(b_{2O})$ .

Thus,

[TABLE]

Where $V_{B}$ denotes $C^{\pi^{\prime}}(b_{1B})$ and $V_{O}$ denotes $C^{\pi^{\prime}}(b_{1O})$ (recall that $C^{\pi^{\prime}}(b_{1B})$ and $C^{\pi}(b_{2B})$ , as well as $C^{\pi^{\prime}}(b_{1O})$ and $C^{\pi}(b_{2O})$ , are interchangeable).

Given $a=move(e)$ , where $e=\left\langle v_{i},v_{j}\right\rangle$ , then,

–

$R(b_{1},a)=R(b_{2},a)$ . By definition of the reward function, for every $b\in B$ , $R(b,a)>0$ if and only if $Tr(b,a)>0$ if and only if $Loc(b)=v_{i}$ and $stb(e,b)=Open$ . Thus, $R(b_{1},a)=R(b_{2},a)>0$ if and only if $Loc(b_{1})=Loc(b_{2})$ and $stb(e,b_{1})=stb(e,b_{2})=Open$ . From definition 3.4.3 it follows that $Loc(b_{1})=Loc(b_{2})$ . In addition, $stb(e,b_{1})=Open$ if and only if $stb(e,b_{2})=Open$ , due to the following:

1. All belief states reachable from $b_{0}$ , and in particular the belief states $b_{1},b_{2}$ , refer to the same unknown given environment $s$ . Hence, $stb(e,b_{1})=Open$ only if $stb(e,b_{2})=Open$ or $stb(e,b_{2})=Unknown$ , and similarly $stb(e,b_{2})=Open$ only if $stb(e,b_{1})=Open$ or $stb(e,b_{1})=Unknown$ .

2. $stb(e,b_{1})\neq Unknown$ , $stb(e,b_{2})\neq Unknown$ . Since $Loc(b_{1})=Loc(b_{2})$ we have $e\in Inc(Loc(b_{1}))$ if and only if $e\in Inc(Loc(b_{2}))$ . By definition, an agent located in vertex $v$ knows the status of all edges incident to $v$ . Thus $stb(e,b_{1})\neq Unknown$ , $stb(e,b_{2})\neq Unknown$ .

–

$\sum_{b_{1}^{\prime}\in B_{1}}{Tr(b_{1},a,b_{1}^{\prime})C^{\pi^{\prime}}(b_{1}^{\prime})}=\sum_{b_{2}^{\prime}\in B_{2}}{Tr(b_{2},a,b_{2}^{\prime})C^{\pi}(b_{2}^{\prime})}$ .

Let $B^{\prime}_{1}=\{B_{11},B_{12},...,B_{1n}\}$ be a partition of $B_{1}$ by the equivalence relation $\sim_{\hat{E}}$ such that, without loss of generality, $B_{1i}=BlockEdges^{-1}(\hat{E},b_{2i})$ for every $1\leq i\leq n$ . Let $V_{i}=C^{\pi}(b_{2i})$ for every $1\leq i\leq n$ . Then, by the induction hypothesis, for every $b_{1i}\in B_{1i}$ we have $C^{\pi}(b_{2i})=C^{\pi^{\prime}}(b_{1i})$ , hence $V_{i}=C^{\pi^{\prime}}(b_{1i})$ as well.

According to lemma 3.4.16, summing up $Tr(b_{1},a,b_{1i})$ over all $b_{1i}\in B_{1i}$ gives,

[TABLE]

Hence,

[TABLE]

Thus, summing up over all transition functions gives,

[TABLE]

This completes the induction proof.

We have shown that for every finite policy $\pi$ we can define a finite policy $\pi^{\prime}$ which satisfies $C^{\pi}(b_{2})=C^{\pi^{\prime}}(b_{1})$ . Since the optimal policy is also finite, taking $\pi=\pi^{*}$ gives a policy $\pi^{\prime}$ with $C^{*}(b_{2})=C^{\pi^{\prime}}(b_{1})$ . Since no policy costs less than the optimal one, $C^{\pi^{\prime}}(b_{1})\geq C^{*}(b_{1})$ , and thus $C^{*}(b_{2})\geq C^{*}(b_{1})$ . ∎

In figure 3.4, we demonstrate the “simulation” of policy $\pi^{\prime}$ presented in theorem 3.4.17. Here, we are given a graph G=(V,E), where $V=\{s,v_{1},v_{2},v_{3},t\}$ (abusing notation, we denote one vertex as s, and one as t),

$E=\{(s,v_{1}),(v_{1},v_{2}),(v_{1},v_{3}),(v_{2},v_{3}),(v_{1},t),(v_{2},t),(v_{3},t)\}$ , where w, which is noted with each edge in the figure, represents the edge weight. In addition, two belief states $b_{1},b_{2}\in B$ are given with the following forms:

$\left\langle b_{1}\right\rangle=\left\langle s,O_{(s,v_{1})},B_{(v_{1},v_{2})},O_{(v_{1},v_{3})},O_{(v_{2},v_{3})},O_{(v_{1},t)},O_{(v_{2},t)},U_{(v_{3},t)}\right\rangle$

$\left\langle b_{2}\right\rangle=\left\langle s,O_{(s,v_{1})},B_{(v_{1},v_{2})},O_{(v_{1},v_{3})},O_{(v_{2},v_{3})},O_{(v_{1},t)},O_{(v_{2},t)},B_{(v_{3},t)}\right\rangle$ .

On the upper left of the figure, the edge statuses are based on $b_{1}$ and on the upper right the edge statuses are based on $b_{2}$ , where the green lines represent open edges, black lines represent blocked edges, and red lines represent unknown edges. Notice that $b_{1},b_{2}$ satisfy $MoreBlocked(b_{1},b_{2})$ and thus $b_{2}=BlockEdges(b_{1})$ . The lower figures represent the execution of policy $\pi^{\prime}$ on $b_{1}$ (where $\pi^{\prime}(b_{1})=\pi(b_{2})$ ) and of $\pi^{*}$ on $b_{2}$ . We see the equivalence between the policies (the same sequence of actions). Notice that an agent acting according to $\pi^{\prime}$ (as shown by the dotted line) does not perform the action $move(v_{1},t)$ although it is optimal, since $\pi^{\prime}$ treats all edges in $Blocked(b_{2})$ as blocked in $b_{1}$ (edge $(v_{1},t)$ in this figure).

Corollary 3.4.18.

Suppose that $MoreBlocked(b_{1},b_{2})$ is true for $b_{1},b_{2}\in B$ , then if $h(b_{1})$ is an admissible heuristic(optimistic) of $b_{1}$ , then $h(b_{1})$ is an admissible heuristic of $b_{2}$ as well.

Proof.

Follows from theorem 3.4.17 ∎

In the following statement, we use theorem 3.4.17 to show that if $MoreOpen(b_{1},b_{2})$ then $C^{*}(b_{2})$ is a lower bound of $C^{*}(b_{1})$ .

Lemma 3.4.19.

Let $b,b_{open}\in B$ such that $MoreOpen(b,b_{open})$ . Then, $C^{*}(b)\geq C^{*}(b_{open})$ .

Proof.

Let $b,b_{open},b_{blocked}\in B$ such that $b,b_{open},b_{blocked}$ differ only in the status of edge $\hat{e}$ , where $\hat{e}$ is Unknown in $b$ , Open in $b_{open}$ and Blocked in $b_{blocked}$ . We prove that $C^{*}(b)\geq C^{*}(b_{open})$ .

•

By the law of total probability we can express $C^{*}(b_{1})$ as follows:

[TABLE]

From theorem 3.4.17 we get that $C^{*}(b_{1})\leq C^{*}(b_{blocked})$ . Thus, there is $R\geq 0$ such that,

[TABLE]

We can express equation 3.26 as follows:

[TABLE]

Subtracting $p(\hat{e})C^{*}(b_{1})$ from both sides and then dividing both sides by $(1-p(\hat{e}))$ , we get:

[TABLE]

Since $R(\frac{p(\hat{e})}{1-p(\hat{e})})\geq 0$ we get

[TABLE]

Trivially, it can be shown by induction that $C^{*}(b)\geq C^{*}(b_{open})$ , for any set of edges $\hat{E}$ such that edges in $\hat{E}$ are unknown in $b$ and open in $b_{open}$ .

∎
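The algebra in the proof of lemma 3.4.19 can be spelled out as follows. This is a reconstruction from the surrounding prose rather than the author's typeset equations: it writes $b$ for the belief state that the proof body calls $b_{1}$ , takes $p(\hat{e})$ to be the blocking probability of $\hat{e}$ , and assumes $p(\hat{e})<1$ so that the division is valid.

```latex
\begin{align*}
C^{*}(b) &= p(\hat{e})\,C^{*}(b_{blocked}) + (1-p(\hat{e}))\,C^{*}(b_{open}) \\
C^{*}(b_{blocked}) &= C^{*}(b) + R, \qquad R \geq 0 \\
C^{*}(b) &= p(\hat{e})\,\bigl(C^{*}(b) + R\bigr) + (1-p(\hat{e}))\,C^{*}(b_{open}) \\
(1-p(\hat{e}))\,C^{*}(b) &= p(\hat{e})\,R + (1-p(\hat{e}))\,C^{*}(b_{open}) \\
C^{*}(b) &= C^{*}(b_{open}) + R\,\frac{p(\hat{e})}{1-p(\hat{e})} \;\geq\; C^{*}(b_{open})
\end{align*}
```

The first line is the total-probability decomposition stated in the proof, the second introduces the slack $R$ , and the remaining lines are the substitution, the subtraction of $p(\hat{e})C^{*}(b)$ , and the division by $1-p(\hat{e})$ .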

Corollary 3.4.20.

Suppose that $b_{1},b_{2}$ satisfy $MoreOpen(b_{1},b_{2})$ , then if $h(b_{2})$ is an admissible heuristic(optimistic) of $b_{2}$ , then $h(b_{2})$ is an admissible heuristic of $b_{1}$ as well.

Proof.

Follows immediately from lemma 3.4.19 ∎

In the rest of the section we provide some new definitions and lemmas in order to prove another lower bound to the cost of optimal policy on belief state $b_{1}$ by a cost of the optimal policy on another belief state $b_{2}$ where, in contrast to the previous lemmas, the locations of the agents in $b_{1}$ and in $b_{2}$ are different.

Definition 3.4.21.

Let $b_{1},b_{2}\in B$ . We define the predicate $DiffLoc(b_{1},b_{2})$ to be true if and only if $Loc(b_{1})\neq Loc(b_{2})$ and for every edge $e\in E$ $stb(e,b_{1})=stb(e,b_{2})$ .

Definition 3.4.22.

Define the set $D_{B}$ to be the set of all pairs $b_{1},b_{2}\in B$ such that $DiffLoc(b_{1},b_{2})$ . Meaning $D_{B}=\{<b_{1},b_{2}>|b_{1},b_{2}\in B,DiffLoc(b_{1},b_{2})\}$ . We call $D_{B}$ the DiffLoc of B.

Definition 3.4.23.

Define the set $E_{b}\subseteq E$ to be the set of all edges that are known to be open in belief state $b$ . Meaning $E_{b}=\{e|e\in E,b\in B,stb(e,b)=Open\}$ .

Definition 3.4.24.

Let $\mathcal{P}$ be the set of all paths in G and let $D_{B}$ be the DiffLoc of B. We define the function $shortestPath:D_{B}\rightarrow\mathcal{P}$ such that for $<b_{1},b_{2}>\in D_{B}$ , $p=shortestPath(<b_{1},b_{2}>)$ is the shortest path between $v_{1}=Loc(b_{1})$ and $v_{2}=Loc(b_{2})$ in the graph $G^{\prime}=(V,E_{b_{1}})$ . (Note that $G^{\prime}=(V,E_{b_{1}})$ is a subgraph of G=(V,E) since $E_{b_{1}}\subseteq E$ .)

Note that $E_{b_{1}}=E_{b_{2}}$ for every $<b_{1},b_{2}>\in D_{B}$ , since the status of edges specified by $b_{1}$ and $b_{2}$ are equal.

Definition 3.4.25.

Let $\mathcal{P}$ be the set of all paths in G. We define a path cost function $C_{P}:\mathcal{P}\rightarrow\mathcal{R}$ as follows: Let $p$ be a path, then $C_{P}(p)=\sum_{e\in p}{c(e)}$ .
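As an illustration of definitions 3.4.24 and 3.4.25, the sketch below computes a shortest path restricted to the known-open edges $E_{b}$ together with its path cost $C_{P}$ . It uses Dijkstra's algorithm, which is an assumption on our part (the text does not name a shortest-path method), and the function name `shortest_open_path`, the undirected edge tuples, and the toy costs are hypothetical.

```python
import heapq


def shortest_open_path(open_edges, cost, source, target):
    """Dijkstra restricted to the known-open edges E_b.

    open_edges: iterable of undirected edges (u, v) known to be open.
    cost: dict mapping each edge (u, v) to its non-negative cost c(e).
    Returns (path as a list of edges, C_P(path)), or (None, inf) if unreachable.
    """
    adj = {}
    for (u, v) in open_edges:
        adj.setdefault(u, []).append((v, (u, v)))
        adj.setdefault(v, []).append((u, (u, v)))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, e in adj.get(u, []):
            nd = d + cost[e]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = (u, e)
                heapq.heappush(heap, (nd, v))

    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from target to source.
    path, node = [], target
    while node != source:
        node, e = prev[node]
        path.append(e)
    path.reverse()
    return path, dist[target]


# Toy instance: edge ("v2", "t") is not known to be open, so it is excluded.
cost = {("s", "v1"): 1.0, ("v1", "v2"): 2.0, ("s", "v2"): 5.0, ("v2", "t"): 1.0}
E_b = [("s", "v1"), ("v1", "v2"), ("s", "v2")]
path, c = shortest_open_path(E_b, cost, "s", "v2")
```

Because only edges of $E_{b}$ enter the adjacency map, the returned path is always traversable by an agent holding belief state $b$ .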

Definition 3.4.26.

We define the set $KE_{b}\subseteq E$ to be the set of all known edges in belief state $b$ . Meaning $KE_{b}=\{e|e\in E,b\in B,stb(e,b)=Open\vee stb(e,b)=Blocked\}$ . $KE_{b}$ is called the knowledge in b.

Lemma 3.4.27.

The value of information in the Canadian Traveler Problem is never less than zero.

Proof.

Let $b_{1},b_{2}\in B$ such that $b_{2}$ is reached from $b_{1}$ immediately after performing SENSE(e) (Suppose hypothetically that an agent in $b_{1}$ is allowed to perform action SENSE(e) once, on any edge $e\in E$ , with no cost) and we get $KE_{b_{1}}\subseteq KE_{b_{2}}$ . Hence, this lemma is true if and only if $C^{*}(b_{1})\geq C^{*}(b_{2})$ (by definition of value of information). Since $KE_{b_{1}}\subseteq KE_{b_{2}}$ , we can simulate any policy of $b_{1}$ on $b_{2}$ by “ignoring” the information received from SENSE(e) in $b_{2}$ and in particular the optimal policy $\pi^{*}$ . Therefore, we can define the policy $\pi_{b_{1}}^{*}$ such that $\pi_{b_{1}}^{*}(b_{2}^{\prime})=\pi^{*}(b_{1}^{\prime})$ for every belief state $b_{1}^{\prime}$ reachable from $b_{1}$ and belief state $b_{2}^{\prime}=b_{{1^{\prime}}_{e}=st_{e}}$ where $st_{e}=SENSE(e)$ . Since $b_{1}$ and $b_{2}$ are referred to the same physical environment, the execution of $\pi^{*}$ on $b_{1}$ will be equal to the execution of $\pi_{b_{1}}^{*}$ on $b_{2}$ (will produce the same sequence of actions). Hence, $C^{\pi^{*}_{b_{1}}}(b_{2})=C^{*}(b_{1})$ , and in general $C^{*}(b_{1})\geq C^{*}(b_{2})$ . ∎

Lemma 3.4.28.

Let $b_{1},b_{2}\in B$ such that $<b_{1},b_{2}>\in D_{B}$ , then $C^{*}(b_{1})+C_{P}(shortestPath(<b_{1},b_{2}>))\geq C^{*}(b_{2})$ .

Proof outline: We show that $C^{*}(b_{1})+C_{P}(shortestPath(<b_{1},b_{2}>))\geq C^{*}(b_{2})$ . This gives us a lower bound on $C^{*}(b_{1})$ since $C^{*}(b_{1})\geq C^{*}(b_{2})-C_{P}(shortestPath(<b_{1},b_{2}>))$ . We show this by defining a policy $\hat{\pi}$ such that executing $\hat{\pi}$ on $b_{2}$ proceeds as follows: the agent first moves along the shortest path (under the assumption that all unknown edges in $b_{2}$ are blocked) to the location of $b_{1}$ , $Loc(b_{1})$ , reaching belief state $b^{\prime}$ ; from $b^{\prime}$ onward, the agent follows the optimal policy $\pi^{*}$ .

Proof.

Assume, by way of contradiction, that

[TABLE]

Let $v_{1}=Loc(b_{1})$ and $v_{2}=Loc(b_{2})$ . We define a new policy $\hat{\pi}$ such that executing it on $b_{2}$ gives the following:

1. An agent under $\hat{\pi}$ , starting in $b_{2}$ , traverses the path $p=shortestPath(<b_{1},b_{2}>)$ directly from $v_{2}$ to $v_{1}$ (which is always possible since all edges in p are open). Let $b^{\prime}$ be the belief state that the agent reaches when arriving at $v_{1}$ .

2. Immediately after reaching $b^{\prime}$ , the agent under $\hat{\pi}$ acts according to the optimal policy until reaching t. Meaning, for any belief state $b^{\prime\prime}$ reachable from $b^{\prime}$ , $\hat{\pi}(b^{\prime\prime})=\pi^{*}(b^{\prime\prime})$ .

Clearly,

[TABLE]

We claim that $KE_{b_{1}}\subseteq KE_{b^{\prime}}$ . This results from:

1. $KE_{b_{1}}=KE_{b_{2}}$ . The knowledge in $b_{1}$ and $b_{2}$ is equal by definition of the pairs $<b_{1},b_{2}>\in D_{B}$ .

2. $KE_{b_{2}}\subseteq KE_{b^{\prime}}$ . An agent in $b_{2}$ that follows the shortest path p may obtain information if a vertex on path p (a vertex that is incident to two edges in p) is incident to an edge that has not been sensed yet.

Since an agent A1 in $b_{1}$ and an agent A2 in $b^{\prime}$ are in the same physical state $s\in S$ , and the knowledge of A1 about s is a subset of the knowledge of A2 about s $(KE_{b_{1}}\subseteq KE_{b^{\prime}})$ , we get $C^{*}(b^{\prime})\leq C^{*}(b_{1})$ (this follows from lemma 3.4.27 on the value of information).

Thus,

[TABLE]

by equation 3.28 we get,

[TABLE]

Following assumption 3.27 we get,

[TABLE]

which is a contradiction to the optimality of policy $\pi^{*}$ . Hence,

[TABLE]

∎

Corollary 3.4.29.

Let $b_{1},b_{2}\in B$ such that $<b_{1},b_{2}>\in D_{B}$ , then if $h(b_{1})$ is a lower bound to $C^{*}(b_{1})$ , then $h(b_{1})$ is a lower bound to $C^{*}(b_{2})+C_{P}(shortestPath(<b_{1},b_{2}>))$ as well.

Proof.

Follows immediately from lemma 3.4.28 ∎

We want to define two relations which will be used in the next section:

Chapter 4 Generalizing PAO*

4.1 General Propagation AO*

In many cases, PAO* dramatically lowers the running time by reducing the state space. However, it assumes that each vertex is incident to at most one unknown edge, so that each AND node in the AND/OR graph has at most two successors. We present the generalized propagation AO* algorithm (Gen-PAO in short), a generalization of PAO*, which does not assume any prior knowledge of the graph (except the edges incident to $s$ , which are always defined as Open). Gen-PAO solves the Sensing-CTP as well. Each sensing action is associated with a sensing AND node, where each sense node has only two child nodes for the two possible statuses of the sensed edge (Open/Blocked). This variant is considerably harder than the basic CTP since the agent can sense any unknown edge in any state, and hence the branching factor of the OR nodes is significantly larger.

4.1.1 Gen-PAO Heuristics

Similarly to AO* and PAO*, each iteration of Gen-PAO consists of two phases: Expansion and Propagation. Gen-PAO differs from AO* and PAO* only in the Propagation phase (i.e., the Main and Expand methods as presented in algorithm 2.1 are part of Gen-PAO as well). In the propagation phase, Gen-PAO propagates the heuristic values not only upwards to the ancestors as AO* does, but to the entire state space, incorporating three novel heuristics: HBlocked, HOpen, and HDiffLoc (line 13). The heuristic HBlocked is based on the predicate MoreBlocked (definition 3.4.3), HOpen is based on the predicate MoreOpen (definition 3.4.4), and HDiffLoc is based on the predicate DiffLoc (definition 3.4.21). Let $Z$ be the set of belief states that have been expanded by Gen-PAO and let $b\in Z$ . The heuristics are defined as follows:

•

HBlocked(b): If there is a belief state $b^{\prime}\in Z$ that satisfies $MoreBlocked(b,b^{\prime})$ and $h(b)>h(b^{\prime})$ then $h(b^{\prime})\leftarrow h(b)$ .

•

HOpen(b): If there is a belief state $b^{\prime}\in Z$ that satisfies $MoreOpen(b^{\prime},b)$ and $h(b^{\prime})<h(b)$ then $h(b^{\prime})\leftarrow h(b)$ .

•

HDiffLoc(b): If there is a belief state $b^{\prime}\in Z$ that satisfies $DiffLoc(b,b^{\prime})$ and $h(b^{\prime})<h(b)-CSP(b,b^{\prime})$ then $h(b^{\prime})\leftarrow h(b)-CSP(b,b^{\prime})$ , where CSP is the cost of the shortest path from $b$ to $b^{\prime}$ .

Belief states whose values are updated due to propagation from b are called propagated belief states of b. Notice that $HBlocked(b),HOpen(b)$ , and $HDiffLoc(b)$ can only raise the heuristic values of the propagated belief states of b. However, due to corollaries 3.4.18, 3.4.20, and 3.4.29 respectively, the heuristics $HBlocked(b),HOpen(b)$ , and $HDiffLoc(b)$ are admissible, and thus they are upper bounded by $V^{*}(b)$ .
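The three update rules can be sketched as a single pass over the expanded set $Z$ . The function name `propagate_from`, the convention of passing the predicates and the shortest-path cost as callables, and the toy relation tables in the usage lines are all hypothetical; the actual predicates are those of definitions 3.4.3, 3.4.4, and 3.4.21, which this sketch does not implement.

```python
def propagate_from(b, Z, h, more_blocked, more_open, diff_loc, csp):
    """Raise h(b2) for every expanded b2 related to b by one of the three rules.

    Z: iterable of expanded belief states; h: dict belief -> heuristic value.
    more_blocked, more_open, diff_loc: caller-supplied predicates on two beliefs.
    csp(b, b2): cost of the shortest known-open path between their locations.
    Returns the list of belief states whose value changed.
    """
    changed = []
    for b2 in Z:
        if b2 == b:
            continue
        candidate = h[b2]
        if more_blocked(b, b2):        # HBlocked: h(b2) <- h(b) if h(b) is larger
            candidate = max(candidate, h[b])
        if more_open(b2, b):           # HOpen: h(b2) <- h(b) if h(b) is larger
            candidate = max(candidate, h[b])
        if diff_loc(b, b2):            # HDiffLoc: h(b2) <- h(b) - CSP(b, b2)
            candidate = max(candidate, h[b] - csp(b, b2))
        if candidate > h[b2]:
            h[b2] = candidate
            changed.append(b2)
    return changed


# Toy usage: belief states are plain labels and the relations are lookup tables.
h = {"b": 10.0, "b_blk": 7.0, "b_loc": 4.0, "b_other": 1.0}
rel_blocked = {("b", "b_blk")}
rel_diffloc = {("b", "b_loc")}
changed = propagate_from(
    "b", list(h), h,
    more_blocked=lambda x, y: (x, y) in rel_blocked,
    more_open=lambda x, y: False,
    diff_loc=lambda x, y: (x, y) in rel_diffloc,
    csp=lambda x, y: 3.0,
)
```

Each rule only ever replaces a value by a larger one, which matches the observation that propagation never lowers a heuristic value.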

Figure 4.1 illustrates an update of belief state $b_{2}$ by the three heuristic methods (the new heuristic values of $b_{2}$ are noted in parentheses).

The heuristic methods are invoked when the value of a belief state is updated (procedure Propagate, line 13). The heuristic methods are ineffective on a major part of the expanded states (i.e. most of the expanded states $b^{\prime}\in Z$ do not satisfy the predicates $MoreBlocked(b,b^{\prime}),MoreOpen(b,b^{\prime})$ , and $DiffLoc(b,b^{\prime})$ for a given expanded state $b$ , and their values are not updated by the corresponding heuristic methods). In order to reduce the number of expanded states that are checked for update, we use two data structures: $BlockedStructure$ and $OpenStructure$ . To define these structures we introduce new equivalence relations. The equivalence relations $\equiv_{o}$ and $\equiv_{b}$ on belief states $b_{1},b_{2}$ are defined as follows:

Definition 4.1.1.

$b_{1}\equiv_{o}b_{2}$ if and only if:

1. $Loc(b_{1})=Loc(b_{2})$ .

2. $Open(b_{1})=Open(b_{2})$ .

Similarly,

Definition 4.1.2.

$b_{1}\equiv_{b}b_{2}$ if and only if:

1. $Loc(b_{1})=Loc(b_{2})$ .

2. $Blocked(b_{1})=Blocked(b_{2})$ .

Each of these structures is a hash table that contains the entire expanded state space $Z$ (more precisely, the hash table refers to $Z$ ), where the entries of each table divide $Z$ into equivalence classes called “buckets”. Namely, the set of buckets $\{m_{o1},...,m_{on}\}$ in $BlockedStructure$ partitions $Z$ into equivalence classes by the relation $\equiv_{o}$ , while the set of buckets $\{m_{b1},...,m_{bn}\}$ in $OpenStructure$ partitions $Z$ into equivalence classes by the relation $\equiv_{b}$ . By the definitions above, $HBlocked(b)$ never updates the heuristic value $h(b^{\prime})$ if $b$ and $b^{\prime}$ do not share the same bucket of $BlockedStructure$ , and similarly, $HOpen(b)$ never updates the value of $b^{\prime}$ if $b$ and $b^{\prime}$ do not share the same bucket of $OpenStructure$ .

Procedure $propBlocked$ (Algorithm 4.2) implements the heuristic $HBlocked$ . $classBlocked(b)$ (line 1) returns the set $Z_{B}$ of all expanded nodes whose belief states are in the same bucket of $BlockedStructure$ as the belief state of node $z$ . Similarly, procedure $propOpen$ implements the heuristic $HOpen$ . $classOpen(b)$ (line 1) returns the set $Z_{O}$ of all expanded nodes whose belief states are in the same bucket of $OpenStructure$ as the belief state of node $z$ .
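A minimal sketch of the two bucket tables, assuming a belief state is given as a location plus a dictionary from edges to status letters. The class name `BucketIndex` and its methods are hypothetical; the only property taken from the text is the choice of keys, namely location plus the set of open edges for $\equiv_{o}$ and location plus the set of blocked edges for $\equiv_{b}$ .

```python
from collections import defaultdict

OPEN, BLOCKED, UNKNOWN = "O", "B", "U"


def edges_with(status, wanted):
    """Frozen set of edges whose status equals `wanted` (usable as a dict key)."""
    return frozenset(e for e, s in status.items() if s == wanted)


class BucketIndex:
    """Hash table grouping expanded belief states by (location, key edge set)."""

    def __init__(self, wanted):
        self.wanted = wanted  # OPEN for the relation =_o, BLOCKED for =_b
        self.buckets = defaultdict(list)

    def key(self, loc, status):
        return (loc, edges_with(status, self.wanted))

    def add(self, name, loc, status):
        self.buckets[self.key(loc, status)].append(name)

    def same_bucket(self, loc, status):
        # Candidates for an update: only states in the same equivalence class.
        return list(self.buckets.get(self.key(loc, status), []))


blocked_structure = BucketIndex(OPEN)    # buckets are classes of =_o
open_structure = BucketIndex(BLOCKED)    # buckets are classes of =_b
states = {
    "b1": ("v1", {"e1": OPEN, "e2": UNKNOWN}),
    "b2": ("v1", {"e1": OPEN, "e2": BLOCKED}),
    "b3": ("v1", {"e1": OPEN, "e2": OPEN}),
}
for name, (loc, status) in states.items():
    blocked_structure.add(name, loc, status)
    open_structure.add(name, loc, status)

candidates_blocked = blocked_structure.same_bucket(*states["b1"])
candidates_open = open_structure.same_bucket(*states["b1"])
```

In this toy instance `b1` and `b2` share a bucket of the first table because they agree on the open edges, while `b1` and `b3` share a bucket of the second because they agree on the blocked edges, so each heuristic inspects two states instead of three.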

4.1.2 Eliminating Duplicate Nodes

In most cases Gen-PAO expands the same node more than once. This may lead to a large expense of memory and run time when it is run on large graphs. Taking this into consideration, we introduce the Gen-PAO-EDN (short for Gen-PAO Eliminating Duplicate Nodes) algorithm, a variation of Gen-PAO that maintains a single OR node for every state by merging all duplicate OR nodes that share the same state into one OR node (more precisely, by preventing the expansion of duplicate nodes). There are two key differences between Gen-PAO-EDN and Gen-PAO:

•

Gen-PAO-EDN maintains one representative OR node for each state. When Gen-PAO-EDN expands an AND node, it creates a new OR node only if its associated state is not represented by any OR node in the AND/OR graph. Otherwise, if a representative OR node for this state already exists, then the expanded AND node becomes an additional parent of the representative OR node.

•

The AND/OR graph may contain cycles(not a tree as in AO* and Gen-PAO). A special type of cycle, called strongly connected (defined below), induces loops in the propagation phase if the cycle is a subgraph of the partial solution.

Definition 4.1.3.

Let $AO$ be an AND/OR graph, $A_{1},...,A_{n}\in AO$ be AND nodes, and $O_{1},...,O_{n}\in AO$ be OR nodes. A cycle $O_{1}\rightarrow A_{1}\rightarrow O_{2}\rightarrow A_{2}\rightarrow...\rightarrow O_{n}\rightarrow A_{n}\rightarrow O_{1}$ is strongly connected if for every $1\leq i\leq n$ , $A_{i}$ is the preferred son of $O_{i}$ .

If the propagate method enters a strongly connected cycle $C$ (which occurs when the propagation goes upwards to the ancestors), the heuristic values are re-updated in every iteration, where each update slightly raises the values of the nodes in $C$ .

At some point one of the following eventually happens:

1. The value of one of the AND nodes in $C$ is raised to a level at which it ceases to be the preferred successor of its OR parent. Namely, at some point in the update process, there is an AND node $n$ with a sibling $n^{\prime}$ such that $h(n^{\prime})<h(n)$ . Then $n^{\prime}$ becomes the preferred son. Hence, the cycle is no longer strongly connected and the loop ends.

2. The propagation process in $C$ raises the values of the nodes in $C$ until the values converge to a certain finite limit.

Clearly, if the values of the nodes in $C$ do not converge then case 1 must hold. In case 2, the propagation may enter an endless loop if the values do not converge within a finite number of iterations. To overcome this, each time the value of a node $n\in Z$ is updated, we check the delta $\Delta(n)=h(n)-h_{prev}(n)$ , and stop the loop if $\Delta(n)<\epsilon$ , where $\epsilon$ is a small positive constant chosen before the run, and $h_{prev}(n)$ is the value of $n$ before the update. It should be noted that $\epsilon$ is chosen to be so small that it does not change the propagation process (i.e. if case 2 holds then case 1 would not hold even if the loop were never stopped).
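The stopping rule can be sketched on a single value that is repeatedly fed back through a cycle. The function name `propagate_until_stable`, the `max_iters` guard, and the toy update $h\mapsto 1+0.5h$ (a cycle that returns to the same node with probability 0.5 at unit cost, whose limit is 2) are assumptions made for illustration and do not appear in the text.

```python
def propagate_until_stable(update, h0, epsilon=1e-6, max_iters=1_000_000):
    """Repeat a value update until the increase h - h_prev drops below epsilon.

    update: function mapping the previous value of a node to its new value.
    Returns (final value, number of updates performed).
    """
    h_prev = h0
    for iterations in range(1, max_iters + 1):
        h = update(h_prev)
        if h - h_prev < epsilon:       # Delta(n) < epsilon: stop the loop
            return h, iterations
        h_prev = h
    # Safety guard added for this sketch only.
    raise RuntimeError("no convergence within max_iters")


# Toy cycle: each pass costs 1 and re-enters the cycle with probability 0.5.
value, steps = propagate_until_stable(lambda h: 1.0 + 0.5 * h, h0=0.0)
```

The values increase monotonically toward the limit, so the loop ends once successive updates differ by less than $\epsilon$ , which corresponds to case 2.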

As we proposed a unifying approach to Gen-PAO, we now propose a unifying approach to AO*. The algorithm AO-EDN is an improvement of the AO* algorithm which unifies the OR nodes that are associated with the same state. In fact, AO-EDN is the same algorithm as Gen-PAO-EDN except that it does not include the heuristics HBlocked and HOpen in the propagation phase.

Chapter 5 Empirical Results

In order to evaluate our scheme we implemented alternative algorithms for Gen-PAO and compared them by their execution time and by the size of their generated AND/OR graph (defined as the number of its nodes). Note that although the size of the AND/OR graph and the run time of Gen-PAO-EDN may decrease as a result of the heuristic propagation and node unification, the algorithms described in this section still require time exponential in the number of unknown edges, which makes this approach prohibitive for graphs with large sets of unknown edges.

5.1 Varying the Uncertainty of the Graph

In the first two experiments we explored how the uncertainty of the graph affects the performance of AO* (Section 2.4), Gen-PAO (Section 4.1), and AO-EDN (Section 4.1.2). The performance of each algorithm was measured for different graph sizes, where each graph had a different number of unknown edges. To ensure that the experiments could be performed within a reasonable time frame, the parameters were chosen so that a single run takes no more than a few minutes.

Figure 5.1 compares the performance of the algorithms above on instances of basic-CTP. Figure 5.1a and figure 5.1b show respectively the change in the execution time and in the size of the AND/OR graph as the number of unknown edges increases from 2 to 12. This comparison indicates that Gen-PAO has a significant advantage in execution time over AO*, since the embedded heuristics in Gen-PAO dramatically lower the size of the AND/OR graph. Moreover, Gen-PAO has a slight advantage in execution time over AO-EDN although the size of the AND/OR graph generated by AO-EDN is smaller than the graph generated by Gen-PAO. The increased execution time of AO-EDN is incurred by the overhead of the iterative propagation in the redundancy elimination process (Section 4.1.2), which depends on the value of the default edge (the default edge cost was chosen to be 100).

Figure 5.2 shows the comparison between AO-EDN and Gen-PAO on instances of Sense-CTP (the sensing cost was fixed to 0.5 for all edges). AO* was discarded from this comparison due to an extremely large execution time. In contrast to previous comparison, here AO-EDN outperforms Gen-PAO in AND/OR graph size (figure 5.2b) as well as in execution time (figure 5.2a). The elimination of redundancy nodes provides an advantage despite the overhead, since the number of expansions saved by the unification increases considerably as the number of unknown edges ascends. The plot does not contain more than 7 unknown edges since Gen-PAO consumes all the RAM on larger graphs.

It should be mentioned that since the performances of Gen-PAO-EDN (Section 4.1.2) and Gen-PAO are almost identical on instances of basic-CTP and Sense-CTP, the performance of Gen-PAO-EDN is not presented.

5.2 Gen-PAO Heuristic Estimate

5.2.1 Experimental Setting

We now define a variant of the Canadian Traveler Problem called Expensive-Edges CTP (Exp-CTP for short). Exp-CTP is defined as CTP, except that each edge $e\in E$ can be expensive or cheap instead of blocked or unblocked. Formally, Exp-CTP is a 6-tuple $I=(G,P,w,s,t,DC)$ where $G=(V,E)$ is a graph, $P$ and $w$ are, respectively, the probability and cost functions over the edges, $s,t\in V$ are the start and goal vertices, and $DC$ is a positive real number. $P(e)$ denotes the probability that $e$ is cheap and $1-P(e)$ denotes the probability that $e$ is expensive. An agent can traverse an edge $e\in E$ whether it is cheap or expensive. If the agent traverses $e$ and $e$ is cheap, it pays $w(e)$; if $e$ is expensive, it pays $DC$, where $DC$ (short for detour cost) is a fixed cost that is higher than any edge cost (except the cost of the default edge $\left\langle s,t\right\rangle$). In fact, Exp-CTP can also be defined as a subclass of CTP in which every unknown edge $\left\langle v_{i},v_{j}\right\rangle$ in $G$ has a parallel path $l_{ij}=\left\langle\left\langle v_{i},v_{k}\right\rangle,\left\langle v_{k},v_{j}\right\rangle\right\rangle$, called the detour path, such that the path cost of $l_{ij}$ is $DC$ and $l_{ij}$ is always traversable. Namely,

• $w(\left\langle v_{i},v_{k}\right\rangle)=DC$ and $w(\left\langle v_{k},v_{j}\right\rangle)=0$

• $P(\left\langle v_{i},v_{k}\right\rangle)=1$ and $P(\left\langle v_{k},v_{j}\right\rangle)=1$
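The reduction from Exp-CTP to CTP described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the `Edge` record, the function name `exp_ctp_to_ctp`, and the naming scheme for the intermediate vertex $v_{k}$ are assumptions introduced here, and an edge is treated as unknown whenever its probability is below 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    u: str          # first endpoint
    v: str          # second endpoint
    cost: float     # w(e): cost paid when the edge is open (cheap)
    p_open: float   # P(e): probability that the edge is open (cheap)


def exp_ctp_to_ctp(edges, detour_cost):
    """Add an always-traversable detour path of total cost DC next to
    every unknown edge (hypothetical helper, for illustration only)."""
    result = []
    for i, e in enumerate(edges):
        result.append(e)
        if e.p_open < 1.0:  # assumption: p_open < 1 marks an unknown edge
            k = f"detour_{i}"  # made-up name for the intermediate vertex v_k
            # w(<v_i, v_k>) = DC and P(<v_i, v_k>) = 1
            result.append(Edge(e.u, k, detour_cost, 1.0))
            # w(<v_k, v_j>) = 0 and P(<v_k, v_j>) = 1
            result.append(Edge(k, e.v, 0.0, 1.0))
    return result


edges = [Edge("s", "a", 1.0, 0.5), Edge("a", "t", 2.0, 1.0)]
ctp_edges = exp_ctp_to_ctp(edges, detour_cost=7.0)
```

In the resulting CTP instance, an agent that finds the edge between `s` and `a` blocked can still reach `a` through the detour vertex at total cost 7.0, which mirrors paying $DC$ for an expensive edge.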

To evaluate the performance of the Gen-PAO heuristics we implemented four variants of Gen-PAO-EDN, each embedding a different heuristic in the propagation phase. Since the heuristics have almost no impact when Gen-PAO-EDN is applied to instances of basic-CTP and Sense-CTP, the algorithms were executed on instances of Exp-CTP. The implemented algorithms are as follows:

• PAO-Blocked - Gen-PAO-EDN which propagates the heuristic values according to HBlocked (Section 4.1).

• PAO-Open - propagates the heuristic values according to HOpen (Section 4.1).

• PAO-All - propagates the heuristic values according to HOpen and HBlocked.

• PAO-None - basic propagation with no heuristic included.

5.2.2 Varying the Sensing Cost

To learn the effect of the sensing cost on the algorithms' performance, we conducted several runs with different fixed sensing costs (the sensing cost was equal for all edges) on a graph consisting of 8 vertices and 13 edges (10 of which are unknown). In all experiments, the probability of all unknown edges was fixed to 0.5. Figure 5.3a shows the change in the size of the AND/OR graph as the sensing cost increases from 0.1 to 1.1. This result indicates that the size of the AND/OR graph (generated by all variants of Gen-PAO) decreases as the sensing cost increases. We believe that this can be attributed to the larger number of states expanded in the AND/OR graph when the sensing cost is low, which makes the sensing action worthwhile. In particular, there exists a limit $m$ such that every sensing cost below $m$ makes the Sense actions always preferable to the Move actions. This causes many expansions of Sense nodes and of new belief states (that are not reachable without performing Sense), which results in a large AND/OR graph.

The comparison of the algorithms shows that PAO-None generates a relatively small AND/OR graph for low sensing costs, while PAO-Blocked and PAO-All have an advantage for high sensing costs. This also holds for larger graphs that contain larger sets of unknown edges. We believe that this effect can be explained by the fact that at low sensing costs it is worthwhile to sense unknown edges, which improves the accuracy of the heuristic estimates (at low cost). The high accuracy of the heuristic estimates leads to low pruning rates, since the heuristics HBlocked and HOpen rely on the gap between the real and estimated values, which is small in this case. Thus, a large AND/OR graph is obtained.

A comparison of the run time (Figure 5.3b) shows that the run time grows with the size of the AND/OR graph. The reason for this positive correlation is straightforward: a larger graph requires more computation time for expanding the states, as well as for propagating the heuristic values to a larger set of states.
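The existence of a sensing-cost limit $m$ can be illustrated on a toy instance. The sketch below is an assumption-laden example, not one of the experimental graphs: the agent stands at $s$, a known edge of cost `a` leads to a vertex $v$, an unknown edge of cost `b` with open probability `p` leads from $v$ to $t$, and a default edge of cost `default_cost` connects $s$ and $t$ directly. All function names are hypothetical.

```python
def cost_move_blind(a, b, p, default_cost):
    """Walk to v without sensing: if <v,t> is open pay b, otherwise
    walk back to s (cost a) and take the default edge."""
    return a + p * b + (1 - p) * (a + default_cost)


def cost_sense_first(a, b, p, default_cost, sense_cost):
    """Sense <v,t> from s, then take the cheaper available route."""
    open_route = min(a + b, default_cost)
    return sense_cost + p * open_route + (1 - p) * default_cost


def sensing_limit(a, b, p, default_cost):
    """Largest sensing cost for which sensing first is not worse than
    the best policy that never senses (toy analogue of the limit m)."""
    best_without_sensing = min(cost_move_blind(a, b, p, default_cost),
                               default_cost)
    return best_without_sensing - cost_sense_first(a, b, p, default_cost, 0.0)


m = sensing_limit(a=1.0, b=1.0, p=0.5, default_cost=100.0)
```

With these made-up numbers the limit equals the expected cost of the wasted round trip to $v$, namely $(1-p)\cdot 2a$; any sensing cost below it makes sensing first the better choice in this toy instance.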

5.2.3 Varying the Open Probability

In this experiment we investigated the effect of the probability distribution over the edges on the performance of the variants of Gen-PAO-EDN (Section 4.1.2). To keep the experiment simple, we configured the graph such that all unknown edges are open with the same fixed probability, called the open probability, which is given as an input. Figure 5.4 illustrates the performance of the different heuristics on a graph with 19 unknown edges for $DC=7$ and $DC=9$. Figures 5.4a and 5.4c show the change in the size of the AND/OR graph as the open probability increases from 0.1 to 0.9. These results indicate that for all algorithms there exists a value of the open probability $p$ ($p=0.5$ in Figure 5.4a and $p=0.3$ in Figure 5.4c) such that for any open probability $p^{\prime}$ smaller than $p$ (called low open probability) the size of the AND/OR graph increases as $p^{\prime}$ rises, while for any open probability $p^{\prime\prime}$ larger than $p$ (called high open probability) the size of the AND/OR graph decreases as $p^{\prime\prime}$ rises.

This can be explained as follows (the explanation applies to the AND/OR graphs generated by all algorithms):

• At high open probability, most of the decision nodes (OR nodes) choose their best action node correctly when first expanded and do not change their decision afterwards. Thus, a relatively large portion of the expanded states is also part of the optimal policy graph, and the AND/OR graph is relatively small. However, as the open probability decreases, the AND/OR graph grows, since the heuristic estimates are less accurate and more alternative actions are considered for the optimal policy. This leads to excessive expansion of nodes and a larger AND/OR graph.

• At low open probability, as the open probability decreases, the graph becomes “more blocked” and the default path becomes preferable. In such cases, all variants of Gen-PAO-EDN tend to prune action nodes that are not associated with the default path (sensing or traversing edges that are not on the default path) and, as a result, a smaller AND/OR graph is obtained.

The comparison between the heuristics of Gen-PAO-EDN shows an advantage for PAO-Blocked and PAO-All at low open probabilities. This is due to the high pruning rate achieved by HBlocked at low open probability, where the gap between the heuristic values and the real values is large. Again, HBlocked is effective since the chances that the heuristic value of a different belief state will be updated are high (see the conditions of HBlocked in Section 4.1).

Figures 5.4b and 5.4d show the time spent by the four algorithms. As in the previous experiments, there is a tight correlation between the execution time and the AND/OR graph size. The AND/OR graphs generated by PAO-Blocked and PAO-All are smaller than those generated by PAO-None at all levels of open probability; however, the runtime advantage of PAO-Blocked and PAO-All appears at low open probability.

5.3 Value of Clairvoyance

In order to get a general indication of the total value of information, we examined the ratio $RV$ (see Papadimitriou and Yannakakis, 1991) on instances of basic CTP and Exp-CTP. $RV$ is defined as $\frac{C^{*}}{AS}$, where $C^{*}$ is the expected cost of the optimal policy and $AS$ is the expected cost of the optimal policy given that the graph is fully observable (this can also be described as the expected cost of the policy Always Sense when the sensing cost is 0; see Bnaya, Felner and Shimony). Formally, let $l_{1},l_{2},...,l_{n}$ be the paths in the graph ordered by their path cost, $P_{i}$ be the probability that path $l_{i}$ is traversable, and $C_{i}$ be the path cost of path $l_{i}$; then $AS$ can be described as follows:

[TABLE]
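As a hedged illustration of how $AS$ and $RV$ might be computed from the quantities $C_{i}$ and $P_{i}$ defined above: with full observability the agent takes the cheapest traversable path, so path $l_{i}$ is used exactly when it is traversable and every cheaper path is not. Assuming, purely for this sketch, that the traversability events of different paths are independent (which fails when paths share unknown edges), this gives $AS=\sum_{i} C_{i}P_{i}\prod_{j<i}(1-P_{j})$. The function names below are hypothetical.

```python
def always_sense_cost(paths):
    """Expected cost of taking the cheapest traversable path.

    paths: list of (C_i, P_i) pairs in any order.
    Assumption (not from the source): path traversability events are
    independent, and the last path (e.g. the default edge) has P = 1.
    """
    expected = 0.0
    prob_cheaper_all_blocked = 1.0
    for cost, p in sorted(paths):  # cheapest path first
        expected += prob_cheaper_all_blocked * p * cost
        prob_cheaper_all_blocked *= (1.0 - p)
    return expected


def value_ratio(c_star, paths):
    """RV = C* / AS for a given optimal expected cost C*."""
    return c_star / always_sense_cost(paths)


# Made-up instance: two uncertain paths and a default edge of cost 20.
paths = [(2.0, 0.5), (4.0, 0.5), (20.0, 1.0)]
as_cost = always_sense_cost(paths)
rv = value_ratio(14.0, paths)
```

In this made-up instance $AS = 0.5\cdot 2 + 0.25\cdot 4 + 0.25\cdot 20 = 7$, so a hypothetical optimal expected cost of $C^{*}=14$ would give $RV=2$.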

We performed experiments on instances of basic CTP for different values of the open probability and of the default edge cost. Results for graph 7V11E (Figure 5.5) show that $RV$ is relatively high for low values of the default edge cost (where the default edge cost is 20). This can be explained by the fact that $AS$ is relatively low, since the agent would not traverse the default edge if there exists another open path to the target, whereas $C^{*}$ is almost as high as the cost of the default edge, since it is usually worthwhile to traverse the default edge when MaxEdge is low (note that $C^{*}$ is always lower than MaxEdge). In addition, $RV$ is high for low open probabilities (i.e. for $p\in[0.1,0.3]$), since the graph “tends” to be blocked and the default edge is preferable to the “cheap” paths. However, for an extremely high default edge cost (not illustrated in the figure), i.e. for $MaxEdge>300$, $RV$ is low even at low open probability (around 1.3), since the agent takes the default edge only if there is no open path other than the default edge.

An analogous experiment was performed on instances of Exp-CTP for the same graph as in the previous experiment. $RV$ was measured for different values of $DC$ and of the open probability while the default edge cost remained fixed (at 200). Figure 5.6 shows that the results are qualitatively similar to those of the previous experiment; however, lower values of $RV$ were obtained across the whole domain. The reason for this similarity is the same as in the previous experiment, except that now the agent prefers to traverse the detour path instead of the default edge. $RV$ is lower than in the previous experiment since the path costs are, on average, higher (it is sometimes necessary to pay $DC$ several times) and thus $AS$ is higher.

Chapter 6 Summary

6.1 Contributions

In this thesis we explored the Canadian Traveler Problem theoretically and empirically. In the theoretical analysis, the following results have been proved:

• Correlated-CTP is at least as hard as Sensing-CTP.

• CTP-PATH-DEP is NP-hard.

• CTP-FOR-DEP is solvable in polynomial time.

• Properties of Belief MDP for CTP.

The main aspect of the practical analysis is the Gen-PAO framework, whose main contributions are:

• Gen-PAO extends the PAO* algorithm such that it is not restricted to special types of graphs.

• Gen-PAO optimally solves instances of Exp-CTP and Sensing-CTP in addition to basic CTP.

• Two heuristics, HBlocked and HOpen, have been proposed. HBlocked and HOpen can be plugged into Gen-PAO and in some cases reduce the size of the AND/OR graph and the execution time.

In addition, we analyzed the parameter $RV$ for instances of Exp-CTP and basic CTP and showed its general behavior.

6.2 Future work

Much remains to be done in the theoretical analysis of the CTP, in particular classifying other subclasses of the CTP. On the practical side, Gen-PAO can be further modified to solve other types of CTP, such as Correlated CTP and multi-agent CTP. Moreover, we believe that Gen-PAO can be extended to other types of POMDP problems. It may also be worthwhile to improve the performance of Gen-PAO by implementing heuristics that specialize in specific types of graphs.

Bibliography

[1] D. F. Anthony, A. Stentz, and S. Thrun. PAO* for planning with hidden state. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation, pages 2840–2847, 2004.
[2] A. Bar-Noy and B. Schieber. The Canadian traveller problem. In SODA, pages 261–270, 1991.
[3] Z. Bnaya, A. Felner, and S. E. Shimony. Canadian traveler problem with remote sensing. In IJCAI, pages 437–442, 2009.
[4] S. Koenig, C. A. Tovey, and Y. V. Smirnov. Performance bounds for planning in unknown terrain. Artif. Intell., 147(1-2):253–279, 2003.
[5] E. Nikolova and D. R. Karger. Route planning under uncertainty: The Canadian traveller problem. In AAAI, pages 969–974, 2008.
[6] C. Papadimitriou and M. Yannakakis. Shortest paths without a map. Theor. Comput. Sci., 84(1), 1991.
