Theoretical and Experimental Analysis of the Canadian Traveler Problem
Doron Zarchy

TL;DR
This paper provides a comprehensive theoretical and experimental analysis of the Canadian Traveler Problem, introducing a new variant with dependencies, and proposing an optimal algorithm that outperforms existing methods.
Contribution
It introduces Dep-CTP with dependencies, proves its intractability, and develops Gen-PAO, an optimal algorithm for multiple CTP variants with improved performance.
Findings
Dep-CTP is intractable.
Gen-PAO solves multiple CTP variants optimally.
Pruning methods improve search efficiency.
Thesis for the degree Master of Science
Theoretical and Experimental Analysis of the Canadian Traveler Problem
Submitted by: Doron Zarchy
Advisor: Prof. Eyal Shimony
Department of Computer Science
Faculty of Science
Ben Gurion University of the Negev
Abstract
Devising an optimal strategy for navigation in a partially observable environment is one of the key objectives in AI. One of the problems in this context is the Canadian Traveler Problem (CTP). CTP is a navigation problem in which an agent is tasked to travel from a source to a target in a partially observable weighted graph, whose edges may each be blocked with a certain probability; such a blockage is observed only upon reaching one of the edge's endpoints. The goal is to find a strategy that minimizes the expected travel cost. The problem is known to be #P-hard. In this work we study the CTP theoretically and empirically. First, we study Dep-CTP, a CTP variant we introduce that assumes dependencies between the statuses of the edges. We show that Dep-CTP is intractable, and we further analyze two of its subclasses on disjoint-path graphs. Second, we develop a general algorithm that optimally solves the CTP, called General Propagating AO* (Gen-PAO). Gen-PAO is also capable of solving two other types of CTP, called Sensing-CTP and Expensive-Edges CTP. Since the CTP is intractable, Gen-PAO uses several pruning methods to reduce the search space for the optimal solution. We also define some variants of Gen-PAO, compare their performance, and show some benefits of Gen-PAO over existing work.
Chapter 1 Introduction
Planning under uncertainty is one of the most investigated problems in AI. In the real world, efficient navigation requires operating in a partially unknown or dynamically changing environment. Consider a situation where a taxi driver wants to reach a destination in the city in the shortest possible time. The experienced driver knows the road map and the length of each road. Still, the driver does not necessarily have complete knowledge of the roads’ current status. Some of the roads may be blocked due to traffic jams or police blockades. The driver needs to devise a strategy to reach the destination in the shortest expected time.
A formal model for this kind of problem is the Canadian Traveler Problem (CTP). CTP is a stochastic navigation problem, introduced by [6], in which an agent aims to travel in a weighted graph from a source vertex $s$ to a target vertex $t$. Each time the agent traverses an edge it pays a travel cost, which is defined by the edge weight. The agent has complete knowledge of the graph structure and the edge costs. However, some of the edges may be blocked with a known probability. The agent observes such a blockage only when it physically arrives at a vertex that is incident to that edge. The goal is to find a strategy for reaching $t$ from $s$ that minimizes the expected cost.
[6] showed that finding the optimal solution for the Canadian Traveler Problem is #P-hard. However, some special classes of CTP, such as CTP on disjoint-path graphs and CTP on directed acyclic graphs, are solvable in polynomial time [5, 3].
In this work, we explore certain variations of the CTP. The first variation introduced is CTP with Dependencies (Dep-CTP). In the original problem, the blocking probabilities of the edges are independent. Dep-CTP is a generalization of CTP in which we assume that dependencies exist between the status of an edge and the statuses of other edges. Specifically, we are given a Bayesian network that defines the dependencies between the edges. The second variant is CTP with remote sensing (CTP with sensing), introduced by [3]. In CTP with sensing, an agent may perform sensing on any edge, at a given sensing cost, in order to reveal its status. The third variant is Expensive-Edge CTP (Exp-CTP), a variant of CTP in which edges cannot be blocked, but are instead expensive and incur a high travel cost when traversed.
This work takes two different approaches to studying the CTP: theoretical analysis and experimental analysis. Regarding the theoretical aspect, we attempt to classify certain classes of Dep-CTP by their computational complexity using probabilistic models such as belief MDPs and AND/OR graphs, and we show some general properties of CTP with sensing. Regarding the empirical aspect, we introduce the Gen-PAO algorithm, a generalization of PAO* [1] that optimally solves the CTP, CTP with sensing and Exp-CTP. Gen-PAO uses several pruning methods to reduce the size of the searched state space and the running time. In addition, we explore the value of clairvoyance, which represents the value of having full knowledge of the graph.
The remainder of this work is organized as follows. Chapter 2 contains formal definitions of the Canadian Traveler Problem and its variants. In addition, it contains background on decision and probabilistic models and reviews a number of related algorithms. Chapter 3 presents proofs concerning the hardness of Dep-CTP and of two of its subclasses. In addition, some theoretical properties of CTP with sensing are shown. Chapter 4 introduces the Gen-PAO algorithm and some of the pruning methods it uses. Chapter 5 provides empirical results, comparing the performance of Gen-PAO and some of its variants. In addition, results concerning the value of clairvoyance are presented. Chapter 6 summarizes this work and discusses possible directions for future research. Appendix A presents some of the instances used in the empirical analysis.
Chapter 2 Background
2.1 Markov Decision Process
A Markov Decision Process (MDP) is a framework for sequential stochastic decision problems in a fully observable environment. Formally, an MDP is defined by the tuple $(S, A, T, R)$, where
- $S$ is a finite set of states, where each $s \in S$ describes the environment at a specific time step.
- $A$ is a set of actions.
- $T: S \times A \times S \to [0,1]$ is the transition function, where $T(s,a,s')$ specifies the probability of entering state $s'$ given the previous state $s$ and the action $a$.
- $R: S \times A \times S \to \mathbb{R}$ is the reward function, where $R(s,a,s')$ specifies the reward received by transitioning from state $s$ to state $s'$ after performing action $a$ in $s$.
At each time step, the agent in state $s$ chooses an action $a$, reaches state $s'$ with probability $T(s,a,s')$, and obtains reward $R(s,a,s')$. A deterministic variant of the MDP defines deterministic actions, where each pair of state and action determines the resulting state, i.e., there exists a state $s'$ for which $T(s,a,s')=1$, and $T(s,a,s'')=0$ for every $s'' \neq s'$. The model assumes that the transitions are Markovian, in the sense that the probability of reaching a state depends only on the previous state and the action rather than on a history of earlier states. The solution of an MDP is a policy. A policy $\pi$ is a mapping from the set of states to the set of actions. At each time step the policy is executed, starting from an initial state $s_0$. With a complete policy, the agent always knows what to do next. However, because the environment is stochastic, executing the same policy may lead to different environment histories. The decision-making problem may have a finite horizon or an infinite horizon. A finite horizon $N$ bounds the number of time steps for which the agent acts (or, equivalently, treats the rewards after time $N$ as zero). In this case the utility function is usually an additive reward function:
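$$U([s_0, s_1, \dots, s_N]) = \sum_{t=0}^{N-1} R(s_t, a_t, s_{t+1}),$$
where $a_t$ is the action taken in state $s_t$.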
However, in an infinite horizon the time sequence is unbounded. In an infinite horizon the utility function is computed with a discount rate $\gamma$:
$$U([s_0, s_1, \dots]) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})$$
This utility is called the discounted reward. Usually, the performance of an agent is measured by the utility of the sequence of states visited, i.e., the sum of the rewards for those states. The utility of a policy in state $s$ is the expected utility over all possible state sequences, starting from $s$ until the MDP terminates. The utility of a policy in a finite horizon is computed using dynamic programming:
[TABLE]
Where the utility of a policy in finite horizon is given by,
[TABLE]
In an infinite horizon we usually have a terminal state. The optimal policy is the policy that yields the highest expected utility (or the lowest, depending on the specification of the problem). Given a policy $\pi$, the value of the policy in state $s$ can be computed by an algorithm called value iteration. The value iteration algorithm computes the value of every state $s$ by a reasoning process that goes backwards in time, from the end, in order to determine the optimal sequence of actions: once the last action is chosen, we can determine the best second-to-last action, and so on. This process continues until a best action has been obtained for all states. We compute the value of each state under the optimal policy using the Bellman equations:
$$V(s) = \max_{a \in A} \sum_{s' \in S} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]$$
This process is iterated until it reaches equilibrium, which indicates the convergence of the algorithm.
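The backup described above can be sketched in a few lines. This is a minimal illustration rather than the thesis's implementation: the dictionary layout of `T`, the callable `R`, the default `gamma` and `tol`, and the two-state example are all assumptions made here for the sake of a runnable example.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-9):
    """Repeat Bellman backups until the largest change in any state value is below tol.

    T[(s, a)] is a dict {next_state: probability}; R(s, a, next_state) returns a number.
    States with no entry in T for any action are treated as terminal (value 0).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())
                for a in actions
                if (s, a) in T
            ]
            new_value = max(q_values) if q_values else 0.0
            delta = max(delta, abs(new_value - V[s]))
            V[s] = new_value
        if delta < tol:
            return V


# Hypothetical two-state example: from "A", the action "go" reaches the goal "G"
# with probability 0.5 and otherwise stays in "A"; every step has reward -1.
states = ["A", "G"]
actions = ["stay", "go"]
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"G": 0.5, "A": 0.5}}
R = lambda s, a, s2: -1.0
V = value_iteration(states, actions, T, R)
```

In this toy model the fixed point satisfies $V(A) = -1 + 0.45\,V(A)$, so the loop settles near $-1/0.55$; the stopping threshold plays the role of the "equilibrium" mentioned in the text.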
2.1.1 Policy Iteration
Another approach for solving an MDP is policy iteration, which performs an iterative search in the space of policies. Each iteration consists of two steps. The first step is evaluation, in which the algorithm evaluates the values of the states given the action that the current policy $\pi$ assigns to each state:
$$V^{\pi}(s) = \sum_{s' \in S} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^{\pi}(s') \right]$$
This can be done by solving a set of linear equations. After the values are computed for the given actions, the algorithm performs the second step, improvement. The algorithm checks whether it can improve the policy by choosing a new action for some state. If such an action exists, the policy adopts the new action:
$$\pi'(s) = \arg\max_{a \in A} \sum_{s' \in S} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]$$
The algorithm guarantees that each iteration strictly improves the value of the policy. Therefore, the algorithm stops when no available action improves the policy value. The number of possible policies cannot be more than $|A|^{|S|}$, where $|S|$ is the number of states and $|A|$ is the number of actions. Since the policy improves at each iteration and the number of possible policies is at most $|A|^{|S|}$, the algorithm finds the optimal policy within no more than $|A|^{|S|}$ iterations.
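The two steps can be illustrated as follows. This is a sketch under stated assumptions: the evaluation step uses a fixed number of iterative sweeps instead of solving the linear system exactly, and the data layout (`actions[s]`, `T[(s, a)]`, callable `R`) and the toy problem are hypothetical.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, sweeps=200):
    """actions[s] lists the actions available in s (empty list for terminal states)."""
    policy = {s: actions[s][0] for s in states if actions[s]}

    def q(s, a, V):
        # Expected one-step reward plus discounted value of the successor.
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())

    while True:
        # Evaluation: approximate V^pi by repeated sweeps (assumption: not an exact solve).
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in policy:
                V[s] = q(s, policy[s], V)
        # Improvement: switch to a strictly better action wherever one exists.
        stable = True
        for s in policy:
            best = max(actions[s], key=lambda a: q(s, a, V))
            if q(s, best, V) > q(s, policy[s], V) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical example: "stay" loops in "A" forever, "go" reaches the goal "G".
states = ["A", "G"]
actions = {"A": ["stay", "go"], "G": []}
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"G": 1.0}}
R = lambda s, a, s2: -1.0
policy, V = policy_iteration(states, actions, T, R)
```

Starting from the poor initial choice "stay", one improvement step switches to "go", after which no action is strictly better and the loop terminates.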
2.2 Partially Observable Markov Decision Process
A partially observable Markov decision process (POMDP) is a generalization of the standard MDP in which the environment is not fully observable, so the agent has only imperfect information about the current state of the environment. In the real world the input is not always precise, since data may be received with noise. In robot navigation, for instance, the robot receives its input through sensors that do not describe the environment precisely. Sonar or voice sensors are usually somewhat noisy, and digital video loses information by using a discrete representation to describe a continuous environment. The POMDP is used as a framework for decision-theoretic planning and reasoning under uncertainty. Such problems arise in a wide range of application domains, including assistive technologies, mobile robotics and preference elicitation. Many real POMDP problems are naturally modeled by continuous states and observations. For instance, in a robot navigation task, the state may correspond to coordinates in space and the observations may correspond to distances measured by the sonar. A common approach to a continuous model is to discretize and approximate the continuous components with a grid. This usually leads to an important tradeoff between complexity and accuracy as the coarseness of the discretization changes. In a discrete-time POMDP, at each time period the agent is in some state $s$, chooses an action $a$, and receives a reward with some expected value. After performing the action, the agent makes a transition to a new state $s'$ according to the transition distribution, and receives an observation whose probability depends on the new state.
Formally, a POMDP is an extension of the MDP, defined by a tuple with the following components. $S$ is a finite set of states that represent the current situation in the environment. $A$ is the set of actions from which the agent chooses in each state. $T$ (the transition function) maps each pair $(s,a)$ into a distribution over the states, where $T(s,a,s')$ is the probability of reaching $s'$ when the agent is in state $s$ and performs action $a$. $R$ is the reward function, which maps each pair $(s,a)$ into a number that represents the reward or the penalty. The observation function $O(s',a,o)$ describes the probability of observing $o$ given that action $a$ was performed and state $s'$ was reached.
Generally, in a POMDP we do not know the current state. The only information given about the environment is the observations. Therefore, the POMDP defines a vector of probabilities $b$ of the size of the state set, called the belief state, which specifies for each state $s$ the probability $b(s)$ that the environment is in $s$.
Similarly to the MDP, the goal in a POMDP is to construct a policy that maximizes the expected sum of rewards $E\left[\sum_{t=0}^{T} r_t\right]$, where $T$ is the number of time steps left in a finite horizon, or the expected discounted sum $E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$ in an infinite horizon. Since the agent does not know the exact state of the environment, the reward function is defined over the belief state, i.e. $R(b,a) = \sum_{s \in S} b(s) R(s,a)$; in the case of a continuous state space the sum becomes an integral. The belief state of the environment is based on the previous belief state. Thus, after being in belief state $b$, choosing action $a$ and receiving an observation $o$, the agent updates the belief $b'(s')$ in the following way:
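$$b'(s') = P(s' \mid o, a, b) = \frac{P(o \mid s', a, b)\, P(s' \mid a, b)}{P(o \mid a, b)}$$
Using the product rule we get
$$P(s' \mid a, b) = \sum_{s \in S} P(s' \mid s, a)\, P(s \mid b) = \sum_{s \in S} T(s,a,s')\, b(s)$$
When we put it all together we get:
$$b'(s') = \frac{O(s',a,o) \sum_{s \in S} T(s,a,s')\, b(s)}{P(o \mid a, b)}$$
where $P(o \mid a, b) = \sum_{s' \in S} O(s',a,o) \sum_{s \in S} T(s,a,s')\, b(s)$ is a normalizing constant.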
Let L be the number of time
Therefore,
[TABLE]
A generalization of the discrete POMDP is the case where the state space is continuous. In this case we still assume that the actions and observations are discrete, and the belief propagation is defined by the integral
[TABLE]
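For the discrete case, the belief update can be sketched as below. The function signature, the callables `T(s, a, s2)` and `O(s2, a, o)`, and the noisy-sensor example are assumptions chosen for illustration; in the CTP models discussed later, sensing is treated as exact rather than noisy.

```python
def belief_update(b, a, o, states, T, O):
    """Return the new belief after taking action a in belief b and observing o.

    b maps each state to its probability; T(s, a, s2) and O(s2, a, o) return probabilities.
    """
    unnormalized = {
        s2: O(s2, a, o) * sum(T(s, a, s2) * b[s] for s in states)
        for s2 in states
    }
    norm = sum(unnormalized.values())  # this is P(o | a, b)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: value / norm for s2, value in unnormalized.items()}


# Hypothetical example: one edge that is either open or blocked, a "sense" action
# that leaves the state unchanged, and a sensor that is correct 80% of the time.
states = ["open", "blocked"]
T = lambda s, a, s2: 1.0 if s == s2 else 0.0
O = lambda s2, a, o: 0.8 if o == s2 else 0.2
prior = {"open": 0.5, "blocked": 0.5}
posterior = belief_update(prior, "sense", "open", states, T, O)
```

The normalizing term is exactly the probability of the observation, which is why an impossible observation is reported as an error instead of dividing by zero.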
2.2.1 Value Iteration
Having defined the belief update and the reward function for belief states, we can transform the POMDP into a belief-state MDP, that is, cast the POMDP as a fully observable MDP in which each belief state of the POMDP is a simple state of the MDP. This MDP is continuous, over an $|S|$-dimensional state space. The transformation allows us to define a value function over belief states according to the Bellman equation:
$$V(b) = \max_{a \in A} \left[ R(b,a) + \gamma \sum_{o} P(o \mid b, a)\, V(b^{a}_{o}) \right]$$
This means that the value of belief state $b$ is the reward of taking the best action in $b$ plus the discounted expected value of the resulting belief state $b^{a}_{o}$, where $b^{a}_{o}$ is the unique belief state computed from $b$, $a$ and $o$ by the belief update described earlier. Solving by value iteration with dynamic programming yields the optimal solution in the limit; however, the number of belief states that have to be backed up is enormous. Because exact value iteration is intractable, a lot of work has focused on approximate algorithms. One of the most promising approaches for finding an approximate solution is point-based value iteration (PBVI). In PBVI, instead of optimizing the value function over the entire belief space, only specific reachable beliefs are considered. The belief points are selected heuristically, and the values are computed only for these points. The heuristic simulates trajectories in order to find reachable beliefs. The success of PBVI depends on the selection of the belief points; in particular, the belief points should cover the reachable belief space as evenly as possible. The set of belief points is expanded over time in order to cover more of the reachable belief space. Adding more points increases the accuracy of the value function.
The key to a practical implementation of a dynamic programming algorithm is a piecewise-linear and convex representation of the value function. The reward function as defined above is linear in the belief. The exact solution of a POMDP is based on the proof of Smallwood and Sondik (1973), which takes advantage of the fact that the exact value function is piecewise-linear and convex and can be represented by hyperplanes in the space of beliefs. Each hyperplane is represented by a vector $\alpha$ of $|S|$ real numbers, where the value of each belief state is defined as follows:
$$V(b) = \max_{\alpha} \sum_{s \in S} \alpha(s)\, b(s)$$
Each hyperplane corresponds to a single action, and the value iteration updates can be performed directly on these hyperplanes.
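Evaluating a value function stored as a set of hyperplanes amounts to taking a maximum of dot products. The sketch below assumes each hyperplane is a dictionary from state to coefficient; the name `pwlc_value` and the two example vectors are hypothetical.

```python
def pwlc_value(belief, alpha_vectors):
    """Value of a belief under a piecewise-linear convex value function.

    belief and every alpha vector map each state to a number.
    """
    return max(
        sum(alpha[s] * belief[s] for s in belief)
        for alpha in alpha_vectors
    )


# Hypothetical hyperplanes over two states.
alphas = [
    {"open": 1.0, "blocked": 10.0},
    {"open": 4.0, "blocked": 4.0},
]
v_mixed = pwlc_value({"open": 0.5, "blocked": 0.5}, alphas)
v_open = pwlc_value({"open": 1.0, "blocked": 0.0}, alphas)
v_blocked = pwlc_value({"open": 0.0, "blocked": 1.0}, alphas)
```

Because the result is a maximum of linear functions, the value at a mixed belief never exceeds the corresponding mixture of the values at the extreme beliefs, which is the convexity property the text relies on.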
2.3 The Canadian Traveler Problem
In the Canadian Traveler Problem (CTP) [Papadimitriou and Yannakakis, 1991] a traveling agent is given a tuple $(G,P,w,s,t)$ as input, where $G=(V,E)$ is a connected weighted graph, $s \in V$ is the initial source vertex, and $t \in V$ is the target vertex. The input graph may undergo changes that are not known to the agent before the agent begins to act, but remains fixed subsequently. In particular, some of the edges in $E$ may become blocked and thus untraversable. Each edge $e$ in $G$ has a weight, or cost, $w(e)$, and is blocked with probability $p(e)$, where $p(e)$ is known to the agent. (Note that it is sufficient to deal only with blocking of edges, since a blocked vertex would have all of its incident edges blocked.) The agent can perform move actions along an unblocked edge $e$, each of which incurs a travel cost $w(e)$. Traditionally, the CTP was defined such that the status of an edge can only be revealed upon arriving at a node incident to that edge, i.e., only local sensing is allowed. In this work we call this variant the basic CTP variant. The task of the agent is to travel from $s$ to $t$ while aiming to minimize the total travel cost $C_{travel}$. As the exact travel cost is uncertain until the end, the task is to devise a traveling strategy that yields a small (ideally optimal) expected travel cost.
A somewhat more general version of CTP is CTP with sensing. In this variant the input additionally includes a sensing cost function, and, in addition to move actions (and local sensing), an agent situated at a vertex $v$ can also perform a Sense action and query the status of any edge $e \in E$. This action is denoted $sense(v,e)$, and incurs a cost $c(sense(v,e))$, or just $c(e)$ when the cost does not depend on $v$. The cost function $c$ is domain-dependent, as discussed below. The task of the agent is to travel to the goal while minimizing the total cost $C_{total} = C_{travel} + C_{sense}$.
We further generalize CTP to allow dependencies between edges, and non-binary edge weight distributions. In this general form, CTP-Gen is a 5-tuple where is a graph, is a distribution over weights of the edges , is a sensing cost function, are the start and goal vertices, respectively. The distribution model is over random variables indexed by the edges in , abusing notation we will use the edges in place of the respective random variables. The domain of these random variables are arbitrary weights or cost sets. is usually specified as a structured distribution model over the random variables . Henceforth we assume that is specified as a Bayes network over these random variables, where is the set of random variables, is a set of directed arcs so that is a directed acyclic graph, and are the conditional probability tables, one for each .
We mostly limit ourselves to the binary case, where each edge can be either blocked (“infinite weight”) or open (some known weight, possibly different for each edge). In these cases, and to simplify the representation of the distribution, we use a uniform binary domain for the edges, and describe the weights of the (unblocked) edges separately, by a weight function $w$. In the degenerate binary case where the distribution is a Bayes network with no arcs, i.e. all random variables are independent, the problem reduces back to the basic CTP with sensing. In this case we usually specify the distribution as a function $p$, where $p(e)$ is the probability that edge $e$ is blocked.
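To make the independent binary case concrete, the sketch below samples one realization of the edge statuses and computes the cost a fully informed agent would pay on it. The dictionary encoding of $w$ and $p$, both function names, and the three-edge instance are assumptions for illustration; the shortest-path routine is ordinary Dijkstra restricted to open edges, not an algorithm from this work.

```python
import heapq
import random


def sample_realization(p, rng):
    """Return the set of blocked edges; edge e is blocked independently with probability p[e]."""
    return {e for e, prob in p.items() if rng.random() < prob}


def shortest_open_path_cost(w, blocked, s, t):
    """Dijkstra over the open edges of an undirected graph; inf if t is unreachable."""
    adj = {}
    for (u, v), cost in w.items():
        if (u, v) not in blocked:
            adj.setdefault(u, []).append((v, cost))
            adj.setdefault(v, []).append((u, cost))
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


# Hypothetical instance: a short two-edge route whose second edge may be blocked,
# and a long direct edge that is always open.
w = {("s", "a"): 1.0, ("a", "t"): 1.0, ("s", "t"): 5.0}
p = {("s", "a"): 0.0, ("a", "t"): 0.5, ("s", "t"): 0.0}
blocked = sample_realization(p, random.Random(0))
cost_all_open = shortest_open_path_cost(w, set(), "s", "t")
cost_short_blocked = shortest_open_path_cost(w, {("a", "t")}, "s", "t")
```

An agent without this knowledge must commit to moves before seeing distant edges, which is what separates the CTP from a plain shortest-path computation.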
2.3.1 CTP with Dependencies in Disjoint Path Graphs
As CTP-Gen is extremely complicated, we focus on some special cases w.r.t. the topology of the graph . Specifically, we examine the basic CTP with no remote sensing where is a disjoint-path graph (w.r.t. ). As this case is known to be solvable in closed form in polynomial time, we generalize it to the case where edges are dependent, and edge weights are binary (blocked/unblocked) random variables. Thus we consider CTP-DEP, defined by the 5-tuple where is an undirected CTP graph, is a Bayes network representing the edge blocking distribution model, is a function denoting the edge weights (for unblocked edges), and are the start and goal vertices respectively, as usual.
As we will show, finding an optimal policy for CTP-DEP is intractable even for special cases, and we will thus consider cases where the distribution model has dependencies only between edges on the same path. Thus the Bayes network representing the distribution model has one (or more) unconnected component for each set of edges composing a path. We call this simplified variant CTP-PATH-DEP.
In a disjoint-path graph, we index the edges such that each edge has two indices, where the first index indicates the path and the second index indicates the serial location of the edge in the path. For instance, $e_{i,1}, e_{i,2}, \dots$ are the edges composing the $i$’th path. Similarly, we index the vertices such that the first index indicates the path and the second index indicates the serial location of the vertex in the path; $v_{i,1}, v_{i,2}, \dots$ are the vertices composing the $i$’th path. Note that each edge can be represented by
2.4 AND/OR Graphs
Many problems in artificial intelligence can be formulated as search in a state space. An AND/OR graph is a directed graph that represents a problem-solving process. The solution of an AND/OR graph is a subgraph, called the solution graph, that is a derivation of the optimal solution of the original problem. In this work, we use AND/OR graphs for finding optimal solutions to probabilistic reasoning problems, and to CTP in particular. With a slight abuse of notation we use the same symbol $G$ to denote both the CTP graph and the AND/OR graph.
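Formally, an AND/OR graph is a tuple $(N, N_T, E, c, p)$ defined as follows:
- $N = N_{AND} \cup N_{OR}$, where $N_{AND}$ and $N_{OR}$ are finite sets of AND nodes and OR nodes, respectively.
- $N_T \subseteq N$ is the set of terminal leaf nodes.
- $E \subseteq N \times N$ is a set of directed edges between the nodes.
- $c$ is a cost function over the edges and the terminal nodes.
- The graph associates probabilities $p$ with the outgoing edges of the AND nodes, such that $\sum_{n'} p(n,n') = 1$ for every $n \in N_{AND}$.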
The root node of $G$ is denoted by $r$. A policy graph $H$ of the AND/OR graph is a subgraph of $G$ such that
- $r \in H$.
- If $n \in H$ is an AND node, all its children are in $H$.
- If $n \in H$ is an OR node, then exactly one of its children is in $H$.
- Every leaf node (a node with no children) in $H$ is terminal.
and,
- If $n \in H$ is an AND node, all its outgoing edges are in $H$.
- If $n \in H$ is an OR node and $n' \in H$ is a child of $n$, then the edge $(n,n')$ is in $H$.
Define the policy graph of a node $n$, denoted by $H(n)$, to be a subgraph of $G$ that satisfies all properties of $H$ except that the root of $H(n)$ is the arbitrary node $n$ instead of $r$.
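The value of each node $n$ is defined as follows:
$$V(n) = \begin{cases} c(n) & \text{if } n \text{ is a terminal node} \\ \min_{n'} \left[ c(n,n') + V(n') \right] & \text{if } n \text{ is an OR node} \\ \sum_{n'} p(n,n') \left[ c(n,n') + V(n') \right] & \text{if } n \text{ is an AND node} \end{cases}$$
where $n'$ ranges over the children of $n$, $c(n,n')$ is the cost of the edge from $n$ to $n'$, and $p(n,n')$ is the probability associated with that edge. The child of an OR node with the minimal value is called the preferred son. The cost of a policy graph is defined as the value of its root. The policy graph is optimal if there is no other policy graph with lower cost.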
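The node values can be computed by a direct recursion over an explicit, fully expanded AND/OR tree, with a minimum at OR nodes and an expectation at AND nodes. The sketch below assumes a hypothetical dictionary encoding of nodes and a tiny CTP-like tree, neither of which comes from this work.

```python
def node_value(node):
    """Recursive value of a node in an explicit AND/OR tree.

    node["kind"] is "terminal", "OR" or "AND"; node["children"] is a list of
    (child, edge_cost, probability) triples (probability is unused for OR nodes).
    """
    if node["kind"] == "terminal":
        return node.get("cost", 0.0)
    totals = [(cost + node_value(child), prob) for child, cost, prob in node["children"]]
    if node["kind"] == "OR":
        return min(total for total, _ in totals)
    return sum(prob * total for total, prob in totals)


def preferred_son(or_node):
    """Child of an OR node that attains the minimum of edge cost plus child value."""
    return min(or_node["children"], key=lambda ch: ch[1] + node_value(ch[0]))[0]


# Hypothetical tree: from the source, either take a direct edge of cost 5, or pay 1
# to reach an intermediate vertex where the final edge (cost 1) is open with
# probability 0.5; if it is blocked, going back and around costs 6.
goal = {"kind": "terminal"}
stuck = {"kind": "OR", "children": [(goal, 6.0, None)]}
try_short = {"kind": "AND", "children": [(goal, 1.0, 0.5), (stuck, 0.0, 0.5)]}
root = {"kind": "OR", "children": [(try_short, 1.0, None), (goal, 5.0, None)]}
root_value = node_value(root)
```

Here trying the short route has expected cost 1 + 3.5 = 4.5, which beats the direct edge, so the preferred son of the root is the AND node for the short route.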
We define a policy subgraph $H'$ to be a subgraph of $G$ such that
- $r \in H'$.
- If $n \in H'$ is an AND node, all its children are in $H'$.
- If $n \in H'$ is an OR node, then exactly one of its children is in $H'$.
That is, a policy subgraph satisfies all properties of a policy graph except that its leaves are not necessarily terminal. The cost of a policy subgraph is defined as the value of its root. The best policy subgraph is a policy subgraph with the minimal cost in the AND/OR graph.
2.4.1 AO*
AO* is a heuristic-based search algorithm that searches the AND/OR graph for the optimal policy graph. It gradually builds up a partial policy graph, assigns heuristic values to the leaves, and propagates the heuristic values up to the root. The heuristics used to estimate the real cost of the nodes in the AND/OR graph are admissible, and therefore finding the optimal solution is guaranteed. AO* is beneficial when solving problems with a large state space. The algorithm does not assume that the AND/OR graph representing the problem is given; instead, it constructs the AND/OR graph by expanding it in each iteration, and thereby develops the best policy subgraph in each iteration. The process ends when all the leaf nodes of the partial policy graph are terminal.
AO* takes advantage of the fact that once a node is known to be in the optimal policy graph, it does not require any further expansion. Thus, the algorithm maintains a boolean flag called “SOLVED” for each node in the AND/OR graph, which signals to the algorithm whether the node is part of the optimal policy; i.e., a node $n$ is set SOLVED, by the operation MarkSolved($n$), if $n$ is known to be in the optimal policy subgraph. Once a node is SOLVED, it remains SOLVED. A node $n$ is SOLVED if and only if all the nodes in the subpolicy spanned from $n$ are SOLVED. Hence, when a node is set SOLVED, the subpolicy spanned from this node does not require any further update or expansion. To implement this, AO* performs MarkSolved($n$) if node $n$ satisfies one of the following:
- $n$ is a terminal node.
- $n$ is an AND node and all its children are set SOLVED.
- $n$ is an OR node and its preferred son is set SOLVED.
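The three conditions above can be checked locally at a node, given the current heuristic values and SOLVED flags of its children. The following sketch uses a hypothetical dictionary encoding (`kind`, `children`, `h`, `solved`) and treats a non-terminal node with no children as an unexpanded leaf, which is an assumption of this example.

```python
def try_mark_solved(node):
    """Set node["solved"] if one of the three SOLVED conditions holds; return the flag.

    node["children"] is a list of (child, edge_cost) pairs; node["h"] is the current
    heuristic value. A node that is already SOLVED stays SOLVED.
    """
    if node["solved"]:
        return True
    if node["kind"] == "terminal":
        node["solved"] = True
    elif not node["children"]:
        node["solved"] = False  # unexpanded non-terminal leaf
    elif node["kind"] == "AND":
        node["solved"] = all(child["solved"] for child, _ in node["children"])
    else:  # OR node: only the preferred son matters
        best_child, _ = min(node["children"], key=lambda ch: ch[1] + ch[0]["h"])
        node["solved"] = best_child["solved"]
    return node["solved"]


# Hypothetical nodes: a terminal node and an unexpanded leaf with heuristic value 3.
goal = {"kind": "terminal", "children": [], "h": 0.0, "solved": False}
leaf = {"kind": "OR", "children": [], "h": 3.0, "solved": False}
or_node = {"kind": "OR", "children": [(goal, 2.0), (leaf, 1.0)], "h": 2.0, "solved": False}
and_node = {"kind": "AND", "children": [(goal, 0.0), (leaf, 0.0)], "h": 1.5, "solved": False}
try_mark_solved(goal)
or_solved = try_mark_solved(or_node)
and_solved = try_mark_solved(and_node)
```

The OR node becomes SOLVED because its preferred son (cost 2 + 0) is the terminal node, while the AND node does not, since one of its children is still an unexpanded leaf.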
Each iteration of the AO* algorithm has two phases, expansion and propagation, described as follows:
- Expansion phase:
  1. Trace down the marked edges from the root until reaching a non-terminal leaf node $n$, and expand it. (Finding the node to expand requires a recursive traversal of the AND/OR graph, since the partial policy graph changes in each iteration.)
  2. For each child $n'$ of $n$, if $n'$ has not been generated, add it to the policy graph and assign it an admissible heuristic value. If $n'$ is a terminal node, assign 0 to its heuristic value and perform MarkSolved($n'$).
- Propagation phase: the heuristic values and marked edges of the expanded nodes are propagated from the leaves up to the root, as follows:
  1. If $n$ is an OR node, its heuristic value is updated by
  $$h(n) = \min_{n'} \left[ c(n,n') + h(n') \right] \tag{2.3}$$
  where $n'$ ranges over the children of $n$ and $c(n,n')$ is the cost of the edge from $n$ to $n'$. The marked edge is directed from $n$ to the child $n'$ that achieves the minimum in equation 2.3, and $n$ is set SOLVED if and only if $n'$ is set SOLVED.
  2. If $n$ is an AND node, its heuristic value is updated by
  $$h(n) = \sum_{n'} p(n,n') \left[ c(n,n') + h(n') \right]$$
  where $p(n,n')$ is the probability associated with the edge from $n$ to $n'$. The node $n$ is set SOLVED if and only if all its children are set SOLVED.
  The procedure of updating the heuristic values and marking the edges is repeated for all ancestors of $n$.
Properties of AO*:
- The heuristic values are optimistic estimates (lower bounds) of the real value of the state, where each update raises the heuristic value and reduces its error relative to the real value.
- AO* is beneficial when applied to a large state space. One reason is that AO* considers only states that are reachable from the initial state. Another is that an informative heuristic function focuses the search on states that lie along a good (partial) policy graph. As a result, AO* may find an optimal solution by exploring a small fraction of the entire state space.
2.4.2 CTP and AND/OR graphs
An AND/OR graph is a natural structure for representing the state space of CTP, where a policy of CTP is represented by a policy graph. The problem-solving process is a search for an optimal policy graph in the space of policy graphs. Here, an OR node represents the agent’s decision in the current state among all its available actions. In basic CTP, the available actions are all the moves available from the current vertex, while in Sensing-CTP the available actions also include all the available remote sensing actions (a remote sensing action is available if it is performed on an unknown edge). The AND nodes represent the actions. Since CTP is a stochastic problem, each action may result in several possible states, which are represented by the AND node’s children. The states of the environment in CTP are represented by the OR nodes. The states are the belief states of the agent at the current time step, where each belief state is represented by a tuple consisting of the agent’s location and the status of every edge. Henceforth, all functions, predicates and lemmas presented in section 3.4 can be applied to the states in the AND/OR graph. We call the set of states that appear in the AND/OR graph the expanded states. Although the AND nodes do not represent states (they are called “semi-states”), they maintain heuristic values, as described in the AO* algorithm, which are used for propagation. Since the environment is static, once the agent observes an edge, its status remains unchanged. A terminal state is a state in which the location variable is the target ($t$). A node is a terminal leaf node if the state with which it is associated is terminal.
Definition 2.4.1.
A belief state $b$ is an expanded belief state if there is an OR node in the AND/OR graph that is associated with $b$.
2.5 Models for the Canadian Traveler Problem
2.5.1 POMDP for CTP
In this section we show that CTP can be modeled as a POMDP. Let $I$ be an instance of basic CTP, and let $I'$ be an instance of CTP-with-sensing. Given a POMDP $M$, we show how $I$ and $I'$ can be modeled by $M$ as follows:
- The state space $S$. The state space $S$ of $I$ (or $I'$) represents the possible environments of the world. Each state $s \in S$ indicates the location of the agent and the status of all edges in $E$. The location of the agent in state $s$ is denoted by $loc(s) = v$, where $v \in V$ is the vertex at which the agent is situated. Each edge $e \in E$ is associated with a status variable $status_s(e)$, where $status_s(e) = open$ indicates that $e$ is open in $s$, and $status_s(e) = blocked$ indicates that $e$ is blocked in $s$. Thus, we define the state space to be $S = V \times \{open, blocked\}^{|E|}$.
- The action set $A$. In the basic CTP the set of actions $A$ includes only one type of action, $Move(e)$, in which the agent moves along an edge $e$ if $e$ is open. In CTP-with-sensing, the set of actions $A$ includes, in addition to the Move actions, the sensing actions $Sense(e)$, in which the agent senses an edge $e$. This action can be performed from any vertex $v \in V$.
- The transition function $Tr$. Given $s, s' \in S$ and $a = Move(e)$ with $e = (v_i, v_j)$, we define $Tr(s,a,s') = 1$ if the following are satisfied:
  - For all edges $e' \in E$ the status of the edge in $s$ is equal to its status in $s'$, i.e. $status_s(e') = status_{s'}(e')$.
  - $loc(s) = v_i$ and $loc(s') = v_j$.
  - The edge $e = (v_i, v_j)$ is open in $s$, i.e. $status_s(e) = open$.
  Otherwise $Tr(s,a,s') = 0$.
  If $a = Sense(e)$ where $e \in E$, we get $Tr(s,a,s') = 1$ if and only if $s = s'$, since the Sense action does not change the state of the environment.
- •
The reward(cost) function R. Given , we define R as follows: In case that then,
[TABLE]
In case that then,
[TABLE]
Notation 2.5.1.
Let $X$ be a set. Denote the power set of $X$ by $2^{X}$.
- •
The observation set Z. Let . We define Z to be the power set of Z’, Namely .
- •
The observation function O. Given , and we define O as follows: In case that where , the only observations received are of the edges incident to vertex , and then,
[TABLE]
In case that , the only observation received is of the sensed edge e, and then,
[TABLE]
- •
is the initial state.
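The deterministic transition function described in the list above can be sketched as follows. The representation is an assumption made for illustration: a state is a location plus the set of open edges (all other edges are blocked), and the names `State`, `norm`, `tr_move` and `tr_sense` are hypothetical.

```python
from typing import FrozenSet, NamedTuple, Tuple

Edge = Tuple[str, str]


class State(NamedTuple):
    loc: str
    open_edges: FrozenSet[Edge]  # edges open in this state; the rest are blocked


def norm(u: str, v: str) -> Edge:
    """Canonical key for an undirected edge."""
    return (u, v) if u <= v else (v, u)


def tr_move(s: State, u: str, v: str, s2: State) -> float:
    """Tr(s, Move(u, v), s'): 1 if all three conditions hold, else 0."""
    same_status = s.open_edges == s2.open_edges    # edge statuses never change
    at_endpoints = s.loc == u and s2.loc == v      # agent goes from u to v
    edge_open = norm(u, v) in s.open_edges         # the edge must be open in s
    return 1.0 if (same_status and at_endpoints and edge_open) else 0.0


def tr_sense(s: State, s2: State) -> float:
    """Sensing does not change the state of the environment."""
    return 1.0 if s == s2 else 0.0
```

For example, with `s = State("a", frozenset({("a", "b")}))`, the call `tr_move(s, "a", "b", State("b", s.open_edges))` evaluates the three conditions of the Move case on a single open edge.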
Notation 2.5.2**.**
The optimal policy of is denoted by .
2.5.2 Belief State for Representing the Environment of CTP
A belief state, which is defined as a distribution over all possible states, is a representation of the agent’s knowledge about the environment. In CTP, the belief states can be represented by the location of the agent and the status of each edge in the graph.
Definition 2.5.4**.**
We say that status of edge is:
- •
“known to be blocked” if has been already sensed and found to be blocked.
- •
“known to be open” if has been already sensed and found to be open.
- •
“unknown” if has not been sensed.
Definition 2.5.5**.**
*Define as follows:
is the edge status of in belief state , where*
[TABLE]
Definition 2.5.6**.**
Define as the location of an agent in a belief state, where outputs the physical location of an agent that is in belief state ,i.e. where is an arbitrary state which satisfies .
Note that definition 2.5.6 assumes that there cannot be two states which satisfy , such that since by definition, the agent always knows its own location, and thus, for every belief state , if there exists in which then .
Thus, we can define an alternative way for representing a belief state,
Definition 2.5.7**.**
Let . The form of , denoted by , is defined to be the tuple ,
Definition 2.5.8**.**
Let be a belief state, we define the following sets:
- 1
* is the set of all edges in which *
- 2
* is the set of all edges in which *
- 3
* is the set of all edges in which *
Let be a belief state. Then, there is a mapping from to . Namely, for every there is a mapping from to as follows:
[TABLE]
Where is defined as follows:
if one of the following is satisfied:
2. 2.
3. 3.
otherwise
Corollary 2.5.9**.**
Since there is a mapping from to , we can use the form instead of the belief state itself, for representing the belief state of an agent.
Definition 2.5.10**.**
Let be the probability that edge is blocked given that the agent is in belief state . Namely such that .
In the basic variant of CTP, the probabilities associated with the edges are independent, and hence, as long as , we have .
Corollary 2.5.11**.**
From definition 2.5.10 we get that the probability that edge is blocked given the agent is in belief state is given by:
[TABLE]
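A minimal sketch of the case analysis behind corollary 2.5.11 for the basic (independent) variant is given below. The exact formula of the source is not recoverable from the text, so this assumes the natural reading: a known-blocked edge has probability 1, a known-open edge has probability 0, and an unknown edge keeps its prior. The status codes `"O"`, `"B"`, `"U"` and the function name are hypothetical.

```python
def blocked_probability(status, prior, edge):
    """P(edge is blocked | belief state), assuming independent edges.

    status: dict mapping edge -> "O" (known open), "B" (known blocked), "U" (unknown)
    prior:  dict mapping edge -> prior blocking probability p(e)
    """
    st = status.get(edge, "U")
    if st == "B":
        return 1.0
    if st == "O":
        return 0.0
    return prior[edge]  # unknown edges keep their prior in the independent case
```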
2.5.3 Belief MDP for CTP
Given a POMDP of an instance of CTP (or of CTP-with-sensing), let be the belief state space of M. We define a belief MDP of I, based on M, where the state space B is over the state space S. CTP is a special case of POMDP (called Det-POMDP) in which the transition function Tr and the reward function R can be simplified as follows:
- •
The transition function Tr. In general belief MDP, given , , Tr is given by:
[TABLE]
Given , we define to be the set of all edges incident to that are unknown in and known in (the edges that are revealed by the local sensing), i.e.
[TABLE]
Then if and only if,
- –
For all . The status of the edges does not change, nor does the information about any unrevealed edge that is not sensed.
- –
and .
- –
The edge e=() is open in b, i.e. . The edge has to be open in order to traverse it.
In this case,
[TABLE]
Given , if and only if,
- –
. The edge e’ is known after performing Sense(e’).
- –
For all . The state is not affected by the Sense action and the only information received is the status of e’.
- –
. The location is not affected by the Sense action.
In this case,
[TABLE]
- •
The reward function R. In general, the reward function is defined by . Denote the action cost of by , where if and if . Hence,
[TABLE]
We define . Therefore, if there exists such that . Otherwise . Note that in case that , there always exists reachable from in which , thus always holds.
- •
is the initial belief state.
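The Move case of the belief-MDP transition can be sketched as an enumeration of successor belief states. This is an illustration under stated assumptions, not the thesis's code: edges are independent (basic CTP), a belief state is a location plus a status dictionary with values `"O"`, `"B"`, `"U"`, and the names `edge_key` and `move_successors` are hypothetical. Arriving at the destination reveals every unknown edge incident to it, and the successor probability is the product of the revealed statuses' probabilities.

```python
from itertools import product


def edge_key(u, v):
    """Canonical key for an undirected edge."""
    return tuple(sorted((u, v)))


def move_successors(loc, status, prior, u, v):
    """Return [(probability, new_location, new_status)] for Move(u, v).

    The action is performable only if the agent is at u and the edge (u, v)
    is known to be open in the current belief state.
    """
    e = edge_key(u, v)
    if loc != u or status.get(e) != "O":
        return []
    # Unknown edges incident to v are revealed by local sensing on arrival.
    revealed = sorted(f for f in status if v in f and status[f] == "U")
    successors = []
    for outcome in product("OB", repeat=len(revealed)):
        prob = 1.0
        new_status = dict(status)  # all other edge statuses stay unchanged
        for f, st in zip(revealed, outcome):
            new_status[f] = st
            prob *= prior[f] if st == "B" else 1.0 - prior[f]
        successors.append((prob, v, new_status))
    return successors
```

With k revealed edges the function returns 2^k successors whose probabilities sum to 1, which is the source of the exponential branching in the belief space.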
Definition 2.5.12**.**
We say that action can be performed in belief state if there is a belief state such that
2.6 Related Work
2.6.1 Different Variations of CTP
The Canadian traveler problem is known to be hard [6]. In the absence of approximation solutions, special variants and special classes of graphs have been suggested for which the exact solution can be found in polynomial time. [2] investigated the case of Recoverable CTP, where each vertex is associated with a specific recovery time to reopen any blocked edge that is incident to it. When an agent finds a blocked edge it can either traverse another edge or wait a period of time and check whether has been opened. The basic CTP is a special case of the Recoverable CTP where all the recovery times are infinitely large. There are two variations of the Recoverable CTP, deterministic and stochastic. In the deterministic variation the assumption is that the number of edges that may be blocked is bounded. In the stochastic variation, each edge is associated with a probability of being blocked, and it is assumed that the recovery time is not long relative to the travel time. The two cases were proved to be solvable in time polynomial in the number of edges and vertices and in the maximal number of blocked edges. [5] investigated a CTP variant where the environment is dynamic, in the sense that the status of each edge is generated randomly with a given probability whenever the agent reaches an incident vertex of . This variant can be modeled by an MDP, where the states represent only the current location of the agent. Since an MDP is solvable in polynomial time, this variant is solvable in polynomial time as well. Notice that basic CTP is much harder to solve, since the status of the edges remains fixed and thus the state space is exponentially larger (in the number of edges). Nikolova et al. have shown that CTP on a directed acyclic graph (DAG) can be solved in polynomial time using dynamic programming.
2.6.2 Disjoint Path Graphs
A disjoint path graph is an undirected graph with source and destination such that all paths in G are between and , and these paths are pairwise disjoint. [3] have shown that CTP on a disjoint path graph is solvable in polynomial time. The proof is based on the property that the optimal policy is committing. This guarantees that whenever an agent follows a path, the optimal action is to continue along the path until reaching the target, unless it hits a blocked edge. The optimal policy of CTP on disjoint paths is to follow the paths in order of ( is a parameter associated with each path in G). That is, the optimal policy is to travel the path with the minimal until reaching the target unless the path is blocked. If the path is blocked, the agent returns to and travels the path with the second minimal , and so on. is defined as,
[TABLE]
Where denotes the backtracking cost of path i, which is the cost of traversing path i until hitting a blocked edge and then returning to when the path is not traversable, or 0 when the path is traversable. The expected cost of is
[TABLE]
Where .
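The ordering rule for committing policies on disjoint paths can be checked numerically. The sketch below is an illustration under stated assumptions: the source's formulas are not recoverable from the text, so the index used here, E[BC_i]/P_i + W_i (expected backtracking cost divided by the probability that path i is open, plus the path weight), is the one obtained from the standard adjacent-exchange argument and is assumed to match the source. The function names and the toy instance are hypothetical, and every path is assumed to be open with positive probability.

```python
from itertools import permutations


def path_stats(path):
    """path: list of (weight, blocking_probability), ordered from s to t.

    Returns (P_open, W, E[BC]) where the backtracking cost is twice the
    distance travelled before the first blocked edge, or 0 if none is blocked.
    """
    p_reach, dist, exp_bc = 1.0, 0.0, 0.0
    for w, p_block in path:
        exp_bc += p_reach * p_block * 2.0 * dist  # blocked here: walk back
        p_reach *= 1.0 - p_block
        dist += w
    return p_reach, dist, exp_bc


def expected_cost(paths, order):
    """Expected cost of the committing policy that tries paths in this order."""
    total, p_all_failed = 0.0, 1.0
    for i in order:
        p_open, w, exp_bc = path_stats(paths[i])
        total += p_all_failed * (exp_bc + p_open * w)
        p_all_failed *= 1.0 - p_open
    return total


def index_order(paths):
    """Sort paths by the assumed index E[BC_i] / P_i + W_i (requires P_i > 0)."""
    def index(i):
        p_open, w, exp_bc = path_stats(paths[i])
        return exp_bc / p_open + w
    return sorted(range(len(paths)), key=index)


# Toy instance; the second path is always open, so the cost is finite.
paths = [[(1.0, 0.0), (1.0, 0.5)], [(5.0, 0.0)], [(2.0, 0.2), (1.0, 0.1)]]
best = min(expected_cost(paths, order) for order in permutations(range(len(paths))))
by_index = expected_cost(paths, index_order(paths))
```

On this instance the index ordering attains the brute-force minimum over all orderings, which is what the exchange argument predicts.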
Another variation of CTP on disjoint path graphs is when the edges cannot be blocked but instead have two possible finite costs: a cheap one and an expensive one [5]. A simple case of this variation is when the edge cost is binary, i.e., 0 or 1. In this case the optimal policy is to explore the cost-0 edges of each path until reaching an edge with cost 1 on that path, and then return to the path with the fewest unexplored edges and follow it until reaching the target. A more general case of this variation is when the edges are associated with the cost 1 or K. In this case the optimal policy has the property that once an edge with cost has been crossed, it is optimal to continue along the same path until reaching the target. Taking advantage of the special structure of the policy induced by this property allows one to define an MDP with a concise representation that decides in what order to explore the paths, and how many, before committing to a path. These two cases were proved to be solvable in polynomial time.
2.6.3 CTP with Sensing
The CTP with sensing is at least as hard as the basic CTP, since a simple reduction can be constructed from any instance of CTP: the graph of the basic CTP is the graph of the CTP with sensing, but the sensing costs of all edges are large enough that sensing an edge is never worthwhile. As such, the expected costs of the two optimal policies are equal.
Heuristic search algorithms
In order to facilitate the search for a solution of CTP with sensing, some heuristic-based algorithms have been suggested. These algorithms do not provide an optimal solution; however, they may be much simpler. [3] have suggested the FSSN algorithm, which is based on the free space assumption heuristic. The free space assumption [4] assumes that edges are traversable unless specifically known otherwise. FSSN plans a path from some vertex to that is the shortest path under the free space assumption. The agent can either attempt to traverse P without sensing or decide to interleave sensing actions into the movement actions, according to a sensing policy that is embedded in the algorithm.
A number of sensing policies for FSSN have been suggested:
Never Sense is a brute force policy that never senses any remote edge. This policy never incurs any sensing cost but it may lead to an increased travel cost.
Always Sense is a brute force policy that senses all the unknown edges in the path before it moves along it.
Value of information is a policy that decides which edges to sense according to their value of information.
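The planning step under the free space assumption can be sketched with a standard Dijkstra search in which only edges known to be blocked are removed. This is an illustrative sketch, not the FSSN implementation: the function name, the status codes `"O"`, `"B"`, `"U"`, and the dictionary-based graph representation are assumptions, and the sensing policy itself is not modeled.

```python
import heapq
from math import inf


def free_space_shortest_path(edges, status, source, target):
    """Shortest path treating every edge not known to be blocked as open.

    edges:  dict mapping (u, v) -> weight (undirected)
    status: dict mapping (u, v) -> "O", "B" or "U"; missing edges count as "U"
    Returns (cost, path) or None if the target is unreachable.
    """
    adj = {}
    for (u, v), w in edges.items():
        if status.get((u, v), "U") == "B":
            continue  # only known-blocked edges are excluded
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, inf):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]
```

An agent following this heuristic would replan with the same routine whenever an edge on the current path is discovered to be blocked.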
2.6.4 Propagating AO*
AO* harnesses the benefits of heuristic search to avoid searching states that are undesirable. However, in many situations AO* examines far more states than necessary. Propagating AO* (PAO*) [1] is an extension of AO* that goes one step further to facilitate the search. PAO* propagates the heuristic values on a larger scale, which minimizes the expansion of non-terminal nodes. PAO* is based on a specific variation of the AO* algorithm: Ferguson et al. constructed an algorithm that solves a variation of CTP where most of the graph (edges) is observable, such that only a single unknown edge (called a pinch point) can be incident to a vertex (in the original paper the pinch points are called “faces”). As such, any chance node (AND node) has at most two children, which represent a traversable edge and a blocked edge. PAO* is described as follows. The expansion phase proceeds exactly as in AO*: PAO* grows the best partial policy graph by expanding the non-terminal leaf nodes and assigning heuristic values to their children. Similarly to AO*, PAO* propagates the heuristic values upward to the root. However, PAO* propagates the heuristic values sideways and downwards to the children as well. Furthermore, the algorithm takes advantage of the fact that the AND node has only two children (traversable and blocked), such that the heuristic value of the parent node should never be less than the value of the traversable child. Thus, PAO* propagates the heuristic value of the parent to the traversable child if the heuristic value of the traversable child is higher. Similarly, the heuristic value of the parent should never be greater than the value of the blocked child. Therefore, PAO* propagates the value to the non-traversable child in case the heuristic value of the parent is higher.
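The downward propagation rule described above amounts to clamping the two children's values against the parent's value. The following sketch only illustrates that rule; the function name is hypothetical and the rest of PAO* (expansion, upward and sideways propagation) is not shown.

```python
def propagate_to_children(parent, traversable, blocked):
    """Downward propagation step for a two-child AND node.

    The traversable child's value must not exceed the parent's value,
    and the blocked child's value must not fall below it.
    Returns the updated (traversable, blocked) heuristic values.
    """
    new_traversable = min(traversable, parent)  # lower it if it was higher
    new_blocked = max(blocked, parent)          # raise it if it was lower
    return new_traversable, new_blocked
```

For instance, a parent value of 10 with children valued 12 (traversable) and 7 (blocked) would bring both children to 10, while children that already respect the ordering are left unchanged.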
Chapter 3 Theoretical Analysis of CTP
3.1 CTP with Dependencies
Theorem 3.1.1**.**
CTP with dependencies is at least as hard as CTP with sensing.
Proof outline: By reduction from CTP-with-sensing to CTP-with-dependencies.
Proof. Let I=(G,W,C,SC,s,t) be an instance of CTP-with-sensing. We construct an equivalent instance I’=(G’,W’,C’,s’,t’) of CTP-with-dependencies and show that there is a one-to-one equivalence between I and I’. The construction of I’ is as follows: G’ contains G entirely, and in addition, each vertex in G is attached to two-edge dead-end paths that simulate the sensing operations of I, one path for each possible sensing operation in I.
Formally, the construction of I’ is as follows:
First, we construct by copying the graph G(V,E) using the following functions:
- •
is a bijective function that copies V into such that for each , is the copied vertex of .
- •
is a bijective function that copies E into such that for each , is the copied edge of .
Let be the set of all the vertices that were copied from V, meaning . Let be the set of all the edges that were copied from E, meaning .
Notation 3.1.2**.**
* denote the copied vertex .*
Notation 3.1.3**.**
* denote the copied edge *
We construct a new graph G’(V’,E’) by extending and using the following functions:
- •
and are one-to-one functions that generate a vertex for each element in . Meaning, given and , .
- •
and are one-to-one functions that generate an edge for each element in such that given and , and in addition, and .
Let graph be defined as follows:
[TABLE]
where,
[TABLE]
Notation 3.1.4**.**
Given , we define a two edge dead end path .
Note that can be viewed as an “attachment” of paths to each .
is the Bayesian network that is associated with edges in G where X is the set of nodes and Y is the set of arcs. Similarly is the Bayesian network that is associated with edges in G’ where X’ is the set of nodes and Y’ is the set of arcs. Let x be a node in X, and x’ be a node in X’. We define W’ by W as follows:
- •
For each , ().
- •
For each , , i.e. all edges in are open.
- •
For each j, , i.e. for each j, all edges are open if and only if is open.
The weight function is defined by:
. 2. 2.
where denote the sensing cost of . 3. 3.
.
The time needed to generate this reduction is polynomial, since the size of , where and , therefore . Furthermore, the size of is since each node in X’ is associated with an edge in E’, and since each node in X’ that is associated with an edge in is connected to nodes.
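The size of the constructed graph can be illustrated with a small sketch of the gadget construction: one two-edge dead-end path per (vertex, edge) pair, attached to the copied vertex. Several details here are assumptions, since the weight definitions are not recoverable from the text: the first gadget edge is given half the sensing cost so that a round trip costs exactly the sensing cost, the second gadget edge gets an arbitrary placeholder weight, and all names (`build_dependent_instance`, `mirrors`, the tuple-based vertex labels) are hypothetical.

```python
def build_dependent_instance(vertices, edges, weight, sensing_cost):
    """Build the graph part of I' from an instance of CTP-with-sensing.

    vertices:     list of vertex names
    edges:        list of (u, v) pairs
    weight:       dict edge -> travel cost
    sensing_cost: dict (vertex, edge) -> cost of sensing that edge from that vertex
    Returns (new_vertices, new_weight, mirrors) where mirrors maps each second
    gadget edge to the original edge whose status it is meant to copy.
    """
    new_vertices = [("copy", v) for v in vertices]
    new_weight = {(("copy", u), ("copy", v)): weight[(u, v)] for (u, v) in edges}
    mirrors = {}
    for v in vertices:
        for e in edges:
            mid, end = ("mid", v, e), ("end", v, e)
            new_vertices += [mid, end]
            first, second = (("copy", v), mid), (mid, end)
            new_weight[first] = sensing_cost[(v, e)] / 2.0  # assumption: round trip = sensing cost
            new_weight[second] = 1.0                        # assumption: placeholder, never traversed
            mirrors[second] = e  # open if and only if e is open (dependency in W')
    return new_vertices, new_weight, mirrors
```

With n vertices and m edges this yields n + 2nm vertices and m + 2nm edges, which is polynomial in the size of the original instance.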
Let be a POMDP that models I, where are finite sets of states, actions, observations, transition functions, observation functions and reward functions respectively. Similarly, let be a POMDP that models I’ where is the set of observations in I’, is a special subset of the state set in I’, and is a special meta-action set in I’ (a set of sequences of actions in I’) which will be defined later. are the transition functions, observation functions and reward functions in I’. Let be the optimal policy of I and be the optimal policy of I’. In order to prove theorem 3.1.1, it suffices to show that . In the remainder of this proof we prove this property by showing that is equivalent to and that actually models I’.
We want to define the subset that contains all states in which the agent is located in a “copied” vertex. Formally,
Definition 3.1.5**.**
*Given , we define to be the subset of S’ such that if and only if and . Meaning . *
Lemma 3.1.6**.**
Let be the location space of , (i.e. ) then .
Proof.
. According to , for every if .
. V’ contains all possible locations of the agent in G’, where is the subset of V’ which contains all possible locations in . Thus every element in is in ∎
Corollary 3.1.7**.**
* is the location space of .*
Now, we want to show that and are equivalent in the sense that there exists a one-to-one correspondence between and . In order to show that, we need the following definitions and statements,
Definition 3.1.8**.**
*We define EStatusSet(S) to be the set of all elements , meaning .*
In fact, is the set of all the possible status vectors of E, as such .
Lemma 3.1.9**.**
There exists a one-to-one correspondence between V and
Proof.
Since is a bijection, there exists a one-to-one correspondence between and ∎
Lemma 3.1.10**.**
There exists a one-to-one correspondence between and .
Proof.
There exists a one-to-one correspondence between and since for all , each element can be mapped into a different element . This is due to the following facts:
- 1
(injective) According to the definition of W’, , in other words every edge has the same status as its copied edge, and thus there exists a one-to-one correspondence between each set of edge statuses and each set of edge statuses (in fact for each ). Since for every , is a subset of , there exists a one-to-one mapping between and
- 2
(surjective) The status of edges is completely determined and unique, given the status of edges ,
i.e. . In particular, there exists exactly one element in with a given edge status , since each variable associated with edges in depends completely on variables associated with edges in () and the statuses of all edges in are predetermined to be open ().
We are left to show that there exists a one-to-one correspondence between and . Since the location of the agent is independent of the edge statuses, we can represent as a Cartesian product (according to corollary 3.1.7 is the location space of ) but according to definition 3.1.5 we can represent as , hence . Therefore, there exists a one-to-one correspondence between and . ∎
Lemma 3.1.11**.**
There exists one-to-one correspondence between and .
Proof.
According to lemma 3.1.9, there exists a one-to-one correspondence between V and . According to lemma 3.1.10, there exists a one-to-one correspondence between and . Since and , there exists a one-to-one correspondence between and . ∎
Definition 3.1.12**.**
*Let , , and . We define in I’ to be the equivalent meta-action to action () if and only if:*

$$a_{2}=\begin{cases}\mathit{move}(\hat{v}_{i},\hat{v}_{j}) & \text{if } a\in \mathit{move}(v_{i},v_{j})\\ \mathit{move}(\hat{v}_{i},v^{\prime}_{ij1}),\ \mathit{move}(v^{\prime}_{ij1},\hat{v}_{i}) & \text{if } a\in \mathit{sense}(v_{i},e_{j})\end{cases}$$
Definition 3.1.13**.**
We define to be the set of all equivalent actions of actions in . Meaning
Definition 3.1.14**.**
*We define the set to be the following:*

$$\tilde{st}_{e_{i}}=\begin{cases}\{O_{g_{e}(e_{i})},O_{e_{1i2}},O_{e_{2i2}},\ldots,O_{e_{ni2}}\} & \text{if } st_{e_{i}}=O_{e_{i}}\\ \{B_{g_{e}(e_{i})},B_{e_{1i2}},B_{e_{2i2}},\ldots,B_{e_{ni2}}\} & \text{if } st_{e_{i}}=B_{e_{i}}\end{cases}$$
Definition 3.1.15**.**
*Let be a set of observations in I’ and be a set of observations in I. We define is the equivalent observation of (denoted by ) if and only if:
[TABLE]
Lemma 3.1.16**.**
The cost of action in is equal to the cost of the equivalent meta-action in .
Proof.
- 1
(by definition of the weight function).
- 2
, and . Therefore, .
Lemma 3.1.17**.**
Given , and such that , then .
Proof.
- 1
In case that . Let (the edges incident to ) and let . By definition of CTP, if then the agent observes (the pre-known edges incident to which are revealed by the action). Therefore if and otherwise. The equivalent action of is , hence, taking , the agent directly observes , but in addition, according to definition 3.1 , hence the agent also indirectly observes edges . Thus, the agent’s overall observation is . Since , by definition 3.1.15 if then . Hence if and otherwise. Thus, in this case
- 2
In case that the agent observes therefore if and otherwise. Since , where the agent observes directly and observes indirectly (for the same reason as in case 1). Thus, the agent’s overall observation is . Similarly to case 1, since , by definition 3.1.15 if then . Hence if and otherwise. Thus, in this case as well.
Lemma 3.1.18**.**
Given , and such that action is taken in and meta-action is taken in , then if then .
Proof.
WLOG, let , and . Since we get and .
- •
WLOG, in case that . If we get otherwise . Furthermore, if then which incurs , otherwise . Thus, in case that we get .
- •
WLOG, in case that . Since the sense action does not change the location of the agent we get . Since . In this case since the agent returns to its original location . This incurs and thus .
∎
Lemma 3.1.19**.**
Given states then if and then .
Proof.
WLOG, let . Given that then .
- •
In case that , . If then . Since we get .
- •
In case that , . If then . Since we get .
∎
Lemma 3.1.20**.**
* is equivalent to .*
Proof.
We have shown that there exists a one-to-one correspondence between and . By defining the set , which consists of the equivalent actions in , and by defining the set , which consists of the equivalent observations in , we have shown that the functions , , and agree when evaluated on equivalent sets of states, observations and actions. ∎
Lemma 3.1.21**.**
* models the problem of I’.*
Proof.
Here we show that although models a subproblem of I’ ( is defined on subsets of the states and actions of I’), it actually models the exact problem of I’. For every state , where . An agent located in can only move to . In addition, in order for the agent to be located in it has to move from . Thus we can replace the two move actions in with one meta-action , and thus we can reduce the state set S’ to the subset . Therefore, models I’. ∎
3.2 CTP-Forward-Arcs
Definition 3.2.1**.**
Let be a disjoint paths graph of CTP-PATH-DEP and be its associated Bayesian network. Let be the associated node of edges and (note that the edges are in the same path in G). Then the arc is Forward-Arc if , i.e. if is closer to s than .
Definition 3.2.2**.**
CTP-Forward-Dependency(CTP-FOR-DEP) is a special case of CTP-PATH-DEP such that all the arcs in W are Forward-Arcs.
Theorem 3.2.3**.**
CTP-FOR-DEP is solvable in polynomial time.
Proof outline. CTP on a disjoint paths graph with an independent distribution over the edges (CTP-PATH-IND) is shown to be solvable in polynomial time [Bnaya, Felner and Shimony, 2009]. We show that we can transform CTP-FOR-DEP into an instance of CTP-PATH-IND with a new distribution over the edges such that the optimal policy of the new CTP-PATH-IND can be applied to CTP-FOR-DEP.
Proof.
Let be an instance of CTP-FOR-DEP. We construct a new instance of CTP-PATH-IND by constructing a new Bayesian network W’(X’,Y’) of I’ such that
- •
. In other words W’ is “arc free” where each node is an independent component in the BN.
- •
.
Let be a belief state MDP of I, where B is the set of belief states , A is the set of actions, Tr is a set of transition probabilities, R is the reward function. We construct a new belief state MDP of I’ where B’ is the set of belief states, A is a set of actions which is common to the set of action in I (since it refers to the same graph G), Tr’ is a set of transition probabilities, and R’ is the reward function .
Definition 3.2.4**.**
Let be a function defined as follows: Let and such that then .
Notice that is well defined since there is a one to one mapping from to and from to .
Lemma 3.2.5**.**
Let be reachable belief states in B and let be an action. Then
Proof.
Let where and let
- 1
In case that (i.e. action is not performable in ), then . If then one of the following cases must be satisfied:
- •
Edge is not adjacent to the location of the agent in , i.e. . If then since . Thus .
- •
Edge is not adjacent to location of the agent in , i.e. . If then , since . Thus .
- •
Edge is blocked in belief state , i.e. . If then since . Thus .
- •
There exists an edge such that . Since and , there exists an edge Since is blocked in belief state , i.e. . If then , since . Thus .
- 2
In case that (i.e. action is performable in ) then edge has to be open and one of the following cases must be satisfied:
- •
Edge is Open in b (i.e ). If then the status of all edges in must be the same as in , i.e. () since the agent does not sense any unknown edge when performing and hence . If then and the status of all edges in must be the same as in for the same reasons as before. Hence , and we have .
- •
Edge is Unknown in b (i.e ). Since W is the belief network of CTP-FOR-DEP and b is reachable from , the status of all edges has to be Open (in order to reach , all edges in path i from s to must be traversable). Thus,
[TABLE]
In addition (since ). There are no dependencies in W’ (i.e. ). Therefore,
[TABLE]
However, by definition of X’ we have,
[TABLE]
Hence,
[TABLE]
∎
Lemma 3.2.6**.**
Let be reachable belief states in B and let be an action. Then .
Proof.
From definition 2.5 it follows that:
if and only if .
But we proved in lemma 3.2.5 that
Thus, . ∎
Definition 3.2.7**.**
We define the predicate to be true if and only if is reachable from in belief-MDP M. i.e. there exist such that
Lemma 3.2.8**.**
Let . Then is true if and only if is true.
Proof.
Follows from definition 3.2.7 and lemma 3.2.5. ∎
Definition 3.2.9**.**
Define the set to be the set of all belief states that satisfy . Namely,
[TABLE]
Next, we define an analogue set for B’,
Definition 3.2.10**.**
Define the set to be the set of all belief states , for , that satisfy . Namely,
[TABLE]
Let be a belief MDP over belief state and let be a belief MDP over belief state
Lemma 3.2.11**.**
* and are isomorphic.*
Proof.
It follows from definition 3.2.7 that is a bijection over and . In addition, F preserves the functions Tr, Tr’ (lemma 3.2.5) as well as R, R’ (lemma 3.2.6). ∎
Corollary 3.2.12**.**
Let be the optimal policy of I, and be the optimal policy of I’. Then for every reachable belief state we have .
Proof.
Since and are isomorphic, the problems are equivalent and their optimal solutions are equivalent. ∎
Therefore, we can transform any instance of CTP-FOR-DEP into an instance of CTP-PATH-IND, apply the algorithm which solves CTP-PATH-IND in polynomial time, and an equivalent optimal solution is guaranteed (corollary 3.2.12).
Now, we show that the probability for all nodes can be computed in polynomial time. We use Bayes’ theorem to get:
[TABLE]
We use the chain rule to get:
[TABLE]
The variables in the Bayesian network are topologically ordered by their order in the path, and hence each probability can be computed iteratively given that the values of its ancestors have already been determined (using equations 3.2,3.2). Therefore, inferring the probability of each edge takes linear time and inferring the probability of all edges takes . Thus computing the optimal policy takes polynomial time. ∎
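The iterative computation can be illustrated on a single path. The sketch below is a simplified reading under stated assumptions, not the thesis's procedure: each edge's conditional probability table is indexed by the statuses of its parents (all earlier edges on the path, since every arc is a forward arc), `True` means open, and the names `joint_probability`, `conditional_block_prob` and `forward_block_probs` are hypothetical. When the agent reaches edge k it has seen all earlier edges open, so the needed quantity is P(x_k blocked | all earlier edges open); a brute-force enumeration via the chain rule is included only as a cross-check.

```python
from itertools import product


def joint_probability(assignment, parents, cpt):
    """Chain rule: P(x_0..x_{n-1}) = prod_k P(x_k | parents(x_k)). True = open."""
    p = 1.0
    for k, is_open in enumerate(assignment):
        pa = tuple(assignment[j] for j in parents[k])
        p_blocked = cpt[k][pa]
        p *= (1.0 - p_blocked) if is_open else p_blocked
    return p


def conditional_block_prob(k, parents, cpt):
    """P(x_k blocked | x_0..x_{k-1} open) by enumeration (exponential; check only)."""
    num = den = 0.0
    for a in product([True, False], repeat=len(parents)):
        if all(a[:k]):
            pr = joint_probability(a, parents, cpt)
            den += pr
            if not a[k]:
                num += pr
    return num / den


def forward_block_probs(parents, cpt):
    """Direct lookup, valid when every parent of x_k precedes it on the path."""
    return [cpt[k][(True,) * len(parents[k])] for k in range(len(parents))]


# Toy path with three edges (hypothetical numbers); entries are P(blocked | parents).
parents = [[], [0], [0, 1]]
cpt = [
    {(): 0.2},
    {(True,): 0.1, (False,): 0.9},
    {(True, True): 0.3, (True, False): 0.5, (False, True): 0.6, (False, False): 1.0},
]
new_probs = forward_block_probs(parents, cpt)
```

The values in `new_probs` would serve as the independent blocking probabilities of the transformed instance; on this toy path they agree with the brute-force conditionals.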
3.3 CTP-PATH-DEP
Definition 3.3.1**.**
CTP-PATH-DEP is a special case of CTP-DEP where the associated Bayesian network has dependencies only between edges on the same path.
Theorem 3.3.2**.**
CTP-PATH-DEP is NP-hard.
Proof outline. By reduction from 3-SAT to CTP-PATH-DEP.
Proof.
Let be a set of Boolean variables . Let the 3CNF formula be a conjunction of the clauses where each clause is a disjunction of three literals and for each literal it holds that or . We construct the instance of CTP-PATH-DEP from F, such that F is satisfiable if and only if the expected cost of the optimal policy is greater than some given constant. I is defined as follows: is a graph consisting of two disjoint paths , where
- 1
(The edges are ordered from the edge incident to s to the edge incident to t). Edges correspond to clauses respectively, and edges correspond to variables respectively. The correspondence will be defined later in this proof.
- 2
consists of a single edge .
w is the weight function over the edges, is defined by:
- •
- •
- •
for all other edges.
is a Bayesian network.
Definition 3.3.3**.**
For every edge in path , we define the variable to be the variable corresponded to edge such that if and only if is Open.
The set of nodes of is a union of the following sets:
- •
. is a set that contains the single variable .
- •
. is a set that contains the single variable .
- •
. is a set that contains all nodes that correspond to variables .
- •
. is a set that contains all nodes that correspond to variables .
- •
. is a set that contains all nodes that correspond to variables .
Namely, .
The arcs in are defined by the following sets:
- •
. An arc from node to node .
- •
. An arc from node to node .
- •
. An arc from node to node .
- •
. A set of three arcs from each variable node (for ) to clause node such that is the variable corresponding to literal . For instance then , and .
- •
- Arc from each node to a corresponding node
- •
- Arc from each node to node
The conditional probabilities of W are as follows:
- 1
( is an independent variable).
- 2
For every variable node it holds that , i.e. if then path is open with probability 1.
- 3
Given , W is specified as follows:
- (a)
2. (b)
3. (c)
4. (d)
The reduction maps each variable of F to a variable of W such that,
- •
Each boolean SAT variable is mapped to a binary variable in the Bayes network , such that if and only if
- •
Each clause is mapped to binary variable such that if and only if .
Lemma 3.3.4**.**
Given , then F is satisfiable .
Proof.
If then F is satisfiable. In addition, and . Thus, ∎
For simplicity is denoted by CY and is denoted by CL. Note that .
The construction of the reduction is computable in polynomial time since the graph G contains vertices, edges and the Bayes network W contains nodes and arcs. In addition, function , which maps each variable in to variable in , is computable in polynomial time as well.
The optimal policy is committing in the sense that after the agent chooses a path, it keeps following this path until reaching t, unless it hits a blocked edge. This is because if the agent chooses to traverse first, then after traversing the first edge , it is optimal to keep following toward since the costs of the rest of the edges in are 0, and thus if is traversable, no extra travel cost is paid. On the other hand, if is not traversable then the agent pays extra regardless of how many edges it has traversed in . Therefore the decision problem of the optimal policy here is simply whether to choose or as the first path to try.
Notation 3.3.5**.**
Let denote a committing policy that chooses as a first path to try, and denote a committing policy that chooses as a first path to try.
Lemma 3.3.6**.**
Let C be a constant, such that , where k is the number of models in F and n is the number of boolean-SAT-variables in F. Let be the optimal policy of I. F is satisfiable if and only if
Proof.
Suppose that F is satisfiable. The probability that is open is
[TABLE]
by construction of W:
[TABLE]
The probability since there are k sets of literals of F such that its instantiation gives F=true, and the domain size is the number of all possible instantiations to , which equals . Thus,
[TABLE]
[TABLE]
Let PY denote the probability .
Now, we want to calculate the probability that path is open given that is open.
[TABLE]
According to W:
[TABLE]
if then is blocked
[TABLE]
Substituting equations 3.8, 3.10, 3.11 into equation 3.3 gives:
[TABLE]
Denote by the total cost of all edges in and by the total cost of all edges in . The expected cost of the policy when choosing as the first path is
[TABLE]
Note that in case that the agent traverses and is blocked, the agent hits a blocked edge and is forced to pay another CY extra, when the agent moves backward to s. The expected cost of is simply CL. Since , the optimal policy is and . It is given that , therefore if F is satisfiable then .
Suppose that F is not satisfiable. Now, the calculation of the probability is easier because we know that the only case where is open is when .
Therefore,
[TABLE]
[TABLE]
According to equations 3.11,3.14,3.15,3.16
[TABLE]
Thus if is open then is open. The expected cost of is
[TABLE]
Again the expected cost of is CL. Since the optimal policy is and therefore . It is given that , and thus if F is not satisfiable then ∎
∎
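The proof of lemma 3.3.6 uses the ratio between k, the number of models of F, and the number of all possible instantiations of the n variables. For small formulas this ratio can be computed by brute force, as in the sketch below. The clause encoding (signed integers, +i for the i-th variable and -i for its negation) and the function name `count_models` are assumptions made for illustration; the enumeration is exponential in n and says nothing about the complexity of the reduction itself.

```python
from itertools import product


def count_models(clauses, n):
    """Number of satisfying assignments of a CNF formula over n variables.

    clauses: list of tuples of nonzero ints; literal +i means x_i, -i means not x_i.
    """
    k = 0
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            k += 1
    return k


# (x1 or x2 or x3) and (not x1 or not x2 or not x3): only the all-false and
# all-true assignments are excluded.
satisfiable = [(1, 2, 3), (-1, -2, -3)]
k = count_models(satisfiable, 3)
fraction = k / 2 ** 3  # fraction of instantiations that are models of F

# All eight sign patterns over three variables: every assignment falsifies one clause.
unsatisfiable = [tuple(s * i for s, i in zip(signs, (1, 2, 3)))
                 for signs in product([1, -1], repeat=3)]
```

The satisfiable formula gives a strictly positive fraction and the unsatisfiable one gives zero, which is the gap the lemma exploits when comparing the expected cost against the constant C.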
3.4 Theoretical Properties of Belief-MDP for CTP
In the following section, we are given an instance of CTP, where . We construct a belief state MDP of I, where S is the state set of I.
Definition 3.4.1**.**
Policy is called finite if the AO-graph for is acyclic(DAG).
Notation 3.4.2**.**
Denote the expected cost of the optimal policy of in belief state as ; namely, .
If the AO-graph for policy is acyclic then is finite [Bonet, 2010]. By definition, there is a traversable edge in G. Therefore, there is a policy with finite cost and hence is finite [Bonet, 2010]. It should be noted that all policies referred to in this section are finite.
Definition 3.4.3**.**
The predicate , defined over , is true if and only if the following properties are satisfied:
** 2. 2.
For all ,
- •
* if and only if .*
- •
* if .*
- •
* if or if .*
The predicate indicates that “ is at least as blocked as ”, meaning if the pair satisfies then .
Let . We demonstrate by the following table:
[TABLE]
Definition 3.4.4**.**
Let . Define the predicate to be true if and only if the following properties are satisfied:
** 2. 2.
For all ,
- •
* if and only if .*
- •
* if .*
- •
* if or if .*
Intuitively, means that “ is at least as open as ”, where the set of known open edges in is contained in the set of known open edges in .
We demonstrate by the following table:
[TABLE]
Definition 3.4.5**.**
We define the function as follows: Let such that , then is defined by the following(by its elements):
. 2. 2.
For all . 3. 3.
For all
Note that by corollary 2.5.9, can be determined from . The function is called since it “blocks” all the edges in (i.e. for every edge the function “changes” the status of edge in to ), while all the other elements in remain unchanged in .
For example, we are given belief state such that and . Hence, if then
Property 3.4.6**.**
For every belief state and a set of edges we have .
Proof.
Follows immediately from definition 3.4.5. ∎
Definition 3.4.7**.**
We define the function as follows: Let such that , then is defined by the following(by its elements):
. 2. 2.
*For all . * 3. 3.
For all
The function is called since it “opens” all the edges in (i.e. for every edge the function “changes” the status of edge in to ), while all the other elements in remain unchanged in .
For example, we are given belief state such that and . Hence, if then
Property 3.4.8**.**
For every belief state and a set of edges we have .
Proof.
Follows immediately from definition 3.4.7. ∎
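The two status-changing functions of definitions 3.4.5 and 3.4.7 can be illustrated with a minimal sketch. The names `Belief`, `block` and `open_edges`, the dictionary representation of a belief state, and the precondition that only Unknown edges may be changed are assumptions made for this illustration; they are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]
OPEN, BLOCKED, UNKNOWN = "Open", "Blocked", "Unknown"


@dataclass(frozen=True)
class Belief:
    """Hypothetical belief state: agent location plus one status per edge."""
    loc: str
    status: Dict[Edge, str]


def _set_status(b: Belief, edges: Iterable[Edge], new_status: str) -> Belief:
    """Return a copy of b in which every edge in `edges` gets `new_status`."""
    edges = list(edges)
    for e in edges:
        # Assumed precondition: only edges that are Unknown in b may be changed.
        if b.status[e] != UNKNOWN:
            raise ValueError(f"edge {e} is already known in this belief state")
    updated = dict(b.status)  # copy, so the original belief state is untouched
    updated.update({e: new_status for e in edges})
    return Belief(b.loc, updated)


def block(b: Belief, edges: Iterable[Edge]) -> Belief:
    """Mark the given unknown edges as Blocked; everything else is unchanged."""
    return _set_status(b, edges, BLOCKED)


def open_edges(b: Belief, edges: Iterable[Edge]) -> Belief:
    """Mark the given unknown edges as Open; everything else is unchanged."""
    return _set_status(b, edges, OPEN)


b = Belief("s", {("s", "a"): OPEN, ("a", "t"): UNKNOWN, ("s", "t"): UNKNOWN})
more_blocked = block(b, [("a", "t")])
more_open = open_edges(b, [("a", "t")])
```

Here `more_blocked` and `more_open` differ from `b` only in the status of the edge `("a", "t")`, which is the situation considered later in the proof of lemma 3.4.19.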
Definition 3.4.9**.**
We define the function as follows: is the set of all belief states such that . Meaning .
Note that the function is, in a sense, a generalization of an inverse function: for every and we get .
For instance, let where and such that . Then such that,
- •
- •
Definition 3.4.10**.**
*Let . The equivalence relation is defined as follows:
The belief states satisfy if and only if .*
Definition 3.4.11**.**
Let be a function. is defined to be the set of all edges incident to which are unknown in and known in , i.e.
[TABLE]
Definition 3.4.12**.**
Let be a set of finite policies over B. We define the function as follows: For every pair of belief states such that , the policy satisfies .
Definition 3.4.13**.**
We define the function as follows: is the set of all possible belief states that can be reached from belief state immediately after taking action . Meaning
[TABLE]
Definition 3.4.14**.**
Let and . Define
Lemma 3.4.15**.**
Let such that . Let such that . Let such that , then .
Proof.
Let . If then can be performed in any belief state. However, if then, for every , can be performed in if and only if and . By definition 3.4.3 we have . All belief states in B are consistent with a given realization (all belief states in B describe the knowledge about the same environment), and since is known in (), we get if and only if . Thus an agent in can perform if and only if an agent in can perform . But we are given that , hence can be performed in as well. Let and let . We are left to show that the status of all edges in is equal in and in , i.e. for all . By definition 3.4.3 we have , and thus . Thus, by definition of DiffEStatus, the status of all edges in is equal in and in . This satisfies all conditions for having . ∎
Lemma 3.4.16**.**
Let such that . Let and . Let and such that . Then, .
Proof.
Let . By definition of , , hence, by lemma 3.4.15, . Let , we define the probability by,
[TABLE]
By definition of transition function . For every , where , define . For every , , hence . We define the probabilities as follows,
[TABLE]
By definition of transition function, .
Summing up over all gives,
[TABLE]
- is equal for all .
** (The sum of all marginal probabilities equals 1).
For all we have , hence .
Thus,
[TABLE]
∎
Theorem 3.4.17**.**
Let such that . Then .
Proof outline: We prove that for every finite policy there is a finite policy such that .
Proof.
By induction. Let be the subset of all edges such that is unknown in and blocked in . i.e.
[TABLE]
Let and where . Let be a finite policy and let be a policy defined as follows:
For every belief state . Meaning, maps every belief state to an action by simulating on belief state and output . Clearly, an agent acting according to will never traverse any edge in . We show by induction that as follows,
- •
Base case: If are terminal states then (by definition of terminal states).
- •
Assume by induction that for every and we have . By definition of we have . Let . Since we have . Hence, according to the Bellman equations,
[TABLE]
In order to show that we show the equivalence in the right sides of the equations above.
Given then,
- –
(action Sense is always performable).
- –
. Let be the belief states that are reached immediately after the agent has sensed in respectively, and e was found to be blocked. By definition of transition function, . By assumption of induction, . Hence, . Similarly, let be the belief states that are reached immediately after the agent has sensed in respectively, and e was found to be open. Then . By assumption of induction, . Hence, .
Thus,
[TABLE]
Where denotes and denotes (recall that as well as are interchangeable).
Given , where , then,
- –
. By definition of reward function, for every if and only if if and only if and . Thus, if and only if and . From definition 3.4.3 it follows that . In addition, if and only if , due to the following:
All belief states reachable from , and in particular the belief states , are referred to the same unknown given environment . Hence, only if or and similarly only if or . 2. 2.
, . Since we have if and only if . By definition, an agent located in vertex , knows the status of all edges incident to v. Thus , .
- –
.
Let be a partition of B by the equivalence relation such that without loss of generality for every . Let for every . Then, by assumption of induction, for every , we have , hence as well.
According to lemma 3.4.16, summing up over all gives,
[TABLE]
Hence,
[TABLE]
Thus, summing up over all transition functions gives,
[TABLE]
This completes the induction proof.
We have shown that for every finite policy we can define a finite policy which satisfies . Since the optimal policy is also finite, the equation holds. Thus, in general . ∎
In figure 3.4, we demonstrate the “simulation” of policy presented in theorem 3.4.17. Here, we are given a graph G=(V,E), where (abusing notation, we denote one vertex as s, and one as t),
, where w, which is noted with each edge in the figure, represents the edge weight. In addition, two belief states are given with the following forms:
.
On the upper left of the figure, the edge statuses are based on and on the upper right the edge statuses are based on , where the green lines represent open edges, black lines represent blocked edges, and red lines represent unknown edges. Notice that satisfy and thus . The lower figures represent the execution of policy on where and on . We see the equivalence between the policies (the same sequence of actions). Notice that an agent acting according to (as shown by the dotted line) does not perform the action although it is optimal, since treats all edges in as blocked in (edge in this figure).
Corollary 3.4.18**.**
Suppose that is true for , then if is an admissible heuristic(optimistic) of , then is an admissible heuristic of as well.
Proof.
Follows from theorem 3.4.17 ∎
In the following statement, we use theorem 3.4.17 to show that if then is a lower bound of .
Lemma 3.4.19**.**
Let such that . Then, .
Proof.
Let such that differs only by the status of edge , where is Unknown in , Open in and Blocked in . We prove that .
- •
By the law of total probability we can express as follows:
[TABLE]
From lemma 3.4.3 we get that . Thus, there is such that,
[TABLE]
We can express equation 3.26 as follows:
[TABLE]
Subtracting from both sides and then dividing both sides by , we get:
[TABLE]
Since we get
[TABLE]
Trivially, it can be shown by induction that , for any set of edges such that edges in are unknown in and open in .
∎
Corollary 3.4.20**.**
Suppose that satisfy , then if is an admissible heuristic(optimistic) of , then is an admissible heuristic of as well.
Proof.
Follows immediately from lemma 3.4.19 ∎
In the rest of the section we provide some new definitions and lemmas in order to prove another lower bound on the cost of the optimal policy in belief state in terms of the cost of the optimal policy in another belief state , where, in contrast to the previous lemmas, the locations of the agent in and in are different.
Definition 3.4.21**.**
Let . We define the predicate to be true if and only if and for every edge .
Definition 3.4.22**.**
Define the set to be the set of all pairs such that . Meaning . We call the DiffLoc of B.
Definition 3.4.23**.**
Define the set to be the set of all edges that are known to be open in belief state . Meaning
Definition 3.4.24**.**
Let be the set of all paths in G and let be the DiffLoc of B. We define the function such that for is the shortest path between and in graph .(Note that is a subgraph of G=(V,E) since )
Note that for every , since the status of edges specified by and are equal.
Definition 3.4.25**.**
Let be the set of all paths in G. We define a path cost function as follows: Let be a path, then .
Definition 3.4.26**.**
We define the set to be the set of all known edges in belief state . Meaning . is called the knowledge in b.
Lemma 3.4.27**.**
The value of information in the Canadian Traveler Problem is never less than zero.
Proof.
Let such that is reached from immediately after performing SENSE(e) (Suppose hypothetically that an agent in is allowed to perform action SENSE(e) once, on any edge , with no cost) and we get . Hence, this lemma is true if and only if (by definition of value of information). Since , we can simulate any policy of on by “ignoring” the information received from SENSE(e) in and in particular the optimal policy . Therefore, we can define the policy such that for every belief state reachable from and belief state where . Since and are referred to the same physical environment, the execution of on will be equal to the execution of on (will produce the same sequence of actions). Hence, , and in general . ∎
Lemma 3.4.28**.**
Let such that , then .
Proof outline: In the next lemma we show that . This gives us a lower bound to since . We show this by defining a policy such that when executing on we have the following: An agent under moves through the shortest path(under assumption that all unknown edges in are blocked) to the location referred by , reaching belief state b’, and then under the agent is followed by the execution of the optimal policy .
Proof.
Assume, by way of contradiction, that
[TABLE]
Let and . We define a new policy such that executing it on gives the following:
1. An agent under traverses the path directly from to (which is always possible since all edges in p are open). Let be the belief state that the agent reaches when arriving at .
2. Immediately after reaching , the agent under acts according to the optimal policy until reaching t. Meaning, for any belief state b” reachable from .
Clearly,
[TABLE]
We claim that . This results from:
1. . The knowledge in and is equal by definition of element pairs .
2. . An agent in that follows the shortest path p may obtain information if a vertex in path p (a vertex that is incident to two edges in p) is incident to an edge that has not been sensed yet.
Since an agent A1 in and an agent A2 in are at the same physical state , and the knowledge of A1 about s is a subset of the knowledge of A2 about s, we get (this follows from the lemma on the value of information).
Thus,
[TABLE]
by equation 3.28 we get,
[TABLE]
Following assumption 3.27 we get,
[TABLE]
which is a contradiction to the optimality of policy . Hence,
[TABLE]
∎
Corollary 3.4.29**.**
Let such that , then if is a lower bound to , then is a lower bound to as well.
Proof.
Follows immediately from lemma 3.4.28 ∎
We want to define two relations which will be used in the next section:
Chapter 4 Generalizing PAO*
4.1 General Propagation AO*
In many cases, PAO* dramatically lowers the running time by reducing the state space. However, it assumes that each vertex is connected to at most one unknown edge, so that each AND node in the AND/OR graph has at most two successors. We present the generalized propagation AO* algorithm (Gen-PAO for short), a generalization of PAO*, which does not assume any prior knowledge of the graph (except the edges incident to which are always defined as Open). Gen-PAO solves the Sensing-CTP as well. Each sensing action is associated with a sensing AND node, where each sense node has only two child nodes for the two possible statuses of the sensed edge (Open/Blocked). This variant is considerably harder than the basic CTP, since the agent can sense any unknown edge in any state and hence the branching factor of the OR nodes is significantly larger.
4.1.1 Gen-PAO Heuristics
Similarly to AO* and PAO*, each iteration of Gen-PAO consists of two phases: Expansion and Propagation. Gen-PAO differs from AO* and PAO* only in the Propagation phase (i.e., the Main and Expand methods as presented in algorithm 2.1 are part of Gen-PAO as well). In the propagation phase, Gen-PAO propagates the heuristic values not only upwards to the ancestors, as AO* does, but to the entire state space, incorporating three novel heuristics: HBlocked, HOpen, and HDiffLoc (line 13). The heuristic HBlocked is based on the predicate MoreBlocked (definition 3.4.3), HOpen is based on the predicate MoreOpen (definition 3.4.4), and HDiffLoc is based on the predicate DiffLoc (definition 3.4.21). Let be the set of belief states that were expanded by Gen-PAO and let . The heuristics are defined as follows:
- •
HBlocked(b):* If there is a belief state that satisfies and then *.
- •
HOpen(b):* If there is a belief state that satisfies and then *.
- •
HDiffLoc(b):* If there is a belief state that satisfies and then *, where CSP is the cost of the shortest path from b to b’.
Belief states whose values are updated due to propagation from b are called propagated belief states of b. Notice that HBlocked, HOpen and HDiffLoc only ever raise the heuristic value of the propagated belief states of b. However, due to corollaries 3.4.18, 3.4.20 and 3.4.29 respectively, the heuristics HBlocked, HOpen and HDiffLoc are admissible, and thus they never exceed the optimal expected cost.
Figure 4.1 illustrates an update of belief state by the three heuristic methods (the new heuristic values of are shown in parentheses).
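The propagation step can be pictured with a small sketch. It assumes one reading of the HDiffLoc rule: if the agent can walk from the location of b to the location of b' over edges known to be open at cost CSP, then the heuristic value of b minus CSP is a lower bound for b'. The function names, the edge-list representation and the numbers are hypothetical, chosen only for illustration.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

WeightedEdge = Tuple[str, str, float]


def shortest_open_path_cost(open_edges: Iterable[WeightedEdge], src: str, dst: str) -> float:
    """Dijkstra restricted to edges known to be open; inf if dst is unreachable."""
    adj: Dict[str, List[Tuple[str, float]]] = {}
    for u, v, w in open_edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def raise_heuristic(h: Dict[str, float], target: str, lower_bound: float) -> bool:
    """Raise h[target] to lower_bound if that improves it; never lower a value."""
    if lower_bound > h.get(target, 0.0):
        h[target] = lower_bound
        return True
    return False


# Hypothetical numbers: b is located at "u", b_prime at "x".
h = {"b": 12.0, "b_prime": 3.0}
known_open = [("u", "v", 2.0), ("v", "x", 3.0), ("u", "x", 7.0)]
csp = shortest_open_path_cost(known_open, "u", "x")
updated = raise_heuristic(h, "b_prime", h["b"] - csp)
```

In this example the cheapest open route costs 5, so the value of `b_prime` is raised from 3 to 7. A bound that is smaller than the current value is ignored, which matches the remark that propagation never lowers a heuristic value.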
The heuristic methods are invoked when the value of a belief state is updated (procedure Propagate, line 13). The heuristic methods are ineffective on a major part of the expanded states (i.e. most of the expanded states do not satisfy the predicates , and , for a given expanded state , and their values are not updated by the corresponding heuristic methods). In order to reduce the number of expanded states that are checked for update, we use two data structures: and . To define these structures we introduce two new equivalence relations, and , on belief states, defined as follows:
Definition 4.1.1**.**
* if and only if:*
** 2. 2.
**
Similarly,
Definition 4.1.2**.**
* if and only if:*
** 2. 2.
**
Each of these structures is a hash table that contains the entire expanded state space (more precisely the hash table refers to ), where the entries of each table are divided into equivalence classes called “buckets”. Namely, the set of buckets , in , partitions into equivalence classes by the relation , while the set of buckets , in , partitions into equivalence classes by the relation . By the definitions above, never updates the heuristic value if b and b’ do not share the same bucket of , and similarly, never updates a value of b’ if b and b’ do not share the same bucket of .
Procedure (Algorithm 4.2) implements the heuristic . (line 1) returns the set of all expanded nodes whose belief states are in the same bucket of as the belief state of node . Similarly, procedure is an implementation of the heuristic . (line 1) returns the set of all expanded nodes whose belief states are in the same bucket of as the belief state of node .
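The bucket idea can be sketched as a hash table keyed by an equivalence-class signature. The key used below (agent location together with the set of edges known to be open) is only an assumed stand-in for the equivalence relations of definitions 4.1.1 and 4.1.2; `BucketIndex`, `open_edges_key` and the tuple encoding of belief states are hypothetical names for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Edge = Tuple[str, str]
# Hypothetical encoding: (location, ((edge, status), ...))
Belief = Tuple[str, Tuple[Tuple[Edge, str], ...]]


def open_edges_key(belief: Belief) -> Hashable:
    """Assumed bucket key: same location and same set of known-open edges."""
    loc, statuses = belief
    return (loc, frozenset(e for e, s in statuses if s == "Open"))


class BucketIndex:
    """Groups expanded belief states into buckets by an equivalence key."""

    def __init__(self, key_fn: Callable[[Belief], Hashable]) -> None:
        self._key_fn = key_fn
        self._buckets: Dict[Hashable, List[Belief]] = defaultdict(list)

    def add(self, belief: Belief) -> None:
        self._buckets[self._key_fn(belief)].append(belief)

    def candidates(self, belief: Belief) -> List[Belief]:
        """Only states in the same bucket need to be checked for an update."""
        return list(self._buckets.get(self._key_fn(belief), []))


b1: Belief = ("s", ((("s", "a"), "Open"), (("a", "t"), "Unknown")))
b2: Belief = ("s", ((("s", "a"), "Open"), (("a", "t"), "Blocked")))
b3: Belief = ("a", ((("s", "a"), "Open"), (("a", "t"), "Unknown")))

index = BucketIndex(open_edges_key)
for b in (b1, b2, b3):
    index.add(b)
same_bucket = index.candidates(b1)
```

With this key, `b1` and `b2` land in the same bucket while `b3` does not, so a value change in `b1` triggers a predicate check against `b2` only, rather than against every expanded state.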
4.1.2 Eliminating Duplicate Nodes
In most cases Gen-PAO expands the same node more than once. This may lead to a large expense of memory and run time when it is run on large graphs. Taking this into consideration, we introduce the Gen-PAO-EDN (short for Gen-PAO Eliminating Duplicate Nodes) algorithm, a variation of Gen-PAO that maintains a single OR node for every state by merging all duplicate OR nodes that share the same state into one OR node (more precisely, by preventing the expansion of duplicate nodes). There are two key differences between Gen-PAO-EDN and Gen-PAO:
- •
Gen-PAO-EDN maintains one representative OR node for each state. When Gen-PAO-EDN expands an AND node, it creates a new OR node only if its associated state is not represented by any OR node in the AND/OR graph. Otherwise, if a representative OR node for this state already exists, then the expanded AND node becomes an additional parent of the representative OR node.
- •
The AND/OR graph may contain cycles (it is not a tree as in AO* and Gen-PAO). A special type of cycle, called strongly connected (defined below), induces loops in the propagation phase if the cycle is a subgraph of the partial solution.
Definition 4.1.3**.**
Let be an AND/OR graph, be AND nodes, and be OR nodes. A cycle is strongly connected if for every is the preferred son of .
If the propagate method enters a strongly connected cycle (which occurs when the propagation goes upwards to the ancestors), the heuristic values are re-updated every iteration, where each update slightly raises the values of the nodes in .
At some point, one of the following eventually happens:
1. The value of one of the AND nodes in is raised to a level at which it ceases to be the preferred successor of its OR parent. Namely, at some point in the update process, there is an AND node , with a sibling , such that . Then becomes the preferred son. Hence, the cycle is no longer strongly connected and the loop ends.
2. The propagation process in raises the values of the nodes in until the values converge to a certain finite limit.
Clearly, if the values of the nodes in do not converge then case 1 must hold. In case 2, the propagation may enter an endless loop if the values do not converge within a finite number of iterations. In order to overcome this, each time the value of a node is updated, we check the delta , and stop the loop if , where is a small positive constant which is chosen before the run, and is the value of before the update. It should be noted that is chosen to be so small that it does not change the propagation process (i.e. if case 2 holds then case 1 would not hold even if the loop were never stopped).
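The stopping rule for case 2 can be sketched on a toy cycle. The update function below (a node whose value is a fixed cost plus half of its own previous value) is a hypothetical stand-in for one round of propagation through a strongly connected cycle, and `propagate_until_stable`, `epsilon` and `max_iters` are names assumed for this illustration.

```python
from typing import Callable, Tuple


def propagate_until_stable(
    update: Callable[[float], float],
    v0: float,
    epsilon: float = 1e-9,
    max_iters: int = 1_000_000,
) -> Tuple[float, int]:
    """Repeat `update` until the change in value drops below epsilon.

    Returns the last value and the number of updates performed.
    """
    v = v0
    for i in range(1, max_iters + 1):
        new_v = update(v)
        delta = abs(new_v - v)  # change relative to the value before the update
        if delta < epsilon:
            return new_v, i
        v = new_v
    raise RuntimeError("values did not stabilise within max_iters updates")


def toy_cycle_update(v: float) -> float:
    """Toy round of propagation: pay 1, then return to the same node w.p. 0.5."""
    return 1.0 + 0.5 * v


value, rounds = propagate_until_stable(toy_cycle_update, v0=0.0)
```

In this toy cycle each round raises the value by half of the previous increase, so the values approach the limit 2 without ever reaching it; the epsilon test is what terminates the loop.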
As we proposed a unifying approach to Gen-PAO, we now propose a unifying approach to AO*. The algorithm AO-EDN is an improvement of the AO* algorithm that unifies the OR nodes associated with the same state. In fact, AO-EDN is the same algorithm as Gen-PAO-EDN, except that it does not include the heuristics HBlocked and HOpen in the propagation phase.
Chapter 5 Empirical Results
In order to evaluate our scheme we implemented alternative algorithms for Gen-PAO and compared them by their execution time and by the size of their generated AND/OR graph (defined as the number of its nodes). Note that although the size of the AND/OR graph and the run time of Gen-PAO-EDN may decrease as a result of the heuristic propagation and node unification, the algorithms described in this section still require time exponential in the number of unknown edges, which makes this approach prohibitive for graphs with large sets of unknown edges.
5.1 Varying the Uncertainty of the Graph
In the first two experiments we explored how the uncertainty of the graph affects the performance of AO* (Section 2.4), Gen-PAO (Section 4.1), and AO-EDN (Section 4.1.2). The performance of each algorithm was measured for different graph sizes, where each graph had a different number of unknown edges. To ensure that the experiments could be performed within a reasonable time frame, the parameters were chosen so that a single run takes no more than a few minutes.
Figure 5.1 compares the performance of the algorithms above on instances of basic-CTP. Figure 5.1a and figure 5.1b show respectively the change in the execution time and in the size of the AND/OR graph as the number of unknown edges increases from 2 to 12. This comparison indicates that Gen-PAO has a significant advantage in execution time over AO*, since the heuristics embedded in Gen-PAO dramatically lower the size of the AND/OR graph. Moreover, Gen-PAO has a slight advantage in execution time over AO-EDN, although the AND/OR graph generated by AO-EDN is smaller than the graph generated by Gen-PAO. The increased execution time of AO-EDN is incurred by the overhead of the iterative propagation in the redundancy elimination process (Section 4.1.2), which depends on the cost of the default edge (the default edge cost was chosen to be 100).
Figure 5.2 shows the comparison between AO-EDN and Gen-PAO on instances of Sense-CTP (the sensing cost was fixed to 0.5 for all edges). AO* was discarded from this comparison due to an extremely large execution time. In contrast to the previous comparison, here AO-EDN outperforms Gen-PAO in AND/OR graph size (figure 5.2b) as well as in execution time (figure 5.2a). The elimination of redundant nodes provides an advantage despite the overhead, since the number of expansions saved by the unification increases considerably as the number of unknown edges increases. The plot does not go beyond 7 unknown edges since Gen-PAO consumes all the RAM on larger graphs.
It should be mentioned that since the performances of Gen-PAO-EDN (Section 4.1.2) and Gen-PAO are almost identical on instances of basic-CTP and Sense-CTP, the performance of Gen-PAO-EDN is not presented.
5.2 Gen-PAO Heuristic Estimate
5.2.1 Experimental Setting
We now define a variant of the Canadian Traveler Problem called Expensive Edges CTP (Exp-CTP for short). Exp-CTP is defined as CTP, except that each edge can be instead of . Formally, Exp-CTP is a 6-tuple where is a graph, P and w are respectively the probability and cost functions over the edges, are the start and goal vertices, and is a positive real number. denotes the probability that is cheap and denotes the probability that is expensive. An agent can traverse edge whether it is cheap or expensive. However, if the agent traverses and is cheap then it pays , and if is expensive then it pays , where (short for Detour cost) is a fixed cost which is higher than any edge cost (except the cost of the default edge ). In fact, Exp-CTP can also be defined as a subclass of CTP, where every unknown edge in G has a parallel path called a detour path, such that the path cost of is DC and is always traversable. Namely,
- •
and
- •
and
To evaluate the performance of the Gen-PAO heuristics we implemented four alternative algorithms for Gen-PAO-EDN, where in each algorithm a different heuristic was embedded in the propagation phase. Since the heuristics have almost no impact when Gen-PAO-EDN is applied to instances of basic-CTP and Sense-CTP, the algorithms were executed on instances of Exp-CTP. The implemented algorithms are as follows:
- •
PAO-Blocked - Gen-PAO-EDN which propagates the heuristic values according to HBlocked (Section 4.1).
- •
PAO-Open - propagates the heuristic values according to HOpen (Section 4.1).
- •
PAO-All - propagates the heuristic values according to HOpen and HBlocked.
- •
PAO-None - basic propagation with no heuristic included.
5.2.2 Varying the Sensing Cost
In order to learn the effect of the sensing cost on the algorithms' performance we conducted several runs using different fixed sensing costs (the sensing cost was equal for all edges) on a graph that consists of 8 vertices and 13 edges (10 edges are unknown). In all experiments, the probability of all unknown edges was fixed to 0.5. Figure 5.3a shows the change in the size of the AND/OR graph as the sensing cost increases from 0.1 to 1.1. This result indicates that the size of the AND/OR graph (generated by all variants of Gen-PAO) decreases as the sensing cost increases. We believe that this can be attributed to the increased number of expanded states in the AND/OR graph incurred by the low sensing cost, which makes the sensing action worthwhile. In particular, there exists a limit , such that every sensing cost below makes the Sense actions always preferable over the Move actions. This causes many expansions of Sense nodes and expansions of new belief states (that are not reachable without performing Sense), which results in a large AND/OR graph. The comparison of the algorithms shows that PAO-None generates a relatively small AND/OR graph for low sensing cost, while PAO-Blocked and PAO-All have an advantage at high sensing cost. This is also true for larger graphs that contain larger sets of unknown edges. We believe that this effect can be explained by the fact that at low levels of sensing cost it is worthwhile to sense unknown edges, which improves the accuracy of the heuristic values (at low cost). The high accuracy of the heuristic estimate leads to low rates of pruning, since the heuristics HBlocked and HOpen are based on the gap between the real and estimated values, which is small in this case. Thus, a large AND/OR graph was obtained. A comparison of the run time (figure 5.3b) shows that the run time grows as the size of the AND/OR graph increases.
The reason for this positive correlation is straightforward: the increased size of the graph leads to a larger computation time required for expanding the states, as well as for propagating the heuristic values to a larger set of states.
5.2.3 Varying the Open Probability
In this experiment we investigated the effect of the distribution over the edges on the performance of the variants of Gen-PAO-EDN (Section 4.1.2). In order to perform a simple experiment that analyzes this effect, we configured the graph such that all unknown edges were open with the same fixed probability, called the open probability, which is given as an input. Figure 5.4 illustrates the performance of the different heuristics on a graph that consists of 19 unknown edges for DC=7 and DC=9. Figures 5.4a and 5.4c show the change in the size of the AND/OR graph as the open probability increases from 0.1 to 0.9. These results indicate that for all algorithms there exists a certain value of open probability (p=0.5 on figure 5.4a and p=0.3 on figure 5.4c) such that for any value of open probability smaller than (called low open probability) the size of the AND/OR graph increases as rises, while for any value of open probability larger than (called high open probability) the size of the AND/OR graph decreases as rises.
This can be explained by the following reasons(referred to AND/OR graphs generated by all algorithms):
- •
On high open probability most of the decision nodes (OR nodes) decide correctly on their best action node when first expanded, without changing their decision afterwards. Thus, a relatively large portion of the expanded states is also part of the optimal policy graph and the AND/OR graph is relatively small. However, as the open probability decreases, the AND/OR graph size increases, since the heuristic estimates are less accurate and more alternative actions are considered for the optimal policy. This leads to an excessive expansion of nodes and a larger AND/OR graph.
- •
On low open probability, as the open probability decreases, the graph becomes “more blocked”, and the default path becomes preferable. In such cases, all variants of Gen-PAO-EDN tend to prune action nodes that are not associated with the default path (sensing or traversing edges that are not in the default path) and, as a result, a smaller AND/OR graph is obtained.
The comparison between the heuristics of Gen-PAO-EDN shows an advantage of PAO-Blocked and PAO-All at low open probabilities. This is due to the high pruning rate incurred by HBlocked at low open probability, where the gap between the heuristic values and the real values is large. Again, HBlocked is effective since the chances that the heuristic value of a different belief state will be updated are high (see the conditions of HBlocked in section 4.1).
Figures 5.4b and 5.4d show the time spent by the four algorithms. As in previous experiments, there is a tight correlation between the execution time and the AND/OR graph size. The AND/OR graphs generated by PAO-Blocked and PAO-All are smaller than those of PAO-None at all levels of open probability; however, the runtime advantage of PAO-Blocked and PAO-All occurs at low open probability.
5.3 Value of Clairvoyance
In order to get some general indication of the total value of information, we checked the ratio (see [Papadimitriou and Yannakakis, 1991]), denoted by , on instances of basic CTP and Exp-CTP. is defined as where is the expected cost of the optimal policy and AS is the expected cost of the optimal policy given that the graph is fully observable (it can also be described as the expected cost of the policy Always Sense when the sensing cost is 0; see [Bnaya, Felner and Shimony]). Formally, let be the paths in the graph ordered by their path cost, be the probability that path is traversable, and be the path cost of path ; then can be described as follows:
[TABLE]
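The fully observable quantity AS can also be computed directly on small instances by enumerating every realization of the unknown edges and taking the shortest path in each. The sketch below does this by brute force; it is an illustration under the assumption of independent edges, not the closed form used in the thesis, and the function names and the three-vertex instance are hypothetical.

```python
import heapq
from itertools import product
from typing import Dict, List, Sequence, Tuple

KnownEdge = Tuple[str, str, float]           # (u, v, cost), always open
UnknownEdge = Tuple[str, str, float, float]  # (u, v, cost, probability of being open)


def shortest_path_cost(edges: Sequence[KnownEdge], src: str, dst: str) -> float:
    """Plain Dijkstra on an undirected graph; inf if dst is unreachable."""
    adj: Dict[str, List[Tuple[str, float]]] = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")


def expected_clairvoyant_cost(
    known: Sequence[KnownEdge], unknown: Sequence[UnknownEdge], src: str, dst: str
) -> float:
    """Expected shortest-path cost when every edge status is revealed up front."""
    total = 0.0
    for pattern in product((True, False), repeat=len(unknown)):
        prob = 1.0
        realized = list(known)
        for is_open, (u, v, w, p) in zip(pattern, unknown):
            prob *= p if is_open else 1.0 - p
            if is_open:
                realized.append((u, v, w))
        if prob == 0.0:
            continue  # impossible realization, contributes nothing
        total += prob * shortest_path_cost(realized, src, dst)
    return total


# Hypothetical instance: a default edge of cost 10 and a cheap two-edge route.
known_edges = [("s", "t", 10.0)]
unknown_edges = [("s", "a", 1.0, 0.5), ("a", "t", 1.0, 0.5)]
AS = expected_clairvoyant_cost(known_edges, unknown_edges, "s", "t")
```

Here the cheap route is open with probability 0.25, so AS = 0.25 · 2 + 0.75 · 10 = 8. The enumeration is exponential in the number of unknown edges, so it is only practical for graphs of the size used in these experiments.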
We performed experiments on instances of basic CTP for different values of open probability and of the default edge cost. Results for graph 7V11E (figure 5.5) show that is relatively high for low values of the default edge cost (where the default edge cost is 20). This can be explained by the fact that is relatively low, since the agent would not traverse the default edge if there exists an open path to the target (in addition to the default edge); however, is almost as high as the cost of the default edge, since it is usually worthwhile to traverse the default edge when MaxEdge is low (note that is always lower than MaxEdge). In addition, is high at low open probabilities (i.e. on ), since the graph “tends” to be blocked and the default edge is preferable over the “cheap” paths. However, at an extremely high cost of the default edge (not illustrated in the figure), i.e. on , RV is low even at low open probability (around 1.3), since the agent takes the default edge only if there is no open path other than the default edge.
An analogous experiment was performed on instances of Exp-CTP for the same graph as used in the previous experiment. RV was measured for different values of DC and open probability while the default edge cost remained fixed (default edge cost is 200). Figure 5.6 shows that the result is qualitatively similar to the results of the previous experiment; however, lower values of RV were obtained over the whole domain. The reason for this similarity is the same as in the previous experiment, except that now the agent prefers to traverse the detour path instead of the default edge. RV is lower than in the previous experiment since the path cost, on average, is higher (it is sometimes required to pay DC several times) and thus is higher.
Chapter 6 Summary
6.1 Contributions
In this thesis we explored the Canadian Traveler Problem theoretically and empirically. In the context of the theoretical analysis, the following results have been proved:
- •
Correlated-CTP is at least as hard as Sensing-CTP.
- •
CTP-PATH-DEP is NP-hard.
- •
CTP-FOR-DEP is solvable in polynomial time.
- •
Properties of Belief MDP for CTP.
The main aspect of the practical analysis is the framework of Gen-PAO, where its main contributions are:
- •
Gen-PAO extends the PAO* algorithm such that it is not restricted to special types of graphs.
- •
Gen-PAO optimally solves instances of Exp-CTP and Sensing-CTP in addition to basic CTP.
- •
Two heuristics, HBlocked and HOpen, have been proposed. HBlocked and HOpen can be plugged into Gen-PAO and in some cases reduce the size of the AND/OR graph and the execution time.
In addition, we analyzed the parameter RV for instances of Exp-CTP and basic CTP and showed its general behavior.
6.2 Future work
Much remains to be done in the theoretical analysis of the CTP, in particular classifying other subclasses of the CTP. On the practical side, Gen-PAO can be further modified to solve other types of CTP, such as Correlated CTP and multi-agent CTP. Moreover, we believe that Gen-PAO can be further enhanced by adapting it to other types of POMDP problems. It may also be worthwhile to improve the performance of Gen-PAO by implementing heuristics that specialize in specific types of graphs.
References
[1] D. Ferguson, A. Stentz, and S. Thrun. PAO* for planning with hidden state. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation, pages 2840–2847, 2004.
[2] A. Bar-Noy and B. Schieber. The Canadian traveller problem. In SODA, pages 261–270, 1991.
[3] Z. Bnaya, A. Felner, and S. E. Shimony. Canadian traveler problem with remote sensing. In IJCAI, pages 437–442, 2009.
[4] S. Koenig, C. A. Tovey, and Y. V. Smirnov. Performance bounds for planning in unknown terrain. Artif. Intell., 147(1-2):253–279, 2003.
[5] E. Nikolova and D. R. Karger. Route planning under uncertainty: The Canadian traveller problem. In AAAI, pages 969–974, 2008.
[6] C. Papadimitriou and M. Yannakakis. Shortest paths without a map. Theor. Comput. Sci., 84(1), 1991.
