Intermittently Observable Markov Decision Processes
Gongpu Chen, Soung-Chang Liew

TL;DR
This paper addresses decision-making in Markov Decision Processes with unreliable, intermittent state information, proposing new formulations and algorithms to find near-optimal policies efficiently despite information losses.
Contribution
It introduces a belief MDP and a tree MDP formulation for intermittent observations, along with finite-state approximations and a nested value iteration algorithm for improved efficiency.
Findings
Finite-state approximations enable near-optimal policy computation.
Nested value iteration outperforms standard methods in speed.
Numerical results confirm the effectiveness of proposed algorithms.
| NVI | VI | R-NVI | ||
|---|---|---|---|---|
| Time (s) | 16 | 44 | 63 | |
| Iteration | 60 | 48 | 288 | |
| Time (s) | 86 | 259 | 310 | |
| Iteration | 62 | 53 | 265 | |
| Time (s) | 103 | 268 | 301 | |
| Iteration | 58 | 49 | 196 | |
| Time (s) | 417 | 1316 | 1461 | |
| Iteration | 53 | 47 | 235 | |
| Time (s) | 180 | 580 | 649 | |
| Iteration | 65 | 57 | 285 | |
| TA() | TA() | TA() | ||
| Value | 366 | 368 | 368 | |
| Time (s) | 0.48 | 34 | 0.64 | |
| Value | 310 | 318 | 318 | |
| Time (s) | 0.52 | 34 | 0.65 | |
| Value | 196 | 215 | 215 | |
| Time (s) | 0.45 | 30 | 0.72 | |
| Value | 155 | 175 | 175 | |
| Time (s) | 0.44 | 30 | 0.80 | |
| TA(2) | TA(3) | TA(6) | |
|---|---|---|---|
| 69 | 69 | 69 | |
| 90 | 90 | 90 | |
| 192 | 193 | 193 | |
| 395 | 397 | 397 |
Gongpu Chen
Soung Chang Liew
Department of Information Engineering
The Chinese University of Hong Kong
Shatin, Hong Kong SAR, China
Abstract
This paper investigates MDPs with intermittent state information. We consider a scenario where the controller perceives the state information of the process via an unreliable communication channel. The transmissions of state information over the whole time horizon are modeled as a Bernoulli lossy process. Hence, the problem is finding an optimal policy for selecting actions in the presence of state information losses. We first formulate the problem as a belief MDP to establish structural results. The effect of state information losses on the expected total discounted reward is studied systematically. Then, we reformulate the problem as a tree MDP whose state space is organized in a tree structure. Two finite-state approximations to the tree MDP are developed to find near-optimal policies efficiently. Finally, we put forth a nested value iteration algorithm for the finite-state approximations, which is proved to be faster than standard value iteration. Numerical results demonstrate the effectiveness of our methods.
Keywords: MDP, state information losses, truncated approximation, structural results, nested value iteration.
1 Introduction
A Markov decision process (MDP) is a widely adopted model for sequential decision-making in discrete-time stochastic control problems. At each time step, the process is in some state $s$, and a controller is responsible for selecting an action $a$ from the set of available actions of state $s$. Upon the execution of action $a$, the process stochastically transitions to a new state $s'$ at the next time step and generates a reward $r(s,a)$. Solving the MDP consists in finding an optimal policy of selecting actions for the controller to maximize the total reward over a time horizon. It is well known that the current state of the process is sufficient for computing the optimal action at any time step (Puterman, 1994). We thus refer to a policy as a mapping from the state space to the action space.
At every time point of decision-making, the controller needs the current state information to determine the action to be applied. In the classical setting, the current state information is assumed to be always available to the controller. This assumption, however, is not valid in many practical applications. A typical example is a situation where the controller relies on a remote sensor for perceiving the state information of the process; at each time step, the remote sensor observes the current state of the process and transmits the information to the controller via wireless communication. Such a situation is ubiquitous nowadays as wireless sensor network technology is being rapidly developed and deployed (Yick et al., 2008; Kandris et al., 2020). However, despite its many advantages, wireless communication is generally not reliable. In particular, information transmissions over a wireless communication channel may occasionally fail due to channel fading and environmental interference (Tse and Viswanath, 2005). As a result, the state information cannot always be delivered successfully from the remote sensor to the controller; hence the controller has to select actions for the MDP in the presence of state information losses. We refer to this problem as the intermittently observable MDP (IOMDP) and investigate it in this paper from theoretical and algorithmic perspectives.
Studies on MDPs with imperfect information transmissions can be traced back to Brooks and Leondes (1972), who investigated MDPs with one-step delayed state information. Continuing along this line of research, Katsikopoulos and Engelbrecht (2003) extended the framework and showed that an MDP with delays (whether constant or random delays) can be reduced to an MDP without delays, differing only in the size of state space. Adlakha et al. (2012) studied a group of coupled MDPs with delayed state information and provided a bound on the finite history of observations and control needed for the optimal control. In addition, state information delays have also been considered in partially observable MDPs (POMDPs) (Kim and Jeong, 1987; Bander and White III, 1999) and decentralized control problems (Hsu and Marcus, 1982; Varaiya and Walrand, 1978).
While delayed state information has been extensively investigated, intermittent state information has not attracted particular attention in MDP studies. Perhaps this is because IOMDPs naturally fall into the category of POMDPs. Specifically, if the state information is not received by the controller at some time step, then the previously received state information could serve as a noisy observation that implies a distribution of the current state. The mainstream method for treating a POMDP is to reformulate it as a fully observable MDP by constructing belief states (Astrom, 1965; Krishnamurthy, 2016). The resulting MDP is thus sometimes called belief MDP (Kaelbling et al., 1998). Under this formulation, it can be proved that the value function of the belief MDP is a piecewise linear and convex function of the belief state (Smallwood and Sondik, 1973; White III and Harrington, 1980; Araya et al., 2010). This nice property has been extensively exploited to develop algorithms for finding optimal policies, resulting in the one-pass algorithm (Smallwood and Sondik, 1973), the linear support algorithm (Cheng, 1988), incremental pruning (Cassandra et al., 2013), the duality-based approach (Zhang, 2010), and others. In addition, there are also many algorithms that approximate the exact value iteration solution, including the point-based value iteration (Pineau et al., 2003; Porta et al., 2006), heuristic search value iteration (Smith and Simmons, 2012), and others (Shani et al., 2007; Kurniawati et al., 2008; Poupart et al., 2011).
Although there is a rich set of algorithms for POMDPs, finding optimal or near-optimal policies is still computationally complex. A major reason is the so-called curse of dimensionality: given a problem with $n$ physical states, the belief state space of the associated POMDP is the $(n-1)$-dim probability simplex, which is a continuous space. As a result, general POMDP algorithms are inefficient for solving IOMDPs. Fortunately, this paper finds that the special structure of the problem can be exploited to find near-optimal policies efficiently.
In this paper, we model the transmissions of state information as a Bernoulli process whose parameter (i.e., the state information reception probability) is determined by the communication environment. We first adopt the belief MDP formulation to analyze the effect of state information losses on the optimal value. Then, we reformulate the IOMDP using the sufficient history—a segment of history information that is sufficient for making the optimal decision. The set of sufficient histories can be organized in a tree structure. We thus call the new formulation a tree MDP. Based on the tree MDP formulation, we propose two efficient finite-state approximations for the problem to find near-optimal policies.
The contributions of this paper are as follows:
- We show that the optimal value monotonically increases with the state information reception probability and provide a bound for the performance regret caused by state information losses.
- We reformulate the IOMDP as a tree MDP and propose a truncated approximation to the tree MDP for finding near-optimal policies. A theoretical bound for the approximation error is derived to better understand the method.
- We find that a majority of the states of the tree MDP are redundant in the sense that they will never be visited. This motivates us to put forth the high-order truncated approximation, a modified truncated approximation that can identify the redundant states and omit them when computing policies. It is more efficient and scalable than the original truncated approximation because of the reduction of state space.
- We propose a variant of value iteration, called nested value iteration, for the truncated approximation of the tree MDP. We show theoretically that the nested value iteration converges to the optimal value function faster than the standard value iteration. It also applies to the high-order truncated approximation.
The rest of the paper is organized as follows. Section 2 presents the preliminaries and problem formulation. Section 3 establishes fundamental structural results. Section 4 introduces the tree MDP formulation and proposes the truncated approximation. Section 5 develops the high-order truncated approximation. Section 6 proposes the nested value iteration algorithm. Section 7 presents the experimental results. Finally, Section 8 concludes this paper.
2 Problem Statement
2.1 Preliminary: MDP
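An MDP is defined by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \beta)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition kernel, $r$ is the reward function, and $\beta \in (0,1)$ is the discount factor. This paper assumes that both $\mathcal{S}$ and $\mathcal{A}$ are finite sets. For simplicity, we index the states and actions by letting $\mathcal{S} = \{1, 2, \dots, n\}$ and $\mathcal{A} = \{1, 2, \dots, m\}$. Let $s_t$ and $a_t$ denote the state and action at time $t$. The following expressions will be used interchangeably:
$$P(j \mid i, a) = P_a(i, j) = \Pr(s_{t+1} = j \mid s_t = i, a_t = a).$$
We will use $P_a$ to denote the transition matrix associated with action $a$. Solving the MDP consists in finding an optimal policy $\pi^*$ that maximizes the expected total discounted reward
$$V^{\pi}(i) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \beta^{t} r(s_t, a_t) \,\middle|\, s_0 = i\right], \quad i \in \mathcal{S}. \tag{1}$$
More generally, given a distribution $b$ of the initial state, we can define
$$V^{\pi}(b) = \sum_{i \in \mathcal{S}} b(i)\, V^{\pi}(i), \quad b \in \Delta, \tag{2}$$
where $\Delta$ is the $(n-1)$-dim probability simplex defined as
$$\Delta = \left\{ b \in \mathbb{R}^{n} : \sum_{i=1}^{n} b(i) = 1,\ b(i) \ge 0 \text{ for all } i \right\}.$$
At each time $t$, the state information $s_t$ is transmitted to the controller, based on which the controller computes an action $a_t$ and applies it to the process. Upon the execution of $a_t$, the process generates a reward $r(s_t, a_t)$ and transitions to state $s_{t+1}$ at the next time step. In the classical setting, the transmissions of $s_t$ are always timely and reliable. Then the optimal policy for the MDP can be determined by the Bellman equation:
$$V^{*}(i) = \max_{a \in \mathcal{A}} \left\{ r(i, a) + \beta \sum_{j \in \mathcal{S}} P_a(i, j)\, V^{*}(j) \right\}, \quad i \in \mathcal{S}.$$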
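The Bellman equation above is commonly solved by value iteration. The following is a minimal Python sketch of that standard procedure, not the authors' implementation. The function name `value_iteration`, the toy transition matrices, and the rewards are illustrative assumptions, and states and actions are indexed from 0 rather than 1.

```python
def value_iteration(P, r, beta, tol=1e-10, max_iter=100_000):
    """Standard value iteration for a finite MDP.

    P[a][s][s2] is the probability of moving from s to s2 under action a,
    r[s][a] is the one-step reward, and beta is the discount factor.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    n, m = len(r), len(P)
    V = [0.0] * n
    for _ in range(max_iter):
        # One Bellman backup for every state-action pair.
        Q = [[r[s][a] + beta * sum(P[a][s][s2] * V[s2] for s2 in range(n))
              for a in range(m)] for s in range(n)]
        V_new = [max(q) for q in Q]
        diff = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if diff < tol:
            break
    policy = [max(range(m), key=lambda a: Q[s][a]) for s in range(n)]
    return V, policy


# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # transition matrix of action 0
    [[0.5, 0.5], [0.6, 0.4]],  # transition matrix of action 1
]
r = [[1.0, 0.0], [0.0, 2.0]]   # r[s][a]
V, policy = value_iteration(P, r, beta=0.9)
```

The returned `V` satisfies the Bellman equation up to the tolerance, which is the property the later sections rely on when the underlying MDP is used as a building block.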
2.2 Intermittently Observable MDP
Consider a scenario where the controller relies on a remote sensor for perceiving the state information of the process. Suppose that the controller and the remote sensor are physically separated and communicate via an unreliable wireless channel. Due to environmental interference and channel fading, transmissions over the wireless channel may occasionally fail. As a result, the state information observed by the sensor may not always be delivered successfully to the controller. This paper proposes the IOMDP framework to address the problem of selecting actions for the MDP with intermittent state information.
Formally, define a Bernoulli random variable $\theta_t$ as an indicator for the transmission of $s_t$. Let $\theta_t = 1$ if $s_t$ is successfully transmitted to the controller before the time point of determining $a_t$ and $\theta_t = 0$ otherwise. We assume there is no delayed arrival of state information, i.e., the controller will never receive $s_t$ if $\theta_t = 0$. Further, assume that $\{\theta_t\}$ is a sequence of i.i.d. random variables and that
$$\Pr(\theta_t = 1) = \rho, \qquad \Pr(\theta_t = 0) = 1 - \rho.$$
Quantity $\rho$ is referred to as the state information reception probability (SIRP). An IOMDP with SIRP $\rho$ is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \beta, \rho)$.
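To analyze the IOMDP, we reformulate the problem as a belief MDP denoted by $\mathcal{M}_\rho = (\mathcal{B}, \mathcal{A}, P_\rho, R, \beta)$. Specifically, $\mathcal{B}$ is the set of belief states, $P_\rho$ is the transition kernel of belief states, and $R$ is the reward function. The remaining symbols are of the same meaning as before. The belief MDP is a special kind of POMDP, in which we use a belief state $b \in \Delta$ to represent a probability distribution over the state space $\mathcal{S}$. Let $\tau_k$ denote the time index of the $k$-th successfully transmitted state information. Due to possible transmission failures, $\tau_{k+1} - \tau_k$ may be greater than 1. Suppose there are in total $k$ successful transmissions up to time $t$ (i.e., $\tau_k \le t < \tau_{k+1}$), then $\{s_{\tau_1}, \dots, s_{\tau_k}\}$ is the set of all the state information received by the controller up to time $t$. The belief state at time $t$, denoted by $b_t$, is a sufficient statistic for the given history:
$$b_t(i) = \Pr(s_t = i \mid s_{\tau_1}, \dots, s_{\tau_k}, a_0, \dots, a_{t-1}) = \Pr(s_t = i \mid s_{\tau_k}, a_{\tau_k}, \dots, a_{t-1}). \tag{3}$$
The second equality above follows from the Markovian property. It means that the belief state depends on the latest received state information and the following actions. We will refer to the sequence $(s_{\tau_k}, a_{\tau_k}, \dots, a_{t-1})$ as the sufficient history at time $t$. Note that $b_t \in \Delta$ with its $i$-th element, denoted by $b_t(i)$, being the probability of $s_t = i$ conditioned on the sufficient history. In particular, if $\theta_t = 1$ (i.e., the controller receives $s_t$), then $b_t$ reduces to a one-hot vector. We will use $e_i$ to denote the one-hot vector with the $i$-th element being 1 and other elements being 0. If $s_{t+1}$ is not delivered successfully, then $b_{t+1}$ is determined by $b_t$ and $a_t$. In particular, the transition probability of the belief MDP is given by
$$\Pr(b_{t+1} = b' \mid b_t = b, a_t = a) = \begin{cases} \rho\, [b P_a]_i, & \text{if } b' = e_i,\ i \in \mathcal{S}, \\ 1 - \rho, & \text{if } b' = b P_a, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$
where $[b P_a]_i$ denotes the $i$-th element of vector $b P_a$. Finally, the reward function of the belief MDP is given by
$$R(b, a) = \sum_{i \in \mathcal{S}} b(i)\, r(i, a).$$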
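The belief dynamics described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not code from the paper: beliefs are treated as row vectors, states and actions are indexed from 0, and the helper names (`propagate`, `belief_successors`, `belief_reward`) as well as the toy numbers are hypothetical.

```python
def propagate(b, Pa):
    """Row vector times matrix: the belief after one step without a reception."""
    n = len(b)
    return [sum(b[i] * Pa[i][j] for i in range(n)) for j in range(n)]


def one_hot(i, n):
    """One-hot belief: the controller knows the state is i."""
    return [1.0 if j == i else 0.0 for j in range(n)]


def belief_successors(b, a, P, rho):
    """List of (probability, next_belief) pairs for belief b and action a.

    With probability rho the next state is received, so the belief collapses
    to a one-hot vector; with probability 1 - rho it is only propagated.
    """
    n = len(b)
    nxt = propagate(b, P[a])
    succ = [(rho * nxt[i], one_hot(i, n)) for i in range(n) if rho * nxt[i] > 0.0]
    if rho < 1.0:
        succ.append((1.0 - rho, nxt))
    return succ


def belief_reward(b, a, r):
    """Expected one-step reward when the state is distributed according to b."""
    return sum(b[s] * r[s][a] for s in range(len(b)))


# Toy 2-state, 2-action example (hypothetical numbers).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
r = [[1.0, 0.0], [0.0, 2.0]]
succ = belief_successors([1.0, 0.0], 0, P, rho=0.7)
```

In this example `succ` holds two one-hot successors (a reception occurred) and one propagated belief (the transmission failed), and the three probabilities sum to one.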
The belief MDP can be viewed as a fully observable MDP; hence its optimal policy can be determined by the Bellman equation:
$$V_\rho(b) = \max_{a \in \mathcal{A}} \left\{ R(b, a) + \beta \left[ \rho \sum_{i \in \mathcal{S}} [b P_a]_i\, V_\rho(e_i) + (1 - \rho)\, V_\rho(b P_a) \right] \right\},$$
where we use $V_\rho$ to denote the optimal value function of $\mathcal{M}_\rho$. In the context that $\rho$ is fixed, we may omit $\rho$ and simply write the value function as $V$. For every $b$ and $a$, $R(b, a)$ can be viewed as the expected one-step reward generated by the underlying MDP given that the current state follows a distribution $b$. Therefore, $V_\rho(b)$ is the maximum expected total discounted reward that can be obtained from the underlying MDP with an initial belief state $b$ and SIRP $\rho$.
In principle, the belief state can be any element in $\Delta$ (i.e., $\mathcal{B} = \Delta$). Then the belief MDP has a continuous state space. However, as implied by (4), given any initial belief state $b_0$, the set of possible belief states is countable, i.e.,
$$\mathcal{B}(b_0) = \left\{ b P_{a_1} \cdots P_{a_k} : b \in \{b_0, e_1, \dots, e_n\},\ a_1, \dots, a_k \in \mathcal{A},\ k \ge 0 \right\}.$$
Unless otherwise specified, we will consider $\mathcal{B}(b_0)$ instead of $\Delta$ to avoid technical issues related to measurability.
3 Structural Results
This section establishes structural results that are fundamental for understanding IOMDPs. The aim is to identify the relationship between the state information reception probability and the optimal value function.
Let $\Pi$ denote the set of stationary deterministic policies for the belief MDP. It is well known that $\Pi$ contains at least one optimal policy; hence we will focus on policies in $\Pi$. For any $\pi \in \Pi$, let $\pi(b)$ denote the action taken by policy $\pi$ in belief state $b$. Denote by $V_\rho^{\pi}$ the value function of $\mathcal{M}_\rho$ under policy $\pi$ and $P_\rho^{\pi}$ the associated transition matrix. Their relationship can be expressed in vector form as
$$\mathbf{V}_\rho^{\pi} = \mathbf{r}^{\pi} + \beta P_\rho^{\pi} \mathbf{V}_\rho^{\pi},$$
where $\mathbf{V}_\rho^{\pi}$ is the vector form of the value function over the belief state space $\mathcal{B}$, and $\mathbf{r}^{\pi}$ is the vector of rewards associated with policy $\pi$. Likewise, we will denote by $\mathbf{V}_\rho$ the vector form of the optimal value function of $\mathcal{M}_\rho$.
We first introduce a useful lemma. For any $\rho$, $\mathcal{M}_\rho$ is a POMDP whose reward function is linear in the belief state. The piecewise linearity and convexity of the optimal value function have been extensively exploited in POMDPs of this kind (Smallwood and Sondik, 1973). Formally,
Lemma 1
For any $\rho$, $V_\rho(b)$ is a piecewise linear and convex function of $b$.
With Lemma 1, we are ready to establish a key result that reveals the relationship between the optimal value function and the SIRP.
Theorem 2
For any $b \in \mathcal{B}$, $V_\rho(b)$ is an increasing function of $\rho$.
**Proof** See Appendix A.
Theorem 2 is consistent with our intuition. It shows that the optimal value function of the belief MDP is monotonically increasing with the state information reception probability. In other words, the more state information the controller can receive, the greater its expected total discounted reward from the underlying MDP.
The following result is useful in robustness and worst-case performance analysis.
Corollary 3
For any , suppose that is an optimal policy for . Then for any and , .
**Proof ** The second inequality has been established in the proof of Theorem 2. Since , it follows from Theorem 2 that .
Although straightforward, the lemma below is worth highlighting. If we apply a value iteration algorithm to solve $\mathcal{M}_\rho$ with any $\rho$, $V_1$ is a good starting point for the value iteration. As we know, a good starting point may considerably speed up the value iteration algorithm. Since the underlying MDP is much easier to solve than the belief MDP, the following result can be used in practical computations to reduce computation time.
Lemma 4
For $\rho = 1$, the optimal value function $V_1$ can be derived from $V^{*}$, the optimal value function of the underlying MDP. In particular,
- 1. $V_1(e_i) = V^{*}(i)$, $i \in \mathcal{S}$;
- 2. For $b \notin \{e_1, \dots, e_n\}$,
$$V_1(b) = \max_{a \in \mathcal{A}} \left\{ \sum_{i \in \mathcal{S}} b(i)\, r(i, a) + \beta \sum_{i \in \mathcal{S}} [b P_a]_i\, V^{*}(i) \right\}.$$
**Proof** The initial belief state being $e_i$ means that the controller has observed that the initial state of the underlying MDP is $i$. Since $\rho = 1$, the controller receives $s_t$ for all $t$. As a result, the belief state always remains in the set $\{e_1, \dots, e_n\}$ and the belief MDP reduces to the original MDP. Hence $V_1(e_i) = V^{*}(i)$. Statement 2 follows immediately from the Bellman equation and statement 1.
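Lemma 4 suggests a cheap way to obtain a starting point for value iteration on the belief MDP: solve the underlying MDP once, then apply a single backup to any belief of interest. The sketch below illustrates this in Python. It is a sketch under stated assumptions rather than the authors' code: the names `mdp_value_iteration` and `v1_from_mdp` and the toy numbers are hypothetical, and states and actions are indexed from 0.

```python
def mdp_value_iteration(P, r, beta, tol=1e-10):
    """Optimal value function of the underlying (fully observed) MDP."""
    n, m = len(r), len(P)
    V = [0.0] * n
    while True:
        V_new = [max(r[s][a] + beta * sum(P[a][s][j] * V[j] for j in range(n))
                     for a in range(m)) for s in range(n)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def v1_from_mdp(b, P, r, V_star, beta):
    """Value of belief b when every later state is received (SIRP equal to 1).

    The first action must be chosen from b alone; afterwards the controller
    observes the state, so the continuation value is V_star.
    """
    n, m = len(b), len(P)
    best = float("-inf")
    for a in range(m):
        nxt = [sum(b[i] * P[a][i][j] for i in range(n)) for j in range(n)]
        q = (sum(b[i] * r[i][a] for i in range(n))
             + beta * sum(nxt[j] * V_star[j] for j in range(n)))
        best = max(best, q)
    return best


# Toy 2-state, 2-action example (hypothetical numbers).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
]
r = [[1.0, 0.0], [0.0, 2.0]]
beta = 0.9
V_star = mdp_value_iteration(P, r, beta)
b = [0.5, 0.5]
v1_b = v1_from_mdp(b, P, r, V_star, beta)
classical_b = sum(b[i] * V_star[i] for i in range(len(b)))
```

Here `v1_b` never exceeds `classical_b`, because a single action chosen from the belief cannot beat choosing the best action separately for each realized initial state.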
Clearly, $V_1(e_i) = V^{*}(e_i)$ for all $i \in \mathcal{S}$. It is interesting that, for a belief state $b$ that is not one-hot, $V_1(b)$ given by Lemma 4 is in general not equal to $V^{*}(b) = \sum_{i} b(i) V^{*}(i)$ defined in (2). Both of them consider that the initial state of the underlying MDP follows a distribution $b$. However, $V^{*}(b)$ in the classical setting assumes that the controller can observe exactly the initial state $s_0$ before it computes the first action $a_0$; hence $a_0$ depends on the specific realization of $s_0$ and is always optimal. By contrast, $V_1(b)$ corresponds to the case that the controller does not know the exact initial state and needs to determine $a_0$ only based on $b$; consequently, $a_0$ is likely to be sub-optimal for a particular realization of $s_0$. Therefore, $V_1(b) \le V^{*}(b)$ and the difference $V^{*}(b) - V_1(b)$ is the performance regret of the belief MDP generated by the uncertainty at the first time step.
The above discussion is useful for understanding the next theorem, where we provide a bound for the performance regret caused by state information losses.
Theorem 5
For any , is continuous at any . In addition, denote by an optimal policy for . Then, for any ,
[TABLE]
where is given by (let )
[TABLE]
**Proof** See Appendix A.
Basically, the upper bound for the performance regret is obtained by applying the optimal policy of (i.e., ) to the belief MDP and comparing the resulting value function with . Start from a belief state , and generate the same instantaneous reward when they are controlled by the same policy ; the performance regret comes from the second step. For under policy , the belief state transitions to with probability for each . By contrast, for under policy , the belief state transitions to with probability and transitions to with probability . The regret comes from the latter transition, as illustrated above. We can think of the upper bound of stated in Theorem 5 as the expected total discounted regret of a Markov regret process (MRP). In particular, the MRP is governed by the transition matrix , and the regret at state is .
4 Truncated Approximation
Although the belief MDP is much simpler than general POMDPs, it is still hard to solve exactly due to the countably infinite state space. We thus reformulate the IOMDP as a tree MDP, based on which we propose two finite-state approximations for finding near-optimal policies. In the following sections, we consider $\rho$ to be fixed but arbitrary. We thus drop $\rho$ from notations to simplify the expressions. For example, $V_\rho(b)$ will be written as $V(b)$ and the associated vector form $\mathbf{V}_\rho$ will be written as $\mathbf{V}$.
4.1 Tree MDP
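We next reformulate the IOMDP to ease the exposition of the finite-state approximations. As defined in (3), the belief state at any time is determined by the sufficient history at the controller. In particular, a sufficient history, denoted by $h = (s, a_1, \dots, a_k)$, is a tuple consisting of a state and a sequence of ordered actions, where $s \in \mathcal{S}$ and $a_i \in \mathcal{A}$ for all $1 \le i \le k$. When $k = 0$, we will simply write $h = (s)$. Denote by $\mathcal{H}$ the set of all sufficient histories. Then according to the definition of belief states, $\mathcal{H}$ and $\mathcal{B}$ are connected by function $\phi: \mathcal{H} \to \mathcal{B}$ defined as
$$\phi(h) = e_s P_{a_1} P_{a_2} \cdots P_{a_k},$$
where $\phi(h) = e_s$ for $k = 0$. Note that function $\phi$ is not necessarily a one-to-one mapping. It is possible that $\phi(h) = \phi(h')$ for $h \neq h'$.
The set $\mathcal{H}$ can be hierarchically organized in a tree with $n$ roots and infinite layers. In particular, each element in $\mathcal{S}$ corresponds to a root. For $k \ge 0$, denote by $\mathcal{H}_k$ the set of sufficient histories with $k$ actions in the tuple. Then set $\mathcal{H}_k$ consists of the nodes at layer $k$ of the tree. Every $h \in \mathcal{H}_k$ has $m$ children, each corresponding to adding an extra action at the end of the tuple, i.e., $(h, a)$ with $a \in \mathcal{A}$. We then define $\mathcal{H}_k$ recursively by
$$\mathcal{H}_0 = \{(s) : s \in \mathcal{S}\}, \qquad \mathcal{H}_{k+1} = \{(h, a) : h \in \mathcal{H}_k,\ a \in \mathcal{A}\}.$$
Fig. 1 is an example of sufficient histories organized in a tree structure. Using the terminology in tree structures, $h$ will be referred to as the parent node of $(h, a)$. Except for the root nodes (i.e., those in $\mathcal{H}_0$), each node in the tree has a unique parent. In addition, $\mathcal{H}_j$ and $\mathcal{H}_k$ are disjoint for any $j$ and $k$ with $j \neq k$. For any positive integer $N$, let $\mathcal{H}^{N} = \bigcup_{k=0}^{N} \mathcal{H}_k$ denote the set of nodes from layer 0 to layer $N$.
Considering $\mathcal{H}$ as the state space, we can reformulate the IOMDP as an MDP denoted by $\mathcal{T}$. In particular, the reward function is $r(h, a) = \sum_{i \in \mathcal{S}} [\phi(h)]_i\, r(i, a)$ and the transition kernel is given by (cf. eq.(4))
$$\Pr(h_{t+1} = h' \mid h_t = h, a_t = a) = \begin{cases} \rho\, [\phi(h) P_a]_i, & \text{if } h' = (i),\ i \in \mathcal{S}, \\ 1 - \rho, & \text{if } h' = (h, a), \\ 0, & \text{otherwise}. \end{cases}$$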
We call $\mathcal{T}$ a tree MDP because of the tree structure of its state space. To distinguish from the belief state, the state of the tree MDP at time $t$, denoted by $h_t$, will be referred to as the position state. Let $U$ denote the optimal value function of $\mathcal{T}$. We can easily identify the relationship between the two formulations of the IOMDP, as stated in the lemma below. The proof is provided in Appendix B for the sake of completeness.
Lemma 6
For the two formulations $\mathcal{M}_\rho$ and $\mathcal{T}$,
- 1. $U(h) = V(b)$ if $\phi(h) = b$;
- 2. $U(h) = U(h')$ for any $h, h' \in \mathcal{H}$ satisfying $\phi(h) = \phi(h')$;
- 3. Denote by $\pi^{*}$ and $\mu^{*}$ the optimal policies for $\mathcal{M}_\rho$ and $\mathcal{T}$, respectively. Then $\mu^{*}(h) = \pi^{*}(b)$ if $\phi(h) = b$.
The two formulations, although similar, have different advantages. As shown in the previous section, the belief MDP formulation allows mathematical analysis of the optimal value function, thus establishing some fundamental properties of the IOMDP. By contrast, we find it convenient to develop finite-state approximations and compute near-optimal policies for the problem using the tree MDP formulation.
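The tree MDP of this subsection can be sketched with sufficient histories stored as Python tuples `(s, a_1, ..., a_k)`. The sketch below is illustrative only and is not taken from the paper: the helper names (`phi`, `children`, `parent`, `tree_successors`) and the toy kernel are assumptions, and states and actions are indexed from 0. The toy kernel is chosen so that two different histories induce the same belief, showing that the map from histories to beliefs need not be one-to-one.

```python
def phi(h, P):
    """Belief induced by the sufficient history h = (s, a_1, ..., a_k)."""
    n = len(P[0])
    b = [1.0 if j == h[0] else 0.0 for j in range(n)]
    for a in h[1:]:
        b = [sum(b[i] * P[a][i][j] for i in range(n)) for j in range(n)]
    return b


def children(h, m):
    """The m children of h: one per action appended to the tuple."""
    return [h + (a,) for a in range(m)]


def parent(h):
    """Parent node of h, or None if h is a root (a single state)."""
    return h[:-1] if len(h) > 1 else None


def tree_successors(h, a, P, rho):
    """List of (probability, next_position_state) pairs in the tree MDP.

    A reception (probability rho) jumps to a root; a failure (probability
    1 - rho) moves one layer down to the child h + (a,).
    """
    nxt = phi(h + (a,), P)
    succ = [(rho * nxt[i], (i,)) for i in range(len(nxt)) if rho * nxt[i] > 0.0]
    if rho < 1.0:
        succ.append((1.0 - rho, h + (a,)))
    return succ


# Toy kernel (hypothetical): both actions share the same row for state 0.
P = [
    [[0.5, 0.5], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.9, 0.1]],  # action 1
]
same_belief = phi((0, 0), P) == phi((0, 1), P)
succ = tree_successors((0, 1), 0, P, rho=0.7)
```

In this example `same_belief` is true even though `(0, 0)` and `(0, 1)` are distinct nodes of the tree, which matches the remark that distinct position states may share a belief state.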
4.2 Truncated Approximation
Our first finite-state approximation for $\mathcal{T}$ is inspired by a simple fact: starting from the root layer, the position state can arrive at layer $k$ only if the controller suffers $k$ consecutive transmission failures. The probability of such an event is $(1-\rho)^{k}$, which decreases exponentially as $k$ increases. When the probability of visiting $\mathcal{H}_k$ is small enough, the expected performance regret caused by taking a sub-optimal action in $h \in \mathcal{H}_k$ should be negligible.
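To get a feel for how quickly deep layers become unlikely, the following small Python sketch computes the probability of suffering a given number of consecutive failures and the first layer whose visiting probability drops below a threshold. The function names and the threshold are illustrative assumptions; the paper's own criterion for choosing the truncation depth is given by its error bound, not by this simple rule.

```python
def layer_visit_probability(rho, k):
    """Probability of k consecutive transmission failures, starting at a root."""
    return (1.0 - rho) ** k


def smallest_layer_below(rho, eps):
    """Smallest k such that k consecutive failures have probability below eps."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    k, p = 0, 1.0
    while p >= eps:
        p *= 1.0 - rho
        k += 1
    return k


depth_good_channel = smallest_layer_below(0.7, 1e-3)  # 6, since 0.3**6 < 1e-3 <= 0.3**5
depth_poor_channel = smallest_layer_below(0.5, 1e-2)  # 7, since 0.5**7 < 1e-2 <= 0.5**6
```

A lower reception probability pushes the depth up, which is the reason the truncation level has to grow when the channel is poor.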
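Let us consider the following $N$-truncated approximation.
Definition 7 ($N$-Truncated Approximation, TA($N$))
Given a positive integer $N$, the $N$-truncated approximation of $\mathcal{T}$ is an MDP denoted by $\mathcal{T}_N$. In particular, the state space of $\mathcal{T}_N$ is $\mathcal{H}^{N}$, the set of position states from layer 0 to layer $N$. The action space $\mathcal{A}$, SIRP $\rho$, reward function $r$, and discount factor $\beta$ are identical to those of $\mathcal{T}$. The transition kernel $P_N$ is defined as follows:
- (1) For any $h \in \mathcal{H}_k$ with $k < N$, $P_N(h' \mid h, a) = P(h' \mid h, a)$ for all $h' \in \mathcal{H}^{N}$ and $a \in \mathcal{A}$, where $P$ denotes the transition kernel of $\mathcal{T}$;
- (2) For any , let and , where . Then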
[TABLE]
The state space of TA($N$) consists of position states at the first $N+1$ layers (layers 0 to $N$). Therefore, for any $h \in \mathcal{H}_N$, the child $(h, a)$ is not a state of $\mathcal{T}_N$. We thus redefine the transition probabilities for $h \in \mathcal{H}_N$ in $\mathcal{T}_N$. Fig. 2 compares the transition probabilities, in $\mathcal{T}$ and $\mathcal{T}_N$, of taking a particular action in $h \in \mathcal{H}_N$. TA($N$) can be viewed as an adaptation for practical operations. That is, the controller does not update the position state (and the associated belief state) after it experiences more than $N$ consecutive transmission failures until it receives a new state observation. We will use $U_N$ to denote the optimal value function of $\mathcal{T}_N$. It turns out that $U_N$ is a good approximator of $U$, the optimal value function of $\mathcal{T}$, for part of $\mathcal{H}^{N}$. The theorem below provides a bound for the approximation error of TA($N$).
Theorem 8
For any and positive integer ,
[TABLE]
where .
**Proof** See Appendix B.
The upper bound presented in Theorem 8 is insightful. For , we refer to as the layer index of . According to Theorem 8, the error bound is an increasing function of for any fixed , implying that the difference between the two value functions can be larger for in layers closer to the truncated layer (i.e., ). The monotonicity of the error bound w.r.t. the layer index can be interpreted as follows. The approximation error comes from the truncated layer since the position states in that layer have different transition probabilities from . Starting from the -th layer, the position state can arrive at the truncated layer only if the controller experiences at least consecutive transmission failures, with the associated probability and discounting factor being and , respectively. As a result, when , the error bound reduces to , which can be arbitrarily small if is large enough. While for , the approximation error is dominated by the term , which may be non-negligible.
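A truncated approximation can be solved with ordinary value iteration once its finite state space is enumerated. The Python sketch below does this for a toy problem. It rests on an explicit assumption, since the exact transition rule at the truncated layer is not reproduced here: after a failed reception at the last layer, the position state is assumed to stay where it is, in line with the description that the controller stops updating the position state until a new observation arrives. The function names, the toy numbers, and the use of plain value iteration (rather than the nested value iteration proposed later in the paper) are all choices made for illustration.

```python
from itertools import product


def phi(h, P):
    """Belief induced by the sufficient history h = (s, a_1, ..., a_k)."""
    n = len(P[0])
    b = [1.0 if j == h[0] else 0.0 for j in range(n)]
    for a in h[1:]:
        b = [sum(b[i] * P[a][i][j] for i in range(n)) for j in range(n)]
    return b


def solve_ta(P, r, beta, rho, N, tol=1e-10):
    """Plain value iteration on the position states of layers 0..N."""
    n, m = len(P[0]), len(P)
    states = [(s,) + acts for k in range(N + 1)
              for s in range(n) for acts in product(range(m), repeat=k)]
    belief = {h: phi(h, P) for h in states}
    V = {h: 0.0 for h in states}
    while True:
        V_new = {}
        for h in states:
            b = belief[h]
            best = float("-inf")
            for a in range(m):
                nxt = [sum(b[i] * P[a][i][j] for i in range(n)) for j in range(n)]
                reward = sum(b[s] * r[s][a] for s in range(n))
                # ASSUMPTION: at the truncated layer a failure keeps the state.
                child = h + (a,) if len(h) - 1 < N else h
                q = reward + beta * (
                    rho * sum(nxt[i] * V[(i,)] for i in range(n))
                    + (1.0 - rho) * V[child])
                best = max(best, q)
            V_new[h] = best
        diff = max(abs(V_new[h] - V[h]) for h in states)
        V = V_new
        if diff < tol:
            return V


# Toy 2-state, 2-action example (hypothetical numbers).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
]
r = [[1.0, 0.0], [0.0, 2.0]]
V_ta = solve_ta(P, r, beta=0.9, rho=0.7, N=3)
```

The dictionary `V_ta` maps every position state up to layer 3 to its approximate value. Its size grows geometrically with the truncation level, which is the cost discussed in the next section.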
A near-optimal approximation for the value function of an MDP is useful because it can be used to derive a near-optimal policy. For the tree MDP , the optimal action of is determined by the Bellman equation
[TABLE]
where and . Therefore, if is very close to for all , we can obtain a near-optimal action for every by replacing with in (11). We say that the resulting action is -optimal for in position state if for all .
Denote by an optimal policy for and the associated action for . According to Theorem 8, given any and positive integer , there exists a finite integer such that for all and . Consequently, for any and , is -optimal for . With , we can derive a policy for to control the underlying MDP in the presence of state information losses. In particular,
Definition 9 (TA($N$) Policy)
For any positive integer $N$, let $\mu_N$ denote an optimal policy for $\mathcal{T}_N$ and define the TA($N$) policy for $\mathcal{T}$ as follows:
- (1) For $h \in \mathcal{H}^{N}$, the TA($N$) policy takes action $\mu_N(h)$;
- (2) For $h \notin \mathcal{H}^{N}$, denote by $h'$ the ancestor node of $h$ in $\mathcal{H}_N$. Then the TA($N$) policy takes action $\mu_N(h')$ in position state $h$.
Intuitively, the larger the value of $N$, the closer the TA($N$) policy is to the optimal policy. More analysis about the TA($N$) policy will be provided in the next section.
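Definition 9 amounts to a simple lookup rule, sketched below in Python. Sufficient histories are stored as tuples `(s, a_1, ..., a_k)`, so the ancestor at the truncated layer is just a prefix of the tuple. The function name `ta_policy_action` and the toy policy table are hypothetical and only illustrate the rule.

```python
def ta_policy_action(h, N, mu_N):
    """Action of the truncated-approximation policy in position state h.

    mu_N maps position states of layers 0..N to actions. A history deeper
    than layer N is mapped to its ancestor at layer N, i.e. its first
    N + 1 entries (the state followed by the first N actions).
    """
    layer = len(h) - 1
    ancestor = h if layer <= N else h[:N + 1]
    return mu_N[ancestor]


# Toy policy table for a few position states (hypothetical values).
mu_2 = {(0,): 1, (0, 1): 0, (0, 1, 0): 1}
shallow = ta_policy_action((0, 1), 2, mu_2)          # inside the truncated tree
deep = ta_policy_action((0, 1, 0, 1, 1), 2, mu_2)    # falls back to (0, 1, 0)
```

After more than `N` consecutive failures the controller therefore keeps applying the action of the layer-`N` ancestor until a new state observation resets the history.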
5 High-order Truncated Approximation
Although TA($N$) can be used to determine an $\epsilon$-optimal action for each $h \in \mathcal{H}^{k}$, we typically have $k \ll N$ for small $\epsilon$. Hence a relatively large $N$ is needed to obtain a satisfactory policy. This makes the method inefficient and usually prohibitively complex since the state space of TA($N$) increases exponentially with $N$:
$$|\mathcal{H}^{N}| = n \sum_{k=0}^{N} m^{k} = \frac{n\,(m^{N+1} - 1)}{m - 1} \quad (m \ge 2).$$
As a result, TA($N$) is limited to IOMDPs with large $\rho$ and small action spaces. We next put forth a more efficient and scalable approach to finding near-optimal policies for $\mathcal{T}$. The following concepts are important for developing the approach.
Definition 10
Given a policy , a position state of is said to be reachable under policy if it can be visited within a finite time with a positive probability. A position state is redundant if it is not reachable.
As discussed around (8), every position state $h$ has $m$ children, among which $(h, a)$ can be visited only if the controller takes action $a$ in $h$ and fails to receive the next state. Therefore, using a deterministic policy, each position state has only one reachable child. That is, each layer in the tree contains exactly $n$ reachable position states. If we fix the optimal action for $h$ to be, say $a^{*}$, then $(h, a)$ will never be visited if $a \neq a^{*}$ (except for the initial position states). That is, under the optimal policy, $(h, a^{*})$ is reachable, and the remaining children of $h$ are redundant. As far as computing the optimal policy is concerned, we can omit the redundant position states and remove them from the state space.
While the number of reachable position states at each layer is fixed (i.e., $n$), the total number of position states increases exponentially over layers, and most of them are redundant. This is the source of the inefficiency and poor scalability of TA($N$). The computational complexity can be significantly reduced if we can remove the redundant position states when computing a policy.
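The gap between the full truncated tree and its reachable part can be made concrete with a short Python sketch. It counts all position states in layers 0 to `N` and enumerates the ones reachable from the roots under a deterministic policy. The function names and the toy policy are hypothetical; any deterministic rule would give the same counts.

```python
def total_states(n, m, N):
    """Number of position states in layers 0..N: n roots, m children per node."""
    return n * sum(m ** k for k in range(N + 1))


def reachable_layers(n, depth, policy):
    """Position states reachable from the roots under a deterministic policy.

    Each reachable node has exactly one reachable child: the one obtained by
    appending the action the policy takes in that node.
    """
    layers = [[(s,) for s in range(n)]]
    for _ in range(depth):
        layers.append([h + (policy(h),) for h in layers[-1]])
    return layers


def toy_policy(h):
    """Hypothetical deterministic policy over 4 actions, for illustration."""
    return (h[0] + len(h)) % 4


layers = reachable_layers(n=10, depth=8, policy=toy_policy)
reachable = sum(len(layer) for layer in layers)  # 10 per layer, 9 layers: 90
total = total_states(n=10, m=4, N=8)             # 10 * (4**9 - 1) / 3 = 873810
```

With 10 states, 4 actions, and 8 layers below the roots, only 90 of the 873810 position states are reachable, which is the saving the high-order truncated approximation aims to exploit.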
Let us consider the following idea to compute -optimal actions for . Instead of solving for TA() with a large , we first select an so that is large enough for computing -optimal actions for all by solving TA(). Upon determining an -optimal action for every , we can identify the reachable position states in . The next step is to compute -optimal actions for the reachable position states in —we do not compute actions for the redundant position states since they will never be reached. For this purpose, we construct a modified TA(, which will be referred to as TA(), by removing the redundant position states in and their descendants at other layers from the state space of TA(). Solving for TA() is considerably easier than TA() due to the reduction of state space, but it still provides -optimal actions for the reachable position states in . Doing so repeatedly, we can compute -optimal actions for reachable position states layer by layer. We call this method high-order truncated approximation, where the -th order truncated approximation refers to repeating the above procedure times and will be denoted by TA(). We will justify this method after providing a formal definition below.
We formally define TA() in a recursive way for and a fixed . In particular, TA() is an MDP denoted by with . Let stand for the optimal policy for . As discussed before, with a proper , is an -optimal action for in position state . Our aim is to define a series of such that, for every reachable in , is an -optimal action for . Meanwhile, the redundant position states in and their descendants at other layers will be excluded from the state space of to reduce computation complexity.
Mathematically, is an MDP with state space
[TABLE]
where is the set of reachable position states in under the policy :
[TABLE]
and is the set of descendants of at layer :
[TABLE]
Fig. 3 gives an example of the state space of TA() for . The action space for any is given by
[TABLE]
The first line of the above definition means to fix the action for the already determined reachable states at layers 0 to . With this restriction, we have for all . The SIRP , reward function , and discount factor are identical to those of . The transition kernel is defined as follows:
- (1)
For , for any and ;
- (2)
For , let and , where . Then
[TABLE]
This completes the definition of . Note that for all and that can be determined with . Starting from and , we compute the optimal policy for and then use and to identify (the set of reachable position states in ); then the descendants of at the following layers (i.e., for ) can be identified. On this basis, we construct to compute the actions for position states in . Doing so iteratively with a proper yields the optimal actions of the reachable position states layer by layer (see Theorem 13). We can stop when is small enough and control using the TA() policy derived from :
Definition 11 (TA() Policy)
For any positive integers and , define the TA() policy for as follows:
- (1)
For , the TA() policy takes action ;
- (2)
For , denote by the ancestor node of in . Then the TA() policy takes action in position state .
We next establish the relationship between and , and then analyze the TA() and TA() policies. Denote by the optimal value function of . The following lemma is useful.
Lemma 12
For any fixed , if and have the same optimal action for every , then for any .
**Proof ** See Appendix C.
For the tree MDP , the Bellman equation (11) can be expressed as , where is the state-action value function:
[TABLE]
Denote by an optimal action of . Let
[TABLE]
Since the action space is a finite set, we have for any . As mentioned before, given the optimal policy for and the associated optimal value function , we say that is an -optimal action for in position state if for all . The following fact can be easily verified:
Fact 1 *For any in the tree MDP , if is an -optimal action with , then, in fact, is optimal for . *
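The reasoning behind Fact 1 can be illustrated numerically. The sketch below assumes that the threshold is the smallest gap between the optimal state-action value and the value of any suboptimal action at a position state, and that an action is -optimal when its value is within that tolerance of the best value. The names `action_gap` and `eps_optimal_actions` and the sample values are hypothetical and serve only as an illustration.

```python
import numpy as np

def action_gap(q_values):
    """Smallest gap between the best value and any strictly worse value."""
    q = np.asarray(q_values, dtype=float)
    best = q.max()
    worse = q[q < best]
    return np.inf if worse.size == 0 else best - worse.max()

def eps_optimal_actions(q_values, eps):
    """Indices of actions whose value is within eps of the best value."""
    q = np.asarray(q_values, dtype=float)
    return set(np.flatnonzero(q >= q.max() - eps).tolist())

# Hypothetical state-action values at a single position state.
q = [1.0, 0.7, 1.0, 0.4]
gap = action_gap(q)
optimal = eps_optimal_actions(q, 0.0)

# A tolerance below the gap admits only optimal actions.
below_gap = eps_optimal_actions(q, 0.5 * gap)
# A tolerance above the gap lets a suboptimal action slip in.
above_gap = eps_optimal_actions(q, 1.5 * gap)
```

With a tolerance below the gap, the set of -optimal actions coincides with the set of optimal actions, which is the content of Fact 1 under the stated assumption.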
According to Fact 1, define
[TABLE]
We show in the following theorem that, with a proper , solving for yields -optimal (thus optimal) actions for all reachable position states at layers 0 to . Recall that and denote the optimal policies for and , respectively.
Theorem 13
For any positive integer , given an satisfying
[TABLE]
where . Then,
- (1)
* is an optimal action for any in .*
- (2)
* is an optimal action for any in .*
**Proof ** See Appendix C.
Theorem 13 shows that both TA() and TA() can be used to obtain optimal actions for reachable position states at layers 0 to . However, it is easy to see that TA() is much more efficient than TA(). The cardinality of the state space of is increasing linearly with , that is,
[TABLE]
where . By contrast, (12) shows that increases exponentially with for any fixed . Although solving requires solving for all , the overall computational complexity of TA() is still significantly lower than that of TA(). In addition, if we use a value iteration algorithm to solve , then the optimal value function of is a good starting point for the value iteration, since is close to for . It is well known that value iteration converges quickly when the starting point is near the optimum.
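To give a feel for the exponential growth mentioned above, the following sketch counts the position states of a truncated tree. It assumes that every root state has one child per action at each layer, so that layer k holds |S||A|^k position states; this counting rule and the function name `full_tree_size` are assumptions made for illustration, not a restatement of (12). The sizes 9 and 4 match the state and action counts of the example MDP in Appendix F.

```python
def full_tree_size(num_states, num_actions, depth):
    """Number of position states in layers 0..depth of a full tree.

    Assumes layer k contains num_states * num_actions**k position states.
    """
    return num_states * sum(num_actions ** k for k in range(depth + 1))

# Growth of the truncated state space as the truncation depth increases.
sizes = {depth: full_tree_size(9, 4, depth) for depth in (2, 4, 6, 8)}
for depth, size in sizes.items():
    print(f"depth={depth}: {size} position states")
```

Under this assumption, each additional layer multiplies the size of the deepest layer by the number of actions, which is why pruning redundant position states before going deeper pays off.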
6 Nested Value Iteration
We have shown in the previous section that solving for TA() and TA() could offer near-optimal policies for the IOMDP. This section presents a variant of the value iteration algorithm, called nested value iteration (NVI), to compute the optimal policy for TA() efficiently. It also applies to TA().
The algorithm is inspired by the special structure of the tree MDP. As indicated by (11), position states at the root layer (i.e., ) play a central role in the Bellman equation since their values directly affect the values of all position states. Intuitively, for a non-zero , the larger the layer index of , the weaker the influence of ’s value on the overall value function. The basic idea of NVI is to update the values of important position states more frequently than the less important ones, as shown in Algorithm 1.
In particular, based on the tree structure of the state space of , we define a collection of nested sets as follows:
[TABLE]
Note that for all and the largest set is —the entire state space of . As commented in Algorithm 1, NVI consists of outer and inner iterations. Each outer iteration consists of inner iterations, where the -th inner iteration only updates the values of position states in set (). Consequently, position states in are updated in every inner iteration, while position states in are updated once in every outer iteration (i.e., every inner iterations).
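The outer and inner loop structure described above can be sketched as follows. This is a minimal illustration on a generic finite MDP rather than a reproduction of Algorithm 1: the array layout (`P` of shape actions × states × states, `r` of shape states × actions), the stopping rule based on the Bellman residual of the full-set update, the layer labels, and all function names are assumptions. The nested sets are taken to be "all states up to a given layer".

```python
import numpy as np

def backup(P, r, gamma, V, states):
    """Bellman optimality backup restricted to the given state indices."""
    q = r[states] + gamma * np.einsum("asj,j->sa", P[:, states, :], V)
    return q.max(axis=1)

def nested_value_iteration(P, r, gamma, nested_sets, tol=1e-10, max_outer=10_000):
    """nested_sets[0] is the smallest set; nested_sets[-1] must hold all states."""
    num_states = r.shape[0]
    assert len(nested_sets[-1]) == num_states
    V = np.zeros(num_states)
    for outer in range(1, max_outer + 1):
        residual = 0.0
        # Descending order: full state space first, smallest (root) set last.
        for m in range(len(nested_sets) - 1, -1, -1):
            states = nested_sets[m]
            new_values = backup(P, r, gamma, V, states)
            if m == len(nested_sets) - 1:
                # Bellman residual of the full update, used as stopping rule.
                residual = np.max(np.abs(new_values - V[states]))
            V[states] = new_values
        if residual < tol:
            return V, outer
    return V, max_outer

# Toy random MDP with a hypothetical layer label per state (0 = root layer).
rng = np.random.default_rng(0)
S, A, gamma = 12, 3, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
layer = np.repeat(np.arange(3), 4)
nested_sets = [np.flatnonzero(layer <= m) for m in range(3)]

V_nvi, outer_iterations = nested_value_iteration(P, r, gamma, nested_sets)
```

In this sketch the states in the smallest set are refreshed in every inner iteration, while the states that appear only in the largest set are refreshed once per outer iteration.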
The following theorem is useful for understanding the NVI algorithm.
Theorem 14
Let denote the sequence of functions generated by the nested value iteration algorithm with input . Then
- (1)
* converges in max norm to the optimal value function ;*
- (2)
Let , then for all ,
[TABLE]
where and .
**Proof ** See Appendix D.
Theorem 14 is useful for interpreting the advantage of NVI over standard value iteration. For , the convergence rate of NVI is identical to that of standard value iteration in general cases (Puterman, 1994), that is,
[TABLE]
However, the computation complexity (in terms of the number of states to be updated) of consecutive inner iterations is significantly lower than that of standard iterations. That is, NVI speeds up the convergence of position states in the upper layers of the tree structure. The fast convergence of root position states, in turn, would speed up the convergence of other position states due to the strong influence of root position states on the overall value function. For example, the value of is updated once in each outer iteration; its convergence rate over outer iterations, according to Theorem 14, is , which is smaller than the rate in standard value iteration.
Although Algorithm 1 is a natural form of NVI, variants of Algorithm 1 that define the collection of nested sets differently may also perform well. The basic principles are: (i) the root position states in the tree should be contained in the smallest set so that they can be updated more frequently than others; (ii) the largest set should be the entire state space so that all states can be updated at least once in each outer iteration. It is also worth noting that the order of inner iterations within each outer iteration matters. We call the current order descending because the updating set shrinks from to in each outer iteration. Note that adopting the ascending order may lead to a worse convergence rate. This can be seen by an argument similar to the one used in the proof of Theorem 14. We do not provide details here since it is not the focus of this paper. Numerical results will be provided in the next section to validate the importance of the descending order of inner iterations.
Finally, extending NVI to solve TA() is straightforward. For with , the action of is already determined; hence there is no need to update the values of these position states. The values of root position states are identical to those of , which can be used to update the values of other position states. We can thus define the collection of nested sets as follows:
[TABLE]
Replacing with in Algorithm 1, NVI can be used to solve TA(). Our previous discussions also imply that using the optimal value function of as the starting point would help NVI converge quickly to the optimal value function of , as the values of root position states are already optimum.
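The benefit of a good starting point can be seen in a small experiment. The sketch below uses plain value iteration on a random MDP as a stand-in for NVI, and a slightly perturbed copy of the optimal value function as a stand-in for the value function of the previous-order approximation; both substitutions, the array layout, and all names are assumptions made for illustration.

```python
import numpy as np

def value_iteration(P, r, gamma, V0, tol=1e-6, max_iter=100_000):
    """Standard value iteration; returns the value function and iteration count."""
    V = np.array(V0, dtype=float)
    for k in range(1, max_iter + 1):
        V_next = (r + gamma * np.einsum("asj,j->sa", P, V)).max(axis=1)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, k
        V = V_next
    return V, max_iter

rng = np.random.default_rng(1)
S, A, gamma = 20, 3, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
r = 1.0 + rng.random((S, A))  # rewards in [1, 2)

# Cold start from the zero function.
V_star, cold_iters = value_iteration(P, r, gamma, np.zeros(S))

# Hypothetical warm start: the optimum perturbed by at most 1e-3 per state.
V_warm = V_star + 1e-3 * (2.0 * rng.random(S) - 1.0)
_, warm_iters = value_iteration(P, r, gamma, V_warm)

print(f"cold start: {cold_iters} iterations, warm start: {warm_iters} iterations")
```

Because value iteration contracts the error by a factor of the discount factor per sweep, the iteration count scales with the logarithm of the initial error, so a starting point that is already close to the optimum saves a constant number of sweeps.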
7 Numerical Results
We present numerical results to demonstrate the effectiveness of our methods. Throughout this section, the state information transmissions in all experiments were simulated by Bernoulli trials generated in MATLAB on a PC. At each time step, the controller of the IOMDP receives the current state information only if the outcome of the Bernoulli trial is 1 (the other outcome is 0). The probability of the outcome being 1 in each Bernoulli trial corresponds to the transmission success probability (i.e., SIRP) of the experiment.
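A Bernoulli lossy channel of this kind can be simulated in a few lines. The sketch below is written in Python with NumPy rather than MATLAB, and the function names, the SIRP value of 0.7, and the convention that the counter starts from zero before the first reception are assumptions. The counter of steps since the last reception is included on the assumption that it matches the layer index of the current position state.

```python
import numpy as np

def simulate_receptions(sirp, horizon, seed=0):
    """One Bernoulli trial per time step; True means the state was received."""
    rng = np.random.default_rng(seed)
    return rng.random(horizon) < sirp

def steps_since_last_reception(received):
    """Steps elapsed since the most recent reception (0 when just received)."""
    ages = np.empty(len(received), dtype=int)
    age = 0
    for t, ok in enumerate(received):
        age = 0 if ok else age + 1
        ages[t] = age
    return ages

received = simulate_receptions(sirp=0.7, horizon=20_000)
empirical_sirp = received.mean()  # empirical success rate of the channel
ages = steps_since_last_reception(received)
```

The empirical success rate computed this way is also how the SIRP could be measured in a stationary environment, by counting successful transmissions over a number of trials.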
We first compare NVI with standard value iteration (VI) in a group of randomly generated MDPs, as shown in Table 1. The results of reverse NVI (R-NVI) are added to demonstrate the importance of the descending order of inner iterations. Specifically, R-NVI refers to NVI with ascending order for inner iterations (i.e., the -th inner iteration of each outer iteration updates states in ), as discussed in Section 6. For each experiment in Table 1, the transition matrices and the reward function of the underlying MDP were generated randomly with given state and action spaces. We then constructed the associated TA() and computed its optimal policy using the three algorithms. In all experiments, NVI and R-NVI converge to the same value function as VI, leading to the same policy. The results show that NVI converges significantly faster than the other two algorithms. We also list the number of iterations needed for each algorithm to converge; for NVI and R-NVI, the numbers of inner iterations are counted. As we can see, the number of iterations of NVI is slightly larger than that of VI. Note that the states in are updated every inner iterations. This means that NVI finds the optimal values for position states in with far fewer updates than VI. The experiments verify our analysis under Theorem 14 that the fast convergence of root position states would speed up the convergence of other position states, illustrating why NVI is faster than VI. By contrast, the number of iterations of R-NVI is considerably larger than that of VI (the former is exactly times that of the latter in all experiments). In summary, NVI is more efficient than VI, while R-NVI is not. In the remaining part of this section, TA() and TA() are solved using NVI in all experiments.
We next evaluate the effectiveness of TA() and TA() for solving IOMDPs. The experiments were carried out in an MDP with . The parameters of the underlying MDP are provided in Appendix F. As shown in Table 2, we consider four values of . For each , we solved the associated TA() using NVI and obtained the TA() policy to control the IOMDP with SIRP . We then collected the total discounted reward over a horizon of time steps. The value of each group is the empirical average total discounted reward of independent runs. The results of TA() and TA() were obtained by the same procedure. In all experiments, the policies generated by TA() and TA() are identical; hence their values are equal. We also present the computation time of each method in the table, where the computation time of TA() includes the time of solving TA() for all . The results show that TA() with a small cannot generate a satisfactory policy when is small, but we can obtain a better policy by increasing . However, the computation time of TA() increases explosively with . Fortunately, TA() can generate the same policy as TA() using much less computation time.
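The evaluation metric described above, an empirical average of total discounted rewards over independent runs, can be sketched as follows. The function names and the sample reward sequences are hypothetical; the horizon, number of runs, and discount factor used in the experiments are not reproduced here.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward of one run: sum over t of gamma**t * reward_t."""
    total, weight = 0.0, 1.0
    for reward in rewards:
        total += weight * reward
        weight *= gamma
    return total

def empirical_value(reward_runs, gamma):
    """Average total discounted reward over independent runs."""
    returns = [discounted_return(run, gamma) for run in reward_runs]
    return sum(returns) / len(returns)

# Two hypothetical runs of three time steps each.
runs = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
average = empirical_value(runs, gamma=0.5)
```

Each run would be produced by simulating the IOMDP under the policy being evaluated; only the aggregation step is shown here.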
We carried out more experiments for the last group () of Table 2 to show how and affect the performance of TA() and TA(). Fig. 4(a) shows that the value of the TA() policy increases with and eventually becomes stable after . This verifies our analysis in Section 5: the larger the value of , the more position states for which the TA() policy is optimal, and hence the closer the TA() policy is to the optimal policy. Given an initial position state in , the value difference between the TA() policy and the optimal policy diminishes quickly as increases. The disadvantage of TA() is also obvious: the computation time increases exponentially with . In Fig. 4(b), we fix the value of and examine how the parameter affects the value and the computation time of the TA() policy. Three different are compared. TA() shows excellent performance in this experiment: it achieves the same value as TA() with much less computation time. It is worth noting that the computation time of TA() only increases linearly with , showing good scalability of this method.
The above experiments show that, compared with TA(), TA() is much less efficient and less scalable. However, we can show that TA() with small is good enough in some scenarios. The experiments were conducted in IOMDPs generated randomly as before. We set the SIRP to be and the initial position state to belong to in all these experiments. Table 3 shows the values of multiple TA() policies, where the value of each group is obtained by the same procedure as in Table 2. In these experiments, the performance of TA(2) is (nearly) the same as that of TA(6), implying that TA(2) is already near-optimal. Considering that TA() is easier to implement than TA(), it may be a good choice to use TA() when a small is good enough.
8 Conclusion
This paper studied intermittently observable MDPs. We assumed that the transmissions of state information follow a Bernoulli lossy process. The problem of selecting actions for the MDP in the presence of state information losses has been investigated systematically. Two methods of formulating the IOMDP were used for different purposes. We first formulated the problem as a belief MDP to analyze the effect of state information losses. We proved that the expected total discounted reward is a continuous and increasing function of the state information reception probability. An upper bound for the performance regret caused by state information losses is derived. Then, we reformulated the IOMDP as a tree MDP to develop finite-state approximations, TA() and TA(), for the problem. We showed analytically that TA() could well approximate the optimal value function for part of the states, thus allowing it to generate a near-optimal policy. Inspired by the approximation error bound of TA(), we further developed TA(), which is more efficient than TA() because it excludes the redundant states when computing policies. In addition, we also proposed a nested value iteration algorithm for TA() and TA(). Convergence analysis is provided for interpreting the efficiency of the algorithm. Finally, we validated the effectiveness of the proposed methods by numerical results.
From the perspective of communication, the transmission model of state information depends on the environment and the communication protocol adopted by the system. The Bernoulli process considered in this paper is a proper transmission model if the system adopts a time-division multiple access (TDMA) protocol and operates in a stationary environment. In this case, the SIRP can be easily measured by counting the empirical success rate of a number of transmissions. A natural direction of extending the IOMDP framework studied in this paper is to generalize the transmission model. (1) If the communication environment is time-varying (e.g., the sensor or the controller moves significantly over time), then the SIRP may change dynamically over time. For problems of this kind, we may need to take the worst-case performance into account when making decisions. The structural results established in Section 3 may be exploited to analyze these problems. (2) If the system adopts other protocols, we may need a more complicated transmission model (e.g., a Markov model is usually used for transmissions over WiFi networks). These present interesting questions for future research.
Acknowledgments and Disclosure of Funding
This work was supported in part by the General Research Funds (Project No. 14200221) established under the University Grant Committee of the Hong Kong Special Administrative Region, China.
Appendix A
A.1 Proof of Theorem 2
In this part we prove Theorem 2 from Section 3. For notational simplicity, will stand for the probability of the belief state transitioning from to , i.e.,
[TABLE]
**Proof **[Theorem 2] Consider . Suppose that is an optimal policy for . Let denote the transition matrix of under policy . Then
[TABLE]
Note that policy can also be applied to the belief MDP . We have
[TABLE]
[TABLE]
It follows that
[TABLE]
Note that exists and is non-negative (i.e., all entries are non-negative) because is a stochastic matrix. Specifically,
[TABLE]
Let . Just like , can be viewed as a function on set . For an arbitrary , let . Then we can express in component notation as
[TABLE]
Note that and that is a convex function of . It follows that
[TABLE]
Therefore, for any . The product of a non-negative matrix and a non-negative vector is clearly non-negative. That is,
[TABLE]
It follows immediately that . This completes the proof.
A.2 Proof of Theorem 5
This part proves Theorem 5 from Section 3.
**Proof **[Theorem 5] Consider an arbitrary . Let and denote an optimal policy for . It follows from Theorem 2 that . Therefore,
[TABLE]
where denotes the max norm. Using a similar argument as in (18) yields
[TABLE]
Let . Then for any , let , we have
[TABLE]
Since the reward function is bounded, the value function is also bounded. The convexity of implies that the term within the square brackets is non-negative. As a result,
[TABLE]
is non-negative and bounded. It follows that
[TABLE]
where is the vector with all elements being 1. Clearly, for any ,
[TABLE]
Using a similar argument could show that the above conclusion holds for . This proves the continuity of .
The bound for is obtained by substituting into (21) and (22). Using Lemma 4 yields the expression of presented in Theorem 5.
Appendix B
B.1 Proof of Lemma 6
**Proof ** For any and , let , , and . Note that
[TABLE]
Then by definition, we have
[TABLE]
The Bellman equation for can be written as
[TABLE]
Note also that the Bellman equation for is
[TABLE]
Clearly, (25) and (26) are of the same form for any . Since the Bellman equation has a unique solution, we conclude that if . This proves statement 1. Statement 2 follows immediately: for any satisfying , we have . In addition, statement 1 implies that for all and for any . Then statement 3 can be verified using (25) and (26).
B.2 Proof of Theorem 8
**Proof ** We first construct an auxiliary MDP, denoted by to facilitate the analysis of the approximation error of TA(). In particular, the belief state space , action space , SIRP , reward function , and discount factor are identical to those of . The transition kernel is defined as follows:
- (1)
For any with , , where and ;
- (2)
For any , let and , where . Then
[TABLE]
On the one hand, since and have the same state and action spaces, they share the same set of deterministic stationary policies, say . On the other hand, note that and that the probability of the position state transitioning from to is 0. If the initial position state is , then reduces to . Denote by the value function of under policy and the optimal value function of . Then
[TABLE]
As we will show below, it is more convenient to compare two MDPs with the same state and action spaces. Hence will serve as a bridge that connects and so that we can characterize the approximation error by comparing with .
For any , using the same argument as in (18) yields
[TABLE]
Let and . Then is given by
[TABLE]
The RHS of (27) can be viewed as the expected total discounted reward of a Markov reward process whose reward function is and the Markov chain is governed by the transition kernel . Denote by the Markov reward process. We then can write in component form as
[TABLE]
Substituting (28) into (29) yields
[TABLE]
For any , it can be proved that (see Appendix E for the proof)
[TABLE]
Note that the above quantity is independent of policy . Since for all and , we have
[TABLE]
The inequality follows from the fact that and . Define
[TABLE]
Then for all . Therefore, for any ,
[TABLE]
Now, denote by and the optimal policies for and , respectively. From (31), for any ,
[TABLE]
Putting together (32) and (33) yields
[TABLE]
Since for all , the desired result follows immediately.
Appendix C
In this Appendix we prove theoretical results from Section 5.
C.1 Proof of Lemma 12
**Proof ** Let denote the optimal policy for . According to our construction, and having the same optimal action for every means that
[TABLE]
In addition, each in MDP has only one available action, i.e., .
The quantity is the expected total discounted reward generated by with policy and initial position state . We thus can express as
[TABLE]
For with initial position state and being governed by policy , is a set of redundant position states. Therefore, for any and ,
[TABLE]
Denote by a policy for with for any . Then (35) and (36) imply that
[TABLE]
The second line of (37) is the expected total discounted reward generated by with policy and initial position state . It follows that for all .
We next show for all . Denote by the set of all deterministic policies for . Given the optimal policy , define . Then for any ,
[TABLE]
The desired result follows immediately.
C.2 Proof of Fact 1
Let
[TABLE]
Then being an -optimal action implies that
[TABLE]
Note that . Let . Suppose that ; then we must have
[TABLE]
where the first and third inequalities follow from (38). The above implies that
[TABLE]
Therefore, if . This completes the proof.
C.3 Proof of Theorem 13
**Proof ** For , define
[TABLE]
Then according to Theorem 8, the approximation error of for is bounded by
[TABLE]
where . The penultimate inequality follows from the fact that decreases with :
[TABLE]
Denote by the optimal policy for . Since the error bound increases monotonically with the layer index, (39) also holds for any . Then according to Fact 1, we have
- (a)
is an optimal action for in , .
- (b)
for , .
- (c)
for , .
Note that (b) follows from (a); (c) was discussed when defining TA(). Clearly, statement (1) of the theorem is a special case of (a).
We next prove statement (2) using (a), (b), and (c). Note that . It follows that for ; that is, and have the same optimal action for every . Then according to Lemma 12, for all . This fact, together with (39) (the case of ), implies that is an optimal action for in . That is, for . Repeating the above argument until the case of yields statement (2).
Appendix D
This Appendix proves Theorem 14 from Section 6.
**Proof ** For any bounded real-valued function , define the operator as
[TABLE]
where . That is, is the operator corresponding to the -th inner iteration within an outer iteration. It is well-known that is a contraction mapping and
[TABLE]
where is the greedy action with respect to . For and , substituting into (40) yields
[TABLE]
It can be proved by a similar argument that the above inequality holds for any and .
Let count the overall number of inner iterations; then is the result of the -th inner iteration of the -th outer iteration. We have
[TABLE]
Let for an arbitrary . Then
[TABLE]
According to (41), for ,
[TABLE]
For the first term of (42),
[TABLE]
Using the argument repeatedly yields
[TABLE]
For the second term of (42), if . We thus have
[TABLE]
Statement (2) can be obtained from (42)-(44): for ,
[TABLE]
The convergence property follows immediately.
Appendix E
This part proves eq. (30) from the proof of Theorem 8 (Appendix B.2). Given that . For , define
[TABLE]
Since is governed by the transition kernel , we have
[TABLE]
[TABLE]
It follows that, for ,
[TABLE]
Then we can derive
[TABLE]
Given that , it is easy to verify that
[TABLE]
Since , the first term in (50) is 0. It follows from (50) and (51) that
[TABLE]
This completes the proof.
Appendix F
This Appendix provides an introduction to the experimental MDP in Section 7 (Table 2 and Fig. 4).
Consider a system with a car moving in the map shown in Fig. 5. The car has four actions at any position: move left, move right, move up, and move down. The aim is to control the car so that it keeps moving in a clockwise direction in the white region (). The task terminates once the car enters the gray region (including the gray region in the central part and the gray region in the marginal part).
Given a positive integer , we use to denote the set of integers between 1 and . Mathematically, the MDP is formulated as follows:
- •
State space , where states 1-8 correspond to positions 1-8 in the white region and state 9 corresponds to the gray region (terminal state).
- •
Action space : action 1, move left; action 2, move down; action 3, move right; action 4, move up.
- •
Transition kernel. For any state , denote by the clockwise action of state (that is, move left in states 1 and 2, move down in states 3 and 4, move right in states 5 and 6, move up in states 7 and 8) and the anti-clockwise action. Denote by and the state and action at time , respectively. The transition kernel is defined as follows: (1) for , if and otherwise; (2) for ,
[TABLE]
[TABLE]
[TABLE]
- •
Reward function: if and otherwise.
- •
Discount factor: .
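The clockwise action described in the transition kernel item can be written compactly. The sketch below uses the action numbering given in the action space item (1 left, 2 down, 3 right, 4 up); the function name `clockwise_action` and the closed-form expression are illustrative choices, and the transition probabilities themselves are not reproduced.

```python
ACTIONS = {1: "left", 2: "down", 3: "right", 4: "up"}

def clockwise_action(state):
    """Clockwise action for a white-region state (states 1-8).

    States 1-2 map to left, 3-4 to down, 5-6 to right, 7-8 to up.
    """
    if not 1 <= state <= 8:
        raise ValueError("the clockwise action is defined for states 1-8 only")
    return (state + 1) // 2

# Tabulate the clockwise action name for every white-region state.
clockwise_table = {s: ACTIONS[clockwise_action(s)] for s in range(1, 9)}
```

State 9 is excluded because it represents the gray region, where the task has already ended.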
References
- S. Adlakha, S. Lall, and A. Goldsmith. Networked Markov decision processes with delays. IEEE Transactions on Automatic Control, 57(4):1013–1018, 2012. doi: 10.1109/TAC.2011.2168111.
- M. Araya, O. Buffet, V. Thomas, and F. Charpillet. A POMDP extension with belief-dependent rewards. Advances in Neural Information Processing Systems, 23, 2010.
- K. J. Astrom. Optimal control of Markov decision processes with incomplete state estimation. J. Math. Anal. Applic., 10:174–205, 1965.
- J. L. Bander and C. C. White III. Markov decision processes with noise-corrupted and delayed state observations. Journal of the Operational Research Society, 50(6):660–668, 1999.
- D. Brooks and C. T. Leondes. Markov decision processes with state-information lag. Operations Research, 20(4):904–907, 1972.
- A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. arXiv preprint arXiv:1302.1525, 2013.
- H.-T. Cheng. Algorithms for partially observable Markov decision processes. PhD thesis, 1988. URL https://open.library.ubc.ca/collections/ubctheses/831/items/1.0098252
- K. Hsu and S. Marcus. Decentralized control of finite state Markov processes. IEEE Transactions on Automatic Control, 27(2):426–431, 1982.
