O$^2$TD: (Near)-Optimal Off-Policy TD Learning

Bo Liu; Daoming Lyu; Wen Dong; Saad Biaz

arXiv:1704.05147·cs.LG·April 21, 2017

TL;DR

This paper introduces two novel algorithms for off-policy temporal difference learning that aim to more accurately approximate the true value function, addressing limitations of existing methods and providing near-optimal solutions with linear computational cost.

Contribution

It proposes a batch algorithm for optimal off-policy prediction and a linear-cost near-optimal online algorithm, along with a new perspective on emphatic TD learning.

Findings

1. The batch algorithm effectively approximates the true value function.

2. The online algorithm achieves near-optimal performance with linear computational cost.

3. A new perspective connects off-policy optimality with stability in TD learning.

Abstract

Temporal difference learning and Residual Gradient methods are the most widely used temporal difference based learning algorithms; however, it has been shown that none of their objective functions is optimal w.r.t approximating the true value function $V$ . Two novel algorithms are proposed to approximate the true value function $V$ . This paper makes the following contributions: (1) A batch algorithm that can help find the approximate optimal off-policy prediction of the true value function $V$ . (2) A linear computational cost (per step) near-optimal algorithm that can learn from a collection of off-policy samples. (3) A new perspective of the emphatic temporal difference learning which bridges the gap between off-policy optimality and off-policy stability.

Equations

$$V^{\pi}=T^{\pi}V^{\pi}=R^{\pi}+\gamma P^{\pi}V^{\pi} \qquad (1)$$

$$\Delta:=(I-\gamma P^{\pi})\Phi=L^{\pi}\Phi \qquad (2)$$

$$\Pi_{\Phi}^{X}=\Phi(X^{\top}\Phi)^{-1}X^{\top} \qquad (3)$$

$$\hat{v}=\Pi_{\Phi}^{X}T^{\pi}(\hat{v}) \qquad (4)$$

$$\hat{v}=\Pi_{\Phi}^{X}T^{\pi}(\hat{v})=\Pi_{\Phi}^{L^{\pi\top}X}V \qquad (5)$$

$$v^{*}=\Pi_{\Phi}^{X^{*}}Tv^{*} \qquad (6)$$

$$X^{*}=(L^{\pi\top})^{-1}\Xi\Phi \qquad (7)$$

$$(L^{\pi\top})X^{*}=\Xi\Phi \qquad (8)$$

$$\Delta^{\top}X^{*}=\Phi^{\top}(L^{\pi\top})\big((L^{\pi\top})^{-1}\Xi\Phi\big)=\Phi^{\top}\Xi\Phi=C \qquad (9)$$

$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}]=L^{\pi}\Phi \qquad (10)$$

$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}]^{\top}X^{*}=C,\qquad X^{*\top}\mathbb{E}_{\pi_{b}}[\hat{\Delta}]\theta^{*}=X^{*\top}R \qquad (11)$$

$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}]=\sum_{a_{i}}\pi_{b}(a_{i}\mid s_{i})\frac{\pi(a_{i}\mid s_{i})}{\pi_{b}(a_{i}\mid s_{i})}(\Delta\phi_{i})^{\top}=\sum_{a_{i}}\pi(a_{i}\mid s_{i})(\Delta\phi_{i})^{\top}=L^{\pi}\Phi \qquad (12)$$

$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}^{\top}]X^{*}=\Phi^{\top}\Xi\Phi=C \qquad (13)$$

$$\begin{aligned}\Phi\theta^{*}&=\Phi(X^{*\top}\Phi)^{-1}X^{*\top}(R+\gamma\Phi^{\prime}\theta^{*})\\ \theta^{*}&=(X^{*\top}\Phi)^{-1}X^{*\top}(R+\gamma\Phi^{\prime}\theta^{*})\\ X^{*\top}\Phi\theta^{*}&=X^{*\top}(R+\gamma\Phi^{\prime}\theta^{*})\\ X^{*\top}(\Phi-\gamma\Phi^{\prime})\theta^{*}&=X^{*\top}R\\ X^{*\top}\Delta\theta^{*}&=X^{*\top}R\end{aligned} \qquad (14)$$

$$\hat{X}=\arg\min_{X}\frac{1}{2}||\hat{\Delta}^{\top}X-\hat{C}||_{F}^{2} \qquad (15)$$

$$\hat{X}=(\hat{\Delta}\hat{\Delta}^{\top})^{-1}(\hat{\Delta}C) \qquad (16)$$

$$\hat{\theta}=\arg\min_{\theta}||\hat{X}^{\top}(\hat{\Delta}\theta-\hat{R})||_{\xi}^{2} \qquad (17)$$

$$\hat{\theta}=(\hat{X}^{\top}\hat{\Delta})^{-1}\hat{X}^{\top}R \qquad (18)$$

$$\hat{X}=\Omega\Xi\Phi \qquad (19)$$

$$\hat{\Omega}=\arg\min_{\Omega}\frac{1}{2}||\hat{\Delta}^{\top}\Omega\Xi\Phi-\hat{C}||_{F}^{2},\quad \text{s.t.}\;\Omega_{ij}=0,\;i\neq j \qquad (20)$$

$$\hat{\Delta}^{\top}\Omega\Xi\Phi=\mathbb{E}[\rho_{i}\omega_{i}\phi_{i}\Delta\phi_{i}^{\top}] \qquad (21)$$

$$\forall i,\quad \omega_{i}=\arg\min_{\omega}||\phi_{i}(\omega\rho_{i}\Delta\phi_{i}-\phi_{i})^{\top}||_{F}^{2} \qquad (22)$$

$$\omega_{i}=\arg\min_{\omega}||\phi_{i}(\omega\rho_{i}\Delta\phi_{i}-\phi_{i})^{\top}||_{*} \qquad (23)$$

$$\omega_{i}=\frac{\Delta\phi_{i}^{\top}\phi_{i}}{\rho_{i}\Delta\phi_{i}^{\top}\Delta\phi_{i}} \qquad (24)$$

$$\theta_{i+1}=\theta_{i}+\alpha_{i}\rho_{i}\omega_{i}\delta_{i}\phi_{i} \qquad (25)$$

Full text

O2TD: (Near)-Optimal Off-Policy TD Learning

Bo Liu, Auburn University

Daoming Lyu, Auburn University

Wen Dong, University of Buffalo

Saad Biaz, Auburn University

Abstract

Temporal difference learning and Residual Gradient methods are the most widely used temporal difference based learning algorithms; however, it has been shown that none of their objective functions is optimal w.r.t approximating the true value function $V$ . Two novel algorithms are proposed to approximate the true value function $V$ . This paper makes the following contributions:

• A batch algorithm that can help find the approximate optimal off-policy prediction of the true value function $V$.

• A linear computational cost (per step) near-optimal algorithm that can learn from a collection of off-policy samples.

• A new perspective of the emphatic temporal difference learning which bridges the gap between off-policy optimality and off-policy stability.

1 Introduction

Temporal difference (TD) learning is a widely used method in reinforcement learning. There are two fundamental problems in temporal difference learning. The first is off-policy stability. Although TD converges when samples are drawn "on-policy" by sampling from the Markov chain underlying a policy in a Markov decision process, it can be shown to diverge when samples are drawn "off-policy." Off-policy stable methods are more widely applicable since they can learn while executing an exploratory policy, learn from demonstrations, and learn multiple tasks in parallel. The second problem is optimality under function approximation. An accurate prediction of the value function greatly helps policy optimization, which is the ultimate goal of reinforcement learning tasks, whereas a poor value function prediction leads to a low-quality policy (Sutton and Barto, 1998).

Several different approaches have been explored to address the problem of off-policy temporal difference learning. Baird’s residual gradient (RG) method (Baird, 1995) is the first approach with linear complexity per step, but it requires double sampling and also converges to an inferior solution. Gordon (1996) proposed the “averager” method, which needs to store many training examples, and thus is not practical for large-scale applications. The off-policy LSTD (Yu, 2010) is off-policy convergent, but its per-step computational complexity is quadratic in the number of parameters $d$ of the function approximator. Sutton et al. (2008, 2009) proposed the family of gradient-based temporal difference (GTD) algorithms which are proven to be asymptotically off-policy convergent using stochastic approximation (Borkar, 2008).

Another direction of temporal difference learning, optimal temporal difference learning, has received comparatively little attention. It is well-known that the asymptotic solutions of TD and GTD are not the true value function $V$, but the solution of a projected fixed point equation (Sutton et al., 2009). On the other hand, the residual gradient method converges to another solution, which is often inferior to the TD solution. However, as pointed out by Scherrer (2010), both the TD and residual gradient methods can be unified as oblique projections of the true value function $V$ with different oblique projection directions, and *neither* of them is optimal in the sense of approximating the true value function $V$. To the best of our knowledge, the work most relevant to ours is the optimal Dantzig Selector TD learning (Liu et al., 2016), which aims to find the best denoising matrix for the purpose of feature selection, when the number of samples $n$ is much larger than the number of function approximation parameters $d$.

This paper attempts to improve the prediction of the value function based on the technique of oblique projection. Here is a roadmap for the rest of the paper. Section 3 introduces the relationship between the optimal approximation of the true value function $V$ and the oblique projected fixed point equations, which reduces the problem to finding the optimal oblique projection direction. Unfortunately, this direction cannot be directly computed. To this end, Section 4 proposes an approximation criterion and two algorithms, i.e., a state-aggregated batch algorithm and a state-weighted stochastic algorithm. Related work is discussed in Section 5. Section 6 presents the experimental results evaluating the effectiveness of the proposed approaches.

2 Preliminary

Reinforcement Learning (RL) (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998) is a class of learning problems in which an agent interacts with an unfamiliar, dynamic, and stochastic environment, where the agent's goal is to optimize some measure of its long-term performance. This interaction is conventionally modeled as a Markov decision process (MDP). An MDP is defined as the tuple $(\mathcal{S},\mathcal{A},P_{ss^{\prime}}^{a},R,\gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the sets of states and actions, the transition kernel $P_{ss^{\prime}}^{a}$ specifies the probability of transition from state $s\in\mathcal{S}$ to state $s^{\prime}\in\mathcal{S}$ by taking action $a\in\mathcal{A}$, $R(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function bounded by $R_{\max}$, and $0\leq\gamma<1$ is a discount factor. A stationary policy $\pi:\mathcal{S}\times\mathcal{A}\to[0,1]$ is a probabilistic mapping from states to actions. The main objective of an RL algorithm is to find an optimal policy. In order to achieve this goal, a key step in many algorithms is to calculate the value function of a given policy $\pi$, i.e., $V^{\pi}:\mathcal{S}\to\mathbb{R}$, a process known as policy evaluation. It is known that $V^{\pi}$ is the unique fixed-point of the Bellman operator $T^{\pi}$, i.e.,

$$V^{\pi}=T^{\pi}V^{\pi}=R^{\pi}+\gamma P^{\pi}V^{\pi}, \qquad (1)$$

where $R^{\pi}$ and $P^{\pi}$ are respectively the reward function and transition kernel of the Markov chain induced by policy $\pi$. In Eq. 1, we may think of $V^{\pi}$ as a $|\mathcal{S}|$-dimensional vector and write everything in vector/matrix form. We also denote $L^{\pi}:=I-\gamma P^{\pi}$. In the following, to simplify the notation, we often drop the dependence of $T^{\pi}$, $V^{\pi}$, $R^{\pi}$, and $P^{\pi}$ on $\pi$.
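As a small illustration of the vector/matrix form of Eq. 1, the sketch below solves $L^{\pi}V^{\pi}=R^{\pi}$ directly for a tiny chain. The 3-state transition matrix, the rewards, and the function name `policy_value` are hypothetical and chosen only for illustration; an exact solve like this is feasible only when $P^{\pi}$ is known and $|\mathcal{S}|$ is small.

```python
import numpy as np

def policy_value(P_pi, R_pi, gamma):
    """Solve the Bellman equation V = R + gamma * P V exactly.

    Equivalent to solving L V = R with L = I - gamma * P.
    """
    n_states = P_pi.shape[0]
    L = np.eye(n_states) - gamma * P_pi
    return np.linalg.solve(L, R_pi)

# Hypothetical 3-state Markov chain induced by some policy pi.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
R_pi = np.array([1.0, 0.0, 2.0])
gamma = 0.9

V = policy_value(P_pi, R_pi, gamma)
# V is a fixed point of the Bellman operator T(V) = R + gamma * P V.
is_fixed_point = np.allclose(V, R_pi + gamma * P_pi @ V)
print(V, is_fixed_point)
```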

We denote by $\pi_{b}$ , the behavior policy that generates the data, and by $\pi$ , the target policy that we would like to evaluate. They are the same in the on-policy setting and different in the off-policy scenario. For each state-action pair $(s_{i},a_{i})$ , such that $\pi_{b}(a_{i}|s_{i})>0$ , we define the importance-weighting factor $\rho_{i}=\pi(a_{i}|s_{i})/\pi_{b}(a_{i}|s_{i})$ with $\rho_{\max}\geq 0$ being its maximum value over the state-action pairs.

When $\mathcal{S}$ is large or infinite, we often use a linear approximation architecture for $V^{\pi}$ with parameters $\theta\in\mathbb{R}^{d}$ and $K$ -bounded basis functions $\{\varphi_{i}\}_{i=1}^{d}$ , i.e., $\varphi_{i}:\mathcal{S}\rightarrow\mathbb{R}$ and $\max_{i}||\varphi_{i}||_{\infty}\leq K$ . We denote by $\phi(\cdot):=\big{(}\varphi_{1}(\cdot),\ldots,\varphi_{d}(\cdot)\big{)}^{\top}$ the feature vector and by $\mathcal{F}$ the linear function space spanned by the basis functions $\{\varphi_{i}\}_{i=1}^{d}$ , i.e., $\mathcal{F}=\big{\{}f_{\theta}\mid\theta\in\mathbb{R}^{d}\;\text{and}\;f_{\theta}(\cdot)=\phi(\cdot)^{\top}\theta\big{\}}$ . We may write the approximation of $V$ in $\mathcal{F}$ in the vector form as $\hat{v}=\Phi\theta$ , where $\Phi$ is the $|\mathcal{S}|\times d$ feature matrix, and we denote

$$\Delta:=(I-\gamma P^{\pi})\Phi=L^{\pi}\Phi. \qquad (2)$$

When only $n$ training samples of the form $\mathcal{D}=\{(s_{i},a_{i},r_{i}=r(s_{i},a_{i}),s^{\prime}_{i})\}_{i=1}^{n},\;s_{i}\sim\xi,\;a_{i}\sim\pi_{b}(\cdot|s_{i}),\;s^{\prime}_{i}\sim P(\cdot|s_{i},a_{i})$ are available ($\xi$ is a vector representing the probability distribution over the state space $\mathcal{S}$), we denote by $\delta_{i}(\theta):=r_{i}+\gamma\phi_{i}^{\prime\top}\theta-\phi_{i}^{\top}\theta$ the TD error for the $i$-th sample $(s_{i},r_{i},s^{\prime}_{i})$ and define $\Delta\phi_{i}=\rho_{i}(\phi_{i}-\gamma\phi^{\prime}_{i})$. We also denote the sample-based state-aggregated estimate of $\Delta$ (resp. $R$) by $\hat{\Delta}$ (resp. $\hat{R}$), i.e., given the sample set $\mathcal{D}$, the $i$-th and $j$-th samples are aggregated if $s_{i}=s_{j}$, which is a standard approach used in state aggregation methods (Singh et al., 1995). Finally, we define the matrix $C$ as $C:=\mathbb{E}[\phi_{i}\phi_{i}^{\top}]$, where the expectation is w.r.t. $\xi$ and $P^{\pi_{b}}$. We also denote by $\Xi$ the diagonal matrix whose elements are $\xi(s)$, and $\xi_{\max}:=\max_{s}\xi(s)$. For each sample $i$ in the training set $\mathcal{D}$, the unbiased estimate of $C$ is $\hat{C}_{i}:=\phi_{i}\phi_{i}^{\top}$.

3 Problem Formulation

This section presents the motivation of this research, i.e., exploring the possible optimal value function approximation in a model-free reinforcement learning setting.

It is evident that given the functional space $\mathcal{F}$ and the approximation of $V$ in $\mathcal{F}$ in the vector form represented as $\hat{v}=\Phi\theta$ , the optimal approximation is $v^{*}=\Pi V$ , where $\Pi=\Phi{({\Phi^{\top}}\Xi\Phi)^{-1}}{\Phi^{\top}}\Xi$ is the weighted least-squares projection weighted by the state distribution $\Xi$ . This is obtained from $\arg\mathop{\min}\limits_{\hat{v}}||\hat{v}-V||_{\xi}^{2}$ . It is also well-known that the TD solution $\hat{v}_{TD}$ does not converge to $v^{*}$ but to the unique fixed-point solution of $\hat{v}=\Pi T\hat{v}$ . Several intuitive questions arise here, such as

1. What is the approximation error bound between $\hat{v}_{TD}$ and $V$?

2. What is the relation of representation between $\hat{v}_{TD}$ and $V$, i.e., can $\hat{v}_{TD}$ be analytically represented in terms of $V$?

The first question has been answered in (Tsitsiklis and Van Roy, 1997), where an upper bound was given as $||V-{{\hat{v}}_{TD}}|{|_{\xi}}\leq\frac{1}{{\sqrt{1-{\gamma^{2}}}}}||V-{v^{*}}|{|_{\xi}}.$ The answer to the second question requires the notion of oblique projection defined in Section 3.1.
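To give a feel for how loose this bound becomes as $\gamma$ approaches $1$, the following sketch evaluates the multiplicative factor $1/\sqrt{1-\gamma^{2}}$ for a few discount factors. The helper name `td_error_factor` and the chosen values of $\gamma$ are illustrative only.

```python
import math

def td_error_factor(gamma):
    """Factor multiplying ||V - v*||_xi in the TD error bound quoted above."""
    return 1.0 / math.sqrt(1.0 - gamma ** 2)

for gamma in (0.5, 0.9, 0.95, 0.99):
    print(f"gamma={gamma}: factor={td_error_factor(gamma):.3f}")
```

For $\gamma=0.95$, the discount factor used in the Random MDP experiments of Section 6, the factor is roughly $3.2$, so the bound allows the TD solution to be about three times worse than the best approximation in $\mathcal{F}$.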

3.1 Oblique Projection and Optimal Projection

The oblique projection tuple ( $\Phi,X$ ) is defined as follows, where $X$ is a matrix with the same size as $\Phi$ .

Definition. The Oblique Projection operator $\Pi_{\Phi}^{X}$ is defined as

$$\Pi_{\Phi}^{X}=\Phi(X^{\top}\Phi)^{-1}X^{\top}, \qquad (3)$$

which specifies a projection orthogonal to $span(X)$ and onto $span(\Phi)$. It can be easily deduced that the weighted orthogonal projection $\Pi$ can be written as $\Pi=\Pi_{\Phi}^{\Xi\Phi}$.

It is easy to verify that the projected fixed point equation in temporal difference learning, $\hat{v}=\Pi{T^{\pi}}(\hat{v})$ , can be extended to a more general setting by extending the weighted least-squares projection operator to oblique projection operator as

$$\hat{v}=\Pi_{\Phi}^{X}T^{\pi}(\hat{v}). \qquad (4)$$

It turns out that both TD and RG solutions are oblique projections with different $X$ , where ${X_{TD}}=\Xi\Phi,{X_{RG}}=\Xi L^{\pi}\Phi$ (Scherrer, 2010). One may be interested in the relation between the true value function $V$ and the solutions of the fixed-point equation. The relation is shown in Lemma 1.
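The definition of the oblique projection can be checked numerically. The sketch below, on hypothetical random features and a random state distribution, builds $\Pi_{\Phi}^{X}$ and verifies three properties: it coincides with the weighted projection $\Pi$ when $X=\Xi\Phi$, it is idempotent, and its residual is orthogonal to $span(X)$. The function name `oblique_projection` and the problem sizes are assumptions made for this example.

```python
import numpy as np

def oblique_projection(Phi, X):
    """Return Pi = Phi (X^T Phi)^{-1} X^T (assumes X^T Phi is invertible)."""
    return Phi @ np.linalg.solve(X.T @ Phi, X.T)

rng = np.random.default_rng(0)
n_states, d = 6, 2
Phi = rng.standard_normal((n_states, d))   # hypothetical feature matrix
xi = rng.random(n_states)
xi /= xi.sum()                             # hypothetical state distribution
Xi = np.diag(xi)

# Weighted least-squares projection Pi = Phi (Phi^T Xi Phi)^{-1} Phi^T Xi.
Pi_weighted = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

X = Xi @ Phi
Pi_oblique = oblique_projection(Phi, X)

same = np.allclose(Pi_weighted, Pi_oblique)
idempotent = np.allclose(Pi_oblique @ Pi_oblique, Pi_oblique)
# The residual (I - Pi) y is orthogonal to every column of X.
residual_orthogonal = np.allclose(X.T @ (np.eye(n_states) - Pi_oblique), 0.0)
print(same, idempotent, residual_orthogonal)
```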

Lemma 1.

(Scherrer, 2010) The solution of the oblique projected fixed-point equation $\hat{v}=\Pi_{\Phi}^{X}T(\hat{v})$ w.r.t. the oblique projection $\Pi_{\Phi}^{X}$ can be represented as the oblique projection $\Pi_{\Phi}^{L^{\pi\top}X}$ of the true value function $V$, i.e.,

$$\hat{v}=\Pi_{\Phi}^{X}T^{\pi}(\hat{v})=\Pi_{\Phi}^{L^{\pi\top}X}V, \qquad (5)$$

where $L^{\pi}=(I-\gamma P^{\pi})$ .

Proof.

Please refer to Scherrer (2010) for a detailed proof. ∎

Remark: Lemma 1 helps to identify the equivalence between oblique projection of the true value function $V$ , i.e., $\Pi_{\Phi}^{L^{\pi\top}X}V$ and the solution of the oblique projected fixed-point equation, i.e., $\hat{v}=\Pi_{\Phi}^{X}T^{\pi}\hat{v}$ . Figure 1 is an illustration of the oblique projection.
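Lemma 1 can be sanity-checked numerically. The sketch below uses a hypothetical random Markov chain, rewards, and features, computes the fixed point of $\hat{v}=\Pi_{\Phi}^{X}T\hat{v}$ by solving $X^{\top}L\Phi\theta=X^{\top}R$ (which follows from substituting $\hat{v}=\Phi\theta$ and assuming $\Phi$ has full column rank), and compares it with $\Pi_{\Phi}^{L^{\top}X}V$ for the TD and RG choices of $X$. All sizes, seeds, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, d, gamma = 8, 3, 0.9

# Hypothetical Markov chain, rewards, features, and state weights.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.standard_normal(n_states)
Phi = rng.standard_normal((n_states, d))
xi = rng.random(n_states)
xi /= xi.sum()
Xi = np.diag(xi)

L = np.eye(n_states) - gamma * P
V = np.linalg.solve(L, R)          # true value function

def oblique_projection(Phi, X):
    return Phi @ np.linalg.solve(X.T @ Phi, X.T)

def fixed_point(X):
    # v = Pi_Phi^X (R + gamma P v), v = Phi theta  =>  X^T L Phi theta = X^T R
    theta = np.linalg.solve(X.T @ L @ Phi, X.T @ R)
    return Phi @ theta

results = {}
for name, X in {"TD": Xi @ Phi, "RG": Xi @ L @ Phi}.items():
    v_fp = fixed_point(X)
    v_proj = oblique_projection(Phi, L.T @ X) @ V
    results[name] = bool(np.allclose(v_fp, v_proj))
print(results)
```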

An intuitive question to ask is what the best oblique projection $X$ is. Is it TD, RG, some interpolation between them, or none of the above? To answer this question, we present the following lemma, which is the workhorse of this paper.

Lemma 2.

(Scherrer, 2010) Given $\Phi$, if $V$ does not lie in $span(\Phi)$, the optimal approximation is $v^{*}=\Pi V=\Phi(\Phi^{\top}\Xi\Phi)^{-1}\Phi^{\top}\Xi V$, and the corresponding oblique projection $X^{*}$ in the fixed point equation

$$v^{*}=\Pi_{\Phi}^{X^{*}}Tv^{*} \qquad (6)$$

is

$$X^{*}=(L^{\pi\top})^{-1}\Xi\Phi. \qquad (7)$$

Proof.

From Lemma 1, we know that $X^{*}$ satisfies $v^{*}=\Pi_{\Phi}^{(L^{\pi\top})X^{*}}V$. Setting $\Pi_{\Phi}^{(L^{\pi\top})X^{*}}=\Pi=\Pi_{\Phi}^{\Xi\Phi}$, we have

$$(L^{\pi\top})X^{*}=\Xi\Phi, \qquad (8)$$

and thus we obtain Eq. (7), which completes the proof. ∎

Although the analytical formulation of $X^{*}$ is clear, it is intractable to compute. The major reason is that computing $(L^{\pi\top})^{-1}$ is prohibitive since the exact $P^{\pi}$ is not known. This paper presents techniques to compute $X^{*}$ approximately in the following.
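When the model is known (which is not the setting of this paper, but is convenient for a sanity check), Lemma 2 can be verified directly. The sketch below, on a hypothetical random chain, forms $X^{*}=(L^{\top})^{-1}\Xi\Phi$, solves the corresponding fixed-point equation, and checks that the result equals $\Pi V$ and is no worse than the TD and RG solutions in the $\xi$-weighted norm. Names, sizes, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, d, gamma = 8, 3, 0.9

# Hypothetical random Markov chain, rewards, features, and state weights.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.standard_normal(n_states)
Phi = rng.standard_normal((n_states, d))
xi = rng.random(n_states)
xi /= xi.sum()
Xi = np.diag(xi)

L = np.eye(n_states) - gamma * P
V = np.linalg.solve(L, R)

def fixed_point(X):
    """Solution Phi theta of v = Pi_Phi^X T v, via X^T L Phi theta = X^T R."""
    theta = np.linalg.solve(X.T @ L @ Phi, X.T @ R)
    return Phi @ theta

def weighted_error(v):
    """xi-weighted norm of V - v."""
    return float(np.sqrt(xi @ (V - v) ** 2))

X_star = np.linalg.solve(L.T, Xi @ Phi)     # (L^T)^{-1} Xi Phi, needs the model
v_star = fixed_point(X_star)
v_best = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ V)  # Pi V

v_td = fixed_point(Xi @ Phi)
v_rg = fixed_point(Xi @ L @ Phi)
print(weighted_error(v_star), weighted_error(v_td), weighted_error(v_rg))
```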

4 Algorithm Design

Given the knowledge of the oblique projection and the computational intractability of $X^{*}$, a criterion is proposed to approximate $X^{*}$. Based on this criterion, two algorithms are proposed. The first is based on a state-aggregated two-stage approximation, and the second is based on a state-dependent diagonalized approximation.

4.1 Approximate Criteria

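Before presenting the algorithm design, we first introduce a simple but important property of the optimal projection matrix $X^{*}$. Since $X^{*}=(L^{\pi\top})^{-1}\Xi\Phi$ and $\Delta=L^{\pi}\Phi$, we have

$$\Delta^{\top}X^{*}=\Phi^{\top}(L^{\pi\top})\big((L^{\pi\top})^{-1}\Xi\Phi\big)=\Phi^{\top}\Xi\Phi=C. \qquad (9)$$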

Motivated by this, Proposition 1 is presented to formulate the cornerstone of this paper.

Proposition 1.

For the state-aggregated $\hat{\Delta}$, we have

$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}]=L^{\pi}\Phi, \qquad (10)$$

and thus for the optimal oblique projection $X^{*}$ and the corresponding $v^{*}=\Phi\theta^{*}$ , the following holds

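$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}]^{\top}X^{*}=C,\qquad X^{*\top}\mathbb{E}_{\pi_{b}}[\hat{\Delta}]\theta^{*}=X^{*\top}R. \qquad (11)$$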

Proof.

Eq. (10) is derived as follows,

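$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}]=\sum_{a_{i}}\pi_{b}(a_{i}\mid s_{i})\frac{\pi(a_{i}\mid s_{i})}{\pi_{b}(a_{i}\mid s_{i})}(\Delta\phi_{i})^{\top}=\sum_{a_{i}}\pi(a_{i}\mid s_{i})(\Delta\phi_{i})^{\top}=L^{\pi}\Phi. \qquad (12)$$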

Then we have

$$\mathbb{E}_{\pi_{b}}[\hat{\Delta}^{\top}]X^{*}=\Phi^{\top}\Xi\Phi=C. \qquad (13)$$

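Inserting Eq. (10), $\mathbb{E}_{\xi}[\hat{C}]=C$, and $\mathbb{E}_{\xi}[\hat{R}]=R$ into Eq. (6), we have

$$\begin{aligned}\Phi\theta^{*}&=\Phi(X^{*\top}\Phi)^{-1}X^{*\top}(R+\gamma\Phi^{\prime}\theta^{*})\\ \theta^{*}&=(X^{*\top}\Phi)^{-1}X^{*\top}(R+\gamma\Phi^{\prime}\theta^{*})\\ X^{*\top}\Phi\theta^{*}&=X^{*\top}(R+\gamma\Phi^{\prime}\theta^{*})\\ X^{*\top}(\Phi-\gamma\Phi^{\prime})\theta^{*}&=X^{*\top}R\\ X^{*\top}\Delta\theta^{*}&=X^{*\top}R.\end{aligned} \qquad (14)$$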

This completes the proof. ∎

4.2 Two-Stage State-Aggregated Batch Algorithm

Motivated by Proposition 1, the following two-stage near-optimal off-policy TD algorithm is proposed, where the first step is

$$\hat{X}=\arg\min_{X}\frac{1}{2}||\hat{\Delta}^{\top}X-\hat{C}||_{F}^{2}. \qquad (15)$$

This problem is a well-defined convex problem, and there exists a unique solution. When $(\hat{\Delta}\hat{\Delta}^{\top})$ is nonsingular, the closed-form least-squares solution is computed as

$$\hat{X}=(\hat{\Delta}\hat{\Delta}^{\top})^{-1}(\hat{\Delta}C). \qquad (16)$$

On the other hand, if $({\hat{\Delta}}{\hat{\Delta}}^{\top})$ is singular, which is more general, Eq. (15) can be solved via gradient descent method and can be further accelerated by Nesterov’s accelerated gradient method (Nesterov, 2004). The second step is to compute $\hat{\theta}$ , i.e.,

$$\hat{\theta}=\arg\min_{\theta}||\hat{X}^{\top}(\hat{\Delta}\theta-\hat{R})||_{\xi}^{2}. \qquad (17)$$

This is a well-defined convex problem, and the solution is unique and can be easily computed via the gradient descent method. When $(\hat{X}^{\top}\hat{\Delta})$ is nonsingular, $\hat{\theta}$ can be obtained directly via the one-shot least-squares solution

$$\hat{\theta}=(\hat{X}^{\top}\hat{\Delta})^{-1}\hat{X}^{\top}R. \qquad (18)$$

Based on these, we propose the State-aggregated Optimal TD Algorithm (SOTD) as follows.
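Since the algorithm listing itself is not reproduced here, the following is only a minimal sketch of the two stages described above, not the authors' Algorithm 1. It assumes $\hat{\Delta}$ is $|\mathcal{S}|\times d$, and, instead of gradient descent, it uses `numpy.linalg.lstsq`, which returns the minimum-norm least-squares solution when $\hat{\Delta}\hat{\Delta}^{\top}$ is singular. That choice, the function name `sotd_sketch`, and the use of exact model quantities in place of state-aggregated sample estimates are all assumptions made for illustration.

```python
import numpy as np

def sotd_sketch(Delta_hat, C_hat, R_hat):
    """Two-stage sketch: estimate X_hat, then solve for theta_hat."""
    # Stage 1: X_hat = argmin_X 0.5 * ||Delta_hat^T X - C_hat||_F^2.
    # lstsq picks the minimum-norm minimizer when the system is underdetermined.
    X_hat, *_ = np.linalg.lstsq(Delta_hat.T, C_hat, rcond=None)
    # Stage 2: theta_hat solves X_hat^T Delta_hat theta = X_hat^T R_hat.
    theta_hat, *_ = np.linalg.lstsq(X_hat.T @ Delta_hat, X_hat.T @ R_hat,
                                    rcond=None)
    return X_hat, theta_hat

# Hypothetical problem; exact quantities stand in for sample-based estimates.
rng = np.random.default_rng(3)
n_states, d, gamma = 10, 3, 0.9
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.standard_normal(n_states)
Phi = rng.standard_normal((n_states, d))
xi = np.full(n_states, 1.0 / n_states)

Delta = (np.eye(n_states) - gamma * P) @ Phi    # Delta = L Phi
C = Phi.T @ np.diag(xi) @ Phi                   # C = Phi^T Xi Phi

X_hat, theta_hat = sotd_sketch(Delta, C, R)
print(X_hat.shape, theta_hat)
```

Note that the stage-1 criterion has many minimizers when $|\mathcal{S}|>d$, so the quality of $\hat{\theta}$ depends on which minimizer is selected; the minimum-norm choice here is only one option.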

4.3 State-Dependent Optimal Off-Policy TD learning

Algorithm 1 can find the near-optimal projection matrix. However, computing $\hat{X}$ in this way has an apparent drawback in computational complexity. Note that $\hat{X}$ is a $|\mathcal{S}|\times d$ matrix, which is computationally costly in large-scale reinforcement learning problems where the number of states $|\mathcal{S}|$ is large, or in continuous state spaces. To this end, the following algorithm is designed to tackle the difficulties mentioned above.

In real applications where $d\ll|\mathcal{S}|$ or the state space is continuous, the proposed algorithm would not work well in practice since it has to compute a $|\mathcal{S}|\times d$ matrix $\hat{X}$ . A desirable way out is to approximate the $(L^{\pi\top})^{-1}$ with a diagonal matrix $\Omega$ , such that each row of $\Omega$ does not depend on other states, but only on its corresponding state. With such an assumption, $\hat{X}$ can be represented via a product of matrices as follows,

$$\hat{X}=\Omega\Xi\Phi, \qquad (19)$$

where $\Omega$ is a $|\mathcal{S}|\times|\mathcal{S}|$ diagonal matrix. The $i$ -th diagonal entry of $\Omega$ is denoted as $\omega_{i}$ , i.e., $\Omega_{ii}:=\omega_{i}$ . We term $\Omega$ as “state-dependent” diagonal matrix. Then the optimization problem reduces to

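$$\hat{\Omega}=\arg\min_{\Omega}\frac{1}{2}||\hat{\Delta}^{\top}\Omega\Xi\Phi-\hat{C}||_{F}^{2},\quad \text{s.t.}\;\Omega_{ij}=0,\;i\neq j. \qquad (20)$$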

It is easy to prove the following

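$$\hat{\Delta}^{\top}\Omega\Xi\Phi=\mathbb{E}[\rho_{i}\omega_{i}\phi_{i}\Delta\phi_{i}^{\top}]. \qquad (21)$$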

Based on the assumption that $\omega_{i}$ should be only (current) state-dependent, we have the following relaxed objective function, i.e., for the $i$ -th sample,

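$$\forall i,\quad \omega_{i}=\arg\min_{\omega}||\phi_{i}(\omega\rho_{i}\Delta\phi_{i}-\phi_{i})^{\top}||_{F}^{2}. \qquad (22)$$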

Trace norm minimization can also be used, i.e.,

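$$\omega_{i}=\arg\min_{\omega}||\phi_{i}(\omega\rho_{i}\Delta\phi_{i}-\phi_{i})^{\top}||_{*}. \qquad (23)$$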

Two issues arise here:

• Computational cost. Trace norm minimization is usually more computationally expensive since it involves the singular value decomposition (SVD) operation.

• Choice of the norm. The issue here is to select the best norm as the objective function. Although this problem has already been discussed in the literature, it is not clear at first glance which norm minimization would achieve the best result in our problem.

We will resolve these two concerns by scrutinizing the structure of the problem. Notice that Eq. (22) can be written as ${\omega_{i}}=\arg\mathop{\min}\limits_{\omega}||{\phi_{i}}{(\omega{\rho_{i}}\Delta{\phi_{i}}-{\phi_{i}})^{\top}}||_{F}^{2}$ . Since ${\phi_{i}}{(\omega{\rho_{i}}\Delta{\phi_{i}}-{\phi_{i}})^{\top}}$ is a rank- $1$ matrix, the solution is identical w.r.t Frobenius norm and trace norm, and the closed-form solution is

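$$\omega_{i}=\frac{\Delta\phi_{i}^{\top}\phi_{i}}{\rho_{i}\Delta\phi_{i}^{\top}\Delta\phi_{i}}. \qquad (24)$$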

Interested readers will find a detailed derivation in the Appendix. The update law is thus as follows,

$$\theta_{i+1}=\theta_{i}+\alpha_{i}\rho_{i}\omega_{i}\delta_{i}\phi_{i}. \qquad (25)$$

Samples with zero importance ratio (i.e., $\rho_{i}=0$) are discarded. We are now ready to formulate the Optimal Off-Policy TD (O2TD) algorithm as in Algorithm 2. It is easy to verify that the computational cost per step is $O(d)$, as can be seen from the computation of Eq. (24) and (25).
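A single step of the update in Eq. (24) and (25) can be sketched as follows, using the definitions $\Delta\phi_{i}=\rho_{i}(\phi_{i}-\gamma\phi^{\prime}_{i})$ and $\delta_{i}(\theta)=r_{i}+\gamma\phi_{i}^{\prime\top}\theta-\phi_{i}^{\top}\theta$ from Section 2. This is a minimal sketch rather than the authors' Algorithm 2: the function name `o2td_step`, its argument names, and the guard against a zero denominator are assumptions introduced here.

```python
import numpy as np

def o2td_step(theta, phi, phi_next, reward, rho, gamma, alpha):
    """One O(d) update following Eq. (24) and Eq. (25)."""
    if rho == 0.0:
        return theta                      # zero importance ratio: discard sample
    delta_phi = rho * (phi - gamma * phi_next)
    denom = rho * (delta_phi @ delta_phi)
    if denom == 0.0:
        return theta                      # assumption: skip degenerate samples
    omega = (delta_phi @ phi) / denom     # Eq. (24)
    td_error = reward + gamma * (phi_next @ theta) - phi @ theta
    return theta + alpha * rho * omega * td_error * phi   # Eq. (25)

# Hypothetical single transition with two features.
theta = np.zeros(2)
phi = np.array([1.0, 0.0])
phi_next = np.array([0.0, 1.0])
theta_new = o2td_step(theta, phi, phi_next, reward=1.0, rho=2.0,
                      gamma=0.5, alpha=0.1)
print(theta_new)
```

Every operation is an inner product or a scaled vector addition in $\mathbb{R}^{d}$, which is where the $O(d)$ per-step cost comes from.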

5 Related Work

One line of work related to optimal temporal difference learning is emphatic temporal difference learning (ETD) by Sutton et al. (2015). That work was motivated by the off-policy convergence issue, and we shed new light on the algorithms from the optimality perspective. Similar to O2TD, ETD also assumes that the optimal projection $X^{*}$ can be approximated by the product of the diagonal matrices $\Omega,\Xi$ and the matrix $\Phi$, i.e., the near-optimal projection matrix is formulated as in Eq. (19). Then a different technique is used based on the power series expansion, i.e.,

[TABLE]

Then the power series expansion is used to compute $\Omega\Xi$ as a whole. Since the optimal oblique projection matrix is ${X^{*}}={({L^{\pi\top}})^{-1}}\Xi\Phi$ , it is evident that $\hat{X}=\Omega\Xi\Phi$ should be as close as possible to $X^{*}$ , especially the diagonal elements. The diagonal elements of $\hat{X}$ are represented as a (column) vector $f$ . One conjecture is that for the diagonal matrix of $\hat{X}$ , it is desired that $f={({L^{\pi\top}})^{-1}}{\xi}$ . By using the power series expansion, $f$ can be expanded as

[TABLE]

Readers familiar with the emphatic TD learning algorithm know that this is actually identical to Equation (13) in the paper by Sutton et al. (2015), where a scalar follow-on trace is computed as (footnote 1: we use the subscript $\bullet_{t}$ to denote sequential samples, and the subscript $\bullet_{i}$ to denote samples that are randomly sampled with replacement)

[TABLE]

and it turns out that

[TABLE]

which will lead to the standard emphatic TD([math]) algorithm,

[TABLE]

Due to space limitations, we refer interested readers to (Sutton et al., 2015) for more details of the algorithm, and (Hallak et al., 2015; Yu, 2015) for more theoretical analysis. It should also be noted that although this section does not provide any further extension of the ETD algorithm regarding algorithm design and analysis, to the best of our knowledge, it is the first time associating the ETD algorithm with near optimal temporal difference learning. This sheds a helpful light in understanding the family of the emphatic TD learning algorithms and the design of the follow-on trace. However, the ETD algorithm requires sequential sampling condition, i.e., $s^{\prime}_{t}=s_{t+1},\forall t>0$ , which is not suitable for a set of samples collected from many episodes.

6 Experimental Study

This section evaluates the effectiveness of the proposed algorithms. The effectiveness of SOTD algorithm is illustrated via comparison to LSTD, which is also a batch TD algorithm. A comparison study of O2TD is conducted with GTD2 and ETD as three off-policy convergent TD algorithms with linear computational cost per step.

6.1 Experimental Study of SOTD

The effectiveness of the proposed SOTD algorithm is shown by comparing the performance on the $400$ -state Random MDP domain (Dann et al., 2014) with LSTD (Bradtke and Barto, 1996; Boyan, 1999) algorithm, which is one of the most sample-efficient algorithms to the best of our knowledge. Two widely used measurements in TD learning, Mean-Squares Projected Bellman Error (MSPBE) (Sutton et al., 2009; Dann et al., 2014) and Mean-Squares Error (MSE) are used as the error measurements.

This domain is a randomly generated MDP with $400$ states and $10$ actions (Dann et al., 2014). The transition probabilities are defined as $P(s^{\prime}|s,a)\propto p_{ss^{\prime}}^{a}+10^{-5}$, where $p_{ss^{\prime}}^{a}\sim U[0,1]$. The behavior policy $\pi_{b}$, the target policy $\pi$, and the start distribution are sampled in a similar manner. Each state is represented by a $201$-dimensional feature vector, where $200$ of the features were sampled from a uniform distribution, and the last feature was a constant one. The discount factor is set to $\gamma=0.95$. The number of features $d=200$, and we compare the performance of LSTD and SOTD with different numbers of training samples $n$, as shown in Figure 2. As Figure 2 shows, with relatively small sample size $n$, SOTD tends to be even more sample-efficient than the LSTD algorithm.
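The domain construction described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark code of Dann et al. (2014): the function names are hypothetical, and sampling the policies as normalized uniform draws is an assumed reading of "sampled in a similar manner." A smaller MDP is used in the usage lines to keep the example quick.

```python
import numpy as np

def random_mdp(n_states, n_actions, rng, eps=1e-5):
    """Transition tensor with P[s, a, s'] proportional to U[0, 1] + eps."""
    P = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states)) + eps
    P /= P.sum(axis=2, keepdims=True)
    return P

def random_policy(n_states, n_actions, rng, eps=1e-5):
    """Assumed construction: each row pi[s, :] is a normalized uniform draw."""
    pi = rng.uniform(0.0, 1.0, size=(n_states, n_actions)) + eps
    pi /= pi.sum(axis=1, keepdims=True)
    return pi

rng = np.random.default_rng(0)
n_states, n_actions = 20, 4            # the experiments use 400 states, 10 actions
P = random_mdp(n_states, n_actions, rng)
pi_b = random_policy(n_states, n_actions, rng)    # behavior policy
pi = random_policy(n_states, n_actions, rng)      # target policy

# Markov chain induced by the target policy: P_pi[s, s'] = sum_a pi[s, a] P[s, a, s'].
P_pi = np.einsum("sa,sat->st", pi, P)
# Importance-weighting factors rho(s, a) = pi(a|s) / pi_b(a|s).
rho = pi / pi_b
print(P_pi.shape, rho.max())
```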

6.2 Experimental Study of O2TD

This section compares the previous GTD2 and ETD methods with the O2TD method on various domains with regard to their value function approximation performance. Since the major focus of this paper is value function approximation, comparisons on control learning performance are not reported. We use $\alpha_{\rm{E}}$, $\alpha_{\rm{O}}$, and $\alpha_{\rm{G}}$ to denote the stepsizes for ETD, O2TD, and GTD2, respectively. Root Mean-Squares Projected Bellman Error (RMSPBE) and Root Mean-Squares Error (RMSE) are used for better visualization.

6.2.1 Baird Domain

The Baird example (Baird, 1995) is a well-known example to test the performance of off-policy convergent algorithms. The constant stepsizes are $\alpha_{\rm{O}}=0.006$ and $\alpha_{\rm{G}}=0.005$, which are chosen via comparison studies as in (Dann et al., 2014). The Monte-Carlo estimation of the true value function $V$ is conducted as in (Dann et al., 2014). Figure 3 shows the RMSPBE and RMSE curves of GTD2 and O2TD over $5000$ steps, averaged over $20$ runs. As can be seen from Figure 3, although the variance of O2TD is larger than GTD2's, O2TD has a significant improvement over the GTD2 algorithm wherein the RMSPBE, the RMSE and the variance are all substantially reduced. The low variance of the GTD2 learning curve can be explained by the advantage of stochastic gradient against stochastic approximation method, as explained in Liu et al. (2015).

6.2.2 $400$ -State Random MDP

The randomly generated MDP with $400$ states and $10$ actions used in Section 6.1 is adopted as the second task. For sequential sampling (Figure 4), the constant stepsizes are $\alpha_{\rm{E}}=3\times 10^{-6}$, $\alpha_{\rm{O}}=0.0007$, and $\alpha_{\rm{G}}=0.002$. For random sampling (Figure 5), the constant stepsizes are $\alpha_{\rm{E}}=2\times 10^{-6}$, $\alpha_{\rm{O}}=0.0006$, and $\alpha_{\rm{G}}=0.0009$. The Monte-Carlo estimation of the true value function $V$ is conducted as in (Dann et al., 2014). ETD tends to diverge easily with large stepsizes on this domain, so $\alpha_{\rm{E}}$ is set to be very small. As Figure 4 and Figure 5 show, O2TD performs the best overall on this domain, although its variance is relatively larger than GTD2's.

6.2.3 Mountain Car

This section uses the mountain car example to evaluate the validity of the proposed algorithm. The mountain car MDP is an optimal control problem with a continuous two-dimensional state space. The steep discontinuity in the value function makes learning difficult. The Fourier basis (Konidaris et al., 2011), a fixed basis set, is used. An empirically good policy $\pi$ was obtained first, and then we ran this policy $\pi$ to collect the trajectories that comprise the dataset. On-policy policy evaluation of $\pi$ is then conducted using the collected samples. For sequential sampling, the constant stepsizes are $\alpha_{\rm{E}}=0.001$, $\alpha_{\rm{O}}=0.1$, and $\alpha_{\rm{G}}=0.2$. For random sampling, the constant stepsizes are $\alpha_{\rm{E}}=0.0002$, $\alpha_{\rm{O}}=0.05$, and $\alpha_{\rm{G}}=0.06$. The Monte-Carlo estimate of $V$ is computed via $100$ runs. As Figure 6 and Figure 7 show, GTD2 appears to perform the worst on this domain, and O2TD tends to converge faster than ETD.

7 Conclusion

This paper poses an interesting question:

• How to improve the approximation quality of the true value function $V$?

To this end, several algorithms are proposed that apply to different scenarios. Experimental studies support the effectiveness of the proposed algorithms in different learning settings.

The major contribution is not to propose another new TD algorithm with linear computational complexity per step, but to make an attempt to explore the optimal prediction of the value function in model-free policy evaluation. There are numerous promising directions for future work along this line of research. One is to explore the relation between the near-optimal projection matrix and eligibility traces, and whether their combination can improve value function prediction performance. Another is motivated by the fact that the current computationally tractable criteria for computing $X^{*}$ are based on Proposition 1 and the power series expansion of $(L^{\pi})^{-1}$; it would be very intriguing to explore whether other computationally tractable criteria exist.

Appendix

Details of Eq. (24)

To obtain Eq. (24), we first introduce the following Lemmas to compute the singular value of rank- $1$ matrices.

Lemma 3.

Let $G=pq^{\top}$ be a rank-$1$ real square matrix, where $p,q$ are vectors of the same length. Then the eigenvalues of $G$ are

[TABLE]

i.e., $G$ has only one nonzero eigenvalue $p^{\top}q$, and all other eigenvalues are $0$, and thus we also have

[TABLE]

where ${\rm{Tr}}(\cdot)$ is the trace of a matrix.
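Lemma 3 can be checked numerically. The sketch below uses NumPy and two arbitrarily chosen vectors `p` and `q` (illustrative values, not from the paper) to confirm that $pq^{\top}$ has a single nonzero eigenvalue equal to $p^{\top}q$ and that its trace takes the same value.

```python
import numpy as np

# Arbitrary illustrative vectors of the same length.
p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([2.0, -1.0, 0.5, 3.0])

G = np.outer(p, q)  # rank-1 square matrix G = p q^T
eigvals = np.linalg.eigvals(G)

# Split off the eigenvalue of largest magnitude from the rest.
idx = np.argmax(np.abs(eigvals))
dominant = eigvals[idx]
others = np.delete(eigvals, idx)

inner = p @ q  # p^T q
trace = np.trace(G)
```

Here `dominant` and `trace` both agree with `inner` up to floating-point error, and the entries of `others` are numerically zero.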

Based on Lemma 3, we introduce Lemma 4.

Lemma 4.

A rank- $1$ real matrix (not necessarily square) $M=uv^{\top}$ has only one nonzero singular value $\sigma_{\max}(M)=||u||_{2}\cdot||v||_{2}$ , where $||\cdot||_{2}$ is the $\ell_{2}$ -norm of a vector, and the Frobenius norm and the trace norm of $M$ are identical, i.e.,

$$||M||_{*}=||M||_{F}=||u||_{2}\cdot||v||_{2}.$$

Proof.

We use $M^{H}$ to denote the conjugate transpose of the matrix $M$ , $\lambda(\cdot)$ to denote the eigenvalues of a square matrix, and $\lambda(\cdot)$ to denote the nonzero eigenvalue of a matrix. Then we have

[TABLE]

From Lemma 3, we know that the eigenvalues $\lambda(vv^{\top})$ are $\{{v^{\top}}v,0,0,\cdots\}$ , and thus $M$ has only one nonzero singular value ${\sigma_{\max}}(M)$ , which is

$${\sigma_{\max}}(M)=||u||_{2}\cdot||v||_{2},$$

and all other singular values of $M$ are $0$ . Thus $||M||_{*}=||M||_{F}=||u||_{2}\cdot||v||_{2}$ , which completes the proof. ∎
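A quick numerical check of Lemma 4, again as a sketch with arbitrarily chosen vectors (`u` and `v` are illustrative and of different lengths, so $M$ is not square). It computes the singular values of $uv^{\top}$ and compares the trace (nuclear) norm and the Frobenius norm with $||u||_{2}\cdot||v||_{2}$.

```python
import numpy as np

# Illustrative vectors: ||u||_2 = 3 and ||v||_2 = 5.
u = np.array([1.0, -2.0, 2.0])
v = np.array([3.0, 0.0, 4.0, 0.0])

M = np.outer(u, v)  # rank-1 matrix of shape (3, 4)

# Singular values, returned in descending order.
singular_values = np.linalg.svd(M, compute_uv=False)

trace_norm = np.linalg.norm(M, ord="nuc")      # sum of singular values
frobenius_norm = np.linalg.norm(M, ord="fro")  # sqrt of sum of squared entries
product_of_norms = np.linalg.norm(u) * np.linalg.norm(v)
```

Only the first singular value is nonzero, and `trace_norm`, `frobenius_norm`, and `product_of_norms` coincide up to floating-point error. This is the property that lets the trace norm of each rank- $1$ term be replaced by a product of vector norms in the derivation that follows.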

Based on Lemma 4, we now show the derivation of Eq. (24). To tackle the following trace norm minimization formulation,

[TABLE]

we need to utilize the structure of the rank- $1$ matrices. We have

[TABLE]

We denote $q_{i}(\omega):={(\omega{\rho_{i}}\Delta{\phi_{i}}-{\phi_{i}})^{\top}}$ , and thus we have

[TABLE]

The second equality follows from Eq. (34), and the third equality follows from the fact that $||{\phi_{i}}||_{2}$ does not depend on $\omega$ . This is equivalent to the following,

[TABLE]

On the other hand, if we use $||\cdot||^{2}_{F}$ instead of trace norm minimization as in Eq. (36), we have

[TABLE]

And since

[TABLE]

The first equality follows from the fact that, for a matrix $M$ ,

[TABLE]

The third equality follows from Eq. (39). We can then see that problem (40) is also equivalent to Eq. (39), as verified by Lemma 4.

Taking the gradient of the right-hand side of Eq. (39) yields Eq. (24) as the final result.

Bibliography

1. Baird [1995] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30–37, 1995.
2. Bertsekas and Tsitsiklis [1996] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
3. Borkar [2008] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
4. Boyan [1999] J. A. Boyan. Least-squares temporal difference learning. In Proceedings of the 16th International Conference on Machine Learning, pages 49–56. Morgan Kaufmann, San Francisco, CA, 1999.
5. Bradtke and Barto [1996] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
6. Dann et al. [2014] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15:809–883, 2014.
7. Gordon [1996] G. J. Gordon. Stable fitted reinforcement learning. Advances in Neural Information Processing Systems, pages 1052–1058, 1996.
8. Hallak et al. [2015] A. Hallak, A. Tamar, R. Munos, and S. Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. arXiv preprint arXiv:1509.05172, 2015.