Expected Sarsa($\lambda$) with Control Variate for Variance Reduction

Long Yang; Yu Zhang; Jun Wen; Qian Zheng; Pengfei Li; Gang Pan

arXiv:1906.11058·cs.LG·September 9, 2019

TL;DR

This paper introduces a variance reduction technique for off-policy reinforcement learning algorithms using control variates in Expected Sarsa(λ), resulting in lower variance and improved convergence properties compared to existing methods.

Contribution

The paper proposes the ES(λ)-CV algorithm with control variates for variance reduction and extends it to GES(λ) for convergence with linear function approximation.

Findings

01. ES(λ)-CV has lower variance than Expected Sarsa(λ).

02. GES(λ) achieves a convergence rate of O(1/T).

03. Numerical experiments show better performance than state-of-the-art algorithms.

Table 1: Convergence Rate of Gradient Temporal Difference Learning

Algorithm	Reference	Step-size	Convergence Rate
$\mathtt{TD}(0)$	(?)	$\alpha_{t}=\mathcal{O}(\frac{1}{t^{\eta}})$, $\eta\in(0,1)$	$\mathcal{O}(1/\sqrt{T})$
$\mathtt{TD}(0)$	(?)	$\sum_{t=1}^{\infty}\alpha_{t}=\infty$	$\mathcal{O}(e^{-\frac{\sigma}{2}T^{1-\eta}}+\frac{1}{T^{\eta}})$
$\mathtt{GTD}(0)$	(?)	$\sum_{t=1}^{\infty}\alpha_{t}=\infty$, $\frac{\beta_{t}}{\alpha_{t}}\to 0$	$\mathcal{O}((1/T)^{\frac{1-\kappa}{3}})$
$\mathtt{GTD}$	(?)	constant step-size	$\mathcal{O}(1/\sqrt{T})$
$\mathtt{GTD}$	(?)	$\sum_{t=1}^{\infty}\alpha_{t}=\infty$, $\frac{\sum_{t=1}^{T}\alpha_{t}^{2}}{\sum_{t=1}^{T}\alpha_{t}}<\infty$	$\mathcal{O}(1/\sqrt{T})$
$\mathtt{GTB}$ / $\mathtt{GRetrace}$	(?)	$\alpha_{t},\beta_{t}=\mathcal{O}(\frac{1}{t})$	$\mathcal{O}(1/T)$
Ours		constant step-size	$\mathcal{O}(1/T)$

Equations

$\mathcal{B}^{\pi}q^{\pi}=q^{\pi},$

$\mathcal{B}^{\pi}:q\mapsto R+\gamma P^{\pi}q,$

$P^{\pi}_{ss'}=\sum_{a\in\mathcal{A}}\pi(a|s)P^{a}_{ss'},\quad R(s,a)=\mathcal{R}_{s}^{a}.$

$Q(S_{t},A_{t})\leftarrow Q(S_{t},A_{t})+\alpha_{t}\delta_{t},$

$\delta_{t}^{\text{S}}\overset{\text{def}}{=}R_{t+1}+\gamma Q_{t+1}-Q_{t},$

$\delta_{t}^{\text{ES}}=R_{t+1}+\gamma\mathbb{E}_{\pi}[Q(S_{t+1},\cdot)]-Q_{t},$

$G_{t}^{\lambda,\text{ES}}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{t+n},$

$G_{t}^{\lambda,\text{ES}}=R_{t+1}+\gamma\big[(1-\lambda)\bar{Q}_{t+1}+\lambda G_{t+1}^{\lambda,\text{ES}}\big].$

$R_{t:t+n}=\sum_{t=0}^{n}\gamma^{t}(P^{\mu})^{t}R_{t+1}+\gamma^{n+1}(P^{\mu})^{n}P^{\pi}Q.$

$(1-\lambda)\sum_{n=0}^{\infty}\lambda^{n}R_{t:t+n}=((1-\lambda)\mathcal{B}^{\pi}+\lambda\mathcal{B}^{\mu})Q,$

$G_{t}^{\lambda\rho,\text{ES}}=R_{t+1}+\gamma\big[(1-\lambda)\bar{Q}_{t+1}+\lambda\rho_{t+1}G_{t+1}^{\lambda\rho,\text{ES}}\big],$

$\mathbb{E}_{\mu}[G_{t}^{\lambda\rho,\text{ES}}|(S_{t},A_{t})=(s,a)]=q^{\pi}(s,a).$

$\widetilde{G}_{t}^{\lambda\rho,\text{ES}}=R_{t+1}+\gamma\Big[(1-\lambda)\bar{Q}_{t+1}+\lambda\big(\rho_{t+1}\widetilde{G}_{t+1}^{\lambda\rho,\text{ES}}+\underbrace{\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}}_{\text{control variate}}\big)\Big],$

$\mathbb{E}_{\mu}[\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}]=0$

$\widetilde{G}_{t}^{\lambda\rho,\text{ES}}=Q_{t}+\sum_{l=t}^{\infty}(\gamma\lambda)^{l-t}\delta_{l}^{\text{ES}}\rho_{t+1:l}.$

$\sum_{t=0}^{h}(\gamma\lambda)^{t}\delta_{t}^{\text{ES}}\rho_{1:t},$

$\mathcal{B}_{\lambda}^{\pi}q\overset{(a)}{=}q+(I-\lambda\gamma P^{\pi})^{-1}(\mathcal{B}^{\pi}q-q),$

$Q_{k+1}=\mathcal{B}_{\lambda}^{\pi}Q_{k}.$

$\|Q_{k}-q^{\pi}\|\leq\big(\frac{\gamma-\lambda\gamma}{1-\lambda\gamma}\big)^{k}\|Q_{0}-q^{\pi}\|.$

$\mathbb{V}\text{ar}\big[\widetilde{G}_{t}^{\lambda\rho,\text{ES}}\big]=\cdots+\gamma^{2}\lambda^{2}\mathbb{V}\text{ar}\big[v^{\pi}(s')-\bar{Q}_{t+1}\big]+\gamma^{2}\lambda^{2}\mathbb{V}\text{ar}[\Delta_{t+1}]+\gamma^{2}\lambda^{2}\mathbb{V}\text{ar}\big[\rho_{t+1}\widetilde{G}_{t+1}^{\lambda\rho,\text{ES}}\big],$

$\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}\approx 0,\quad -v^{\pi}(s')+\rho_{t+1}q^{\pi}(s',a')\approx 0.$

$\underbrace{\mathbb{V}\text{ar}[\Delta_{t+1}]}_{\text{for ES}(\lambda)\text{-CV iteration (7)}}\ll\underbrace{\mathbb{V}\text{ar}[-v^{\pi}(s')+\rho_{t+1}q^{\pi}(s',a')]}_{\text{for ES}(\lambda)\text{ iteration (6)}}.$

$\theta_{t+1}=\theta_{t}+\alpha_{t}\Big(\sum_{l=t}^{\infty}(\gamma\lambda)^{l-t}\delta_{l,\theta}^{\text{ES}}\rho_{t+1:l}\Big)\phi_{t},$

$\mathbb{E}[\theta_{t+1}|\theta_{t}]=\theta_{t}+\alpha_{t}(A\theta_{t}+b),$

Full text

Expected Sarsa( $\lambda$ ) with Control Variate for Variance Reduction

Long Yang, Yu Zhang, Jun Wen, Qian Zheng, Pengfei Li, Gang Pan

Department of Computer Science, Zhejiang University

{yanglong,hzzhangyu,junwen,qianzheng,pfl,gpan}@zju.edu.cn

Abstract

Off-policy learning is powerful for reinforcement learning. However, the high variance of off-policy evaluation is a critical challenge that can drive off-policy learning into uncontrolled instability. In this paper, to reduce the variance, we introduce the control variate technique into $\mathtt{Expected~Sarsa}(\lambda)$ and propose a tabular $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ algorithm. We prove that if a proper estimator of the value function is available, the proposed $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ enjoys a lower variance than $\mathtt{Expected~Sarsa}(\lambda)$. Furthermore, to extend $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ to a convergent algorithm with linear function approximation, we propose the $\mathtt{GES}(\lambda)$ algorithm under the convex-concave saddle-point formulation. We prove that the convergence rate of $\mathtt{GES}(\lambda)$ achieves $\mathcal{O}(1/T)$, which matches or outperforms many state-of-the-art gradient-based algorithms while requiring a more relaxed condition. Numerical experiments show that the proposed algorithm performs better, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: $\mathtt{GQ}(\lambda)$, $\mathtt{GTB}(\lambda)$ and $\mathtt{ABQ}(\zeta)$.

Introduction

Off-policy learning is powerful for reinforcement learning because it learns the target policy from data generated by another policy (?). However, high variance is a critical challenge for off-policy learning (?); it is rooted in the discrepancy between the distributions induced by the target policy and the behavior policy. The sources of high variance in off-policy learning can be divided into two parts: (I) in the tabular case, the variance has to do with the target of the update; (II) with function approximation, it has to do with the distribution of the update (?).

In this paper, we mainly focus on a variance reduction technique for an important off-policy algorithm: $\mathtt{Expected~Sarsa}(\lambda)$. We introduce a control variate into $\mathtt{Expected~Sarsa}(\lambda)$ and propose $\mathtt{Expected~Sarsa}(\lambda)$ with control variate ($\mathtt{ES}(\lambda)$-$\mathtt{CV}$) for the tabular case. The control variate method is one of the most effective variance reduction techniques in statistical inference (?). A control variate is an additional term with zero expectation, so introducing it does not change the expectation of the update. Thus, learning with a control variate does not introduce any bias, yet it has the potential to enjoy much lower variance (?; ?; ?). Sutton and Barto (?) (section 12.9) first introduced a control variate into $\mathtt{Expected~Sarsa}(\lambda)$, but their analysis is limited to linear function approximation. Later, De Asis and Sutton (?) further introduced control variates into multi-step TD learning, but their method is restricted to off-line learning (which is extremely expensive for training).

Despite being easy to implement, competitive with state-of-the-art methods, and used in practice, TD learning with the control variate technique lacks a robust theoretical analysis in RL. In this paper, we focus on the theoretical analysis of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$. We prove that the tabular $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ converges exponentially fast for off-policy evaluation without bias. Furthermore, we analyze all the sources of randomness that lead to the variance of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$, and we prove that if a proper estimator of the value function is available, $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ enjoys a lower variance than $\mathtt{Expected~Sarsa}(\lambda)$.

Furthermore, we show that the variance reduction approach presented by (?) (section 12.9) for extending $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ to function approximation is unstable. Although this instability has been noticed by Sutton and Barto (?), it remained an intuitive guess inspired by previous works (?; ?). In this paper, we provide a simple but rigorous theoretical analysis of the instability that appears in (?). We also demonstrate this instability with a typical example.

To extend $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ with function approximation into a convergent and stable algorithm, we propose the $\mathtt{GES}(\lambda)$ algorithm under the convex-concave saddle-point formulation (?). We prove that the convergence rate of $\mathtt{GES}(\lambda)$ achieves $\mathcal{O}(1/T)$, where $T$ is the number of iterations. Our $\mathcal{O}(1/T)$ rate matches or outperforms a wide range of state-of-the-art works (?; ?; ?; ?; ?; ?), under a more relaxed condition than theirs. Besides, we prove the convergence rate without assuming that the objective is strongly convex in the primal space and strongly concave in the dual space (?).

Finally, we conduct numerical experiments to show that the proposed algorithm is stable and converges faster, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: $\mathtt{GQ}(\lambda)$ (?), $\mathtt{GTB}(\lambda)$ (?), and $\mathtt{ABQ}(\zeta)$ (?).

Contributions

• We introduce the control variate technique into $\mathtt{Expected~Sarsa}(\lambda)$ and propose a tabular $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ algorithm. We prove that if a proper estimator of the value function is available, the proposed $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ enjoys a lower variance than $\mathtt{Expected~Sarsa}(\lambda)$.

• We propose $\mathtt{GES}(\lambda)$, which extends $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ to a convergent algorithm with linear function approximation. We prove that the convergence rate of $\mathtt{GES}(\lambda)$ achieves $\mathcal{O}(1/T)$, which matches or outperforms many state-of-the-art gradient-based algorithms while requiring a more relaxed condition.

Preliminary and Some Notations

In this section, we introduce the necessary notation for reinforcement learning, temporal difference learning and the $\lambda$-return. Due to space limitations, further discussion of the $\lambda$-return is deferred to Appendix A and B.

Reinforcement Learning Reinforcement learning (RL) is often formalized as a Markov decision process (MDP) (?), described by the 5-tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$. $\mathcal{S}$ is the set of all states and $\mathcal{A}$ is the set of all actions. $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$, with $P_{ss'}^{a}=\mathcal{P}(S_{t}=s'|S_{t-1}=s,A_{t-1}=a)$, is the probability of the state transition from $s$ to $s'$ when taking action $a$. $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{1}$ is the expected reward, $\mathcal{R}_{s}^{a}=\mathbb{E}[R_{t+1}|S_{t}=s,A_{t}=a]$. $\gamma\in(0,1)$ is the discount factor.

A policy is a probability distribution on $\mathcal{S}\times\mathcal{A}$. The target policy $\pi$ is the policy to be learned, and the behavior policy $\mu$ is used to generate behavior. $\tau=\{S_{t},A_{t},R_{t+1}\}_{t\geq 0}$ denotes a trajectory, where $A_{t}\sim\mu(\cdot|S_{t})$ and $S_{t+1}\sim\mathcal{P}(\cdot|S_{t},A_{t})$. For a given policy $\pi$, its state-action value function is $q^{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s,A_{t}=a]$ and its state value function is $v^{\pi}(s)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s]$, where $G_{t}=\sum_{k=0}^{\infty}\gamma^{k}R_{k+t+1}$ and $\mathbb{E}_{\pi}[\cdot|\cdot]$ denotes a conditional expectation over all actions selected according to $\pi$. It is known that $q^{\pi}(s,a)$ is the unique fixed point (?) of the Bellman operator $\mathcal{B}^{\pi}$,

$\mathcal{B}^{\pi}q^{\pi}=q^{\pi},\quad(1)$

which is known as the Bellman equation, where

$\mathcal{B}^{\pi}:q\mapsto R+\gamma P^{\pi}q,$

$P^{\pi}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ and $R\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$; the corresponding elements of $P^{\pi}$ and $R$ are:

$P^{\pi}_{ss'}=\sum_{a\in\mathcal{A}}\pi(a|s)P^{a}_{ss'},\quad R(s,a)=\mathcal{R}_{s}^{a}.$

TD Learning Temporal difference (TD) learning (?) is one of the most important methods for solving model-free RL (in which $\mathcal{P}$ is not available). For the trajectory $\tau$, TD learning is defined as, $\forall~t\geq 0$,

$Q(S_{t},A_{t})\leftarrow Q(S_{t},A_{t})+\alpha_{t}\delta_{t},\quad(2)$

where $Q(\cdot,\cdot)$ is an estimate of $q^{\pi}$, $\alpha_{t}$ is the step-size and $\delta_{t}$ is the TD error. Let $Q_{t}\overset{\text{def}}{=}Q(S_{t},A_{t})$. If $\delta_{t}$ is

$\delta_{t}^{\text{S}}\overset{\text{def}}{=}R_{t+1}+\gamma Q_{t+1}-Q_{t},$

the update (2) is the $\mathtt{Sarsa}$ algorithm (?). If $\delta_{t}$ is

$\delta_{t}^{\text{ES}}=R_{t+1}+\gamma\mathbb{E}_{\pi}[Q(S_{t+1},\cdot)]-Q_{t},\quad(3)$

the update (2) is $\mathtt{Expected~Sarsa}$ (?), where $\mathbb{E}_{\pi}[Q(S_{t+1},\cdot)]=\sum_{a\in\mathcal{A}}\pi(a|S_{t+1})Q(S_{t+1},a)$. If $\pi$ reduces to the greedy policy, then $\mathtt{Expected~Sarsa}$ reduces to $\mathtt{Q\text{-}learning}$ (?).
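The two TD errors above differ only in how the next state is evaluated. The following minimal sketch computes both for a single transition; the function names and the numbers are hypothetical, and the discount factor is applied to the expectation in the same way as in the function-approximation TD error used later in the text.

```python
def sarsa_td_error(reward, gamma, q_next, q_current):
    """Sarsa TD error: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)."""
    return reward + gamma * q_next - q_current


def expected_sarsa_td_error(reward, gamma, q_next_row, pi_next, q_current):
    """Expected Sarsa TD error: the sampled next action value is replaced
    by its expectation under the target policy pi."""
    q_bar = sum(p * q for p, q in zip(pi_next, q_next_row))
    return reward + gamma * q_bar - q_current


# Hypothetical transition with two actions available in the next state.
q_next_row = [1.0, 3.0]   # Q(S_{t+1}, a) for a = 0, 1
pi_next = [0.25, 0.75]    # pi(a | S_{t+1})

delta_s = sarsa_td_error(1.0, 0.9, q_next_row[1], 2.0)
delta_es = expected_sarsa_td_error(1.0, 0.9, q_next_row, pi_next, 2.0)
# A greedy target policy puts all mass on the maximizing action,
# which recovers the Q-learning TD error.
delta_greedy = expected_sarsa_td_error(1.0, 0.9, q_next_row, [0.0, 1.0], 2.0)
```

With a greedy target policy the expectation collapses to the maximum over actions, which is the reduction to Q-learning mentioned above.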

Expected Sarsa $(\lambda)$ The standard forward view of the $\lambda$-return (?) of on-policy $\mathtt{Expected~Sarsa}$ is defined as follows,

$G_{t}^{\lambda,\text{ES}}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{t+n},\quad(4)$

where $G_{t}^{t+n}=\sum_{i=0}^{n-1}\gamma^{i}R_{t+i+1}+\gamma^{n}\bar{Q}_{t+n}$ is the $n$-step return of $\mathtt{Expected~Sarsa}$, and $\bar{Q}_{t+n}=\mathbb{E}_{\pi}[Q(S_{t+n},\cdot)]$. We can write $G_{t}^{\lambda,\text{ES}}$ recursively as follows (the details are provided in Appendix A),

$G_{t}^{\lambda,\text{ES}}=R_{t+1}+\gamma\big[(1-\lambda)\bar{Q}_{t+1}+\lambda G_{t+1}^{\lambda,\text{ES}}\big].\quad(5)$

Now, we introduce an unbiased$^{1}$ recursive $\lambda$-return of $\mathtt{Expected~Sarsa}$ for off-policy learning,

$G_{t}^{\lambda\rho,\text{ES}}=R_{t+1}+\gamma\big[(1-\lambda)\bar{Q}_{t+1}+\lambda\rho_{t+1}G_{t+1}^{\lambda\rho,\text{ES}}\big],\quad(6)$

where $\rho_{t+1}=\pi(A_{t+1}|S_{t+1})/\mu(A_{t+1}|S_{t+1})$ is the importance sampling ratio. Eq.(6) first appeared in (?; ?), where it is limited to function approximation. We develop (6) into a general version that is conducive to the theoretical analysis in the following paragraphs. Proposition 1 below shows that $G_{t}^{\lambda\rho,\text{ES}}$ (6) is an unbiased estimate of $q^{\pi}$.

$^{1}$ How should the $\lambda$-return of $\mathtt{Expected~Sarsa}$ be defined for off-policy learning? Can we follow (4) straightforwardly? Unfortunately, in the off-policy setting this idea does not converge to $q^{\pi}$. In fact, the $n$-step return of $\mathtt{Expected~Sarsa}$ is sampled according to

$R_{t:t+n}=\sum_{t=0}^{n}\gamma^{t}(P^{\mu})^{t}R_{t+1}+\gamma^{n+1}(P^{\mu})^{n}P^{\pi}Q.$

Then, according to (4), we would define the $\lambda$-return of $\mathtt{Expected~Sarsa}$ as

$(1-\lambda)\sum_{n=0}^{\infty}\lambda^{n}R_{t:t+n}=((1-\lambda)\mathcal{B}^{\pi}+\lambda\mathcal{B}^{\mu})Q,$

which converges to $(1-\lambda)q^{\pi}+\lambda q^{\mu}\neq q^{\pi}$. This is the fixed point of $(1-\lambda)\mathcal{B}^{\pi}+\lambda\mathcal{B}^{\mu}\neq\mathcal{B}^{\pi}$, and it is a biased estimate of $q^{\pi}$.

Proposition 1.

Let $\mu$ and $\pi$ be the behavior and target policy, respectively. For the $\lambda$-return (6), we have

$\mathbb{E}_{\mu}[G_{t}^{\lambda\rho,\text{ES}}|(S_{t},A_{t})=(s,a)]=q^{\pi}(s,a).$

Due to space limitations, further discussion of the $\lambda$-return of Sarsa, Eq.(5)-(6), and the proof of Proposition 1 are provided in Appendix A and B.

Expected Sarsa( $\lambda$ ) with Control Variate

In this section, we first define $\mathtt{Expected~Sarsa}(\lambda)$ with control variate ($\mathtt{ES}(\lambda)$-$\mathtt{CV}$ for short). Then we prove the linear convergence rate of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ for policy evaluation. Finally, we analyze the variance of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$.

ES($\lambda$)-CV Algorithm

We define the $\lambda$-return of $\mathtt{Expected~Sarsa}(\lambda)$ with control variate, $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$, as follows,

$\widetilde{G}_{t}^{\lambda\rho,\text{ES}}=R_{t+1}+\gamma\Big[(1-\lambda)\bar{Q}_{t+1}+\lambda\big(\rho_{t+1}\widetilde{G}_{t+1}^{\lambda\rho,\text{ES}}+\underbrace{\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}}_{\text{control variate}}\big)\Big],\quad(7)$

where the additional term $\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}$ is called the control variate (CV). The following fact

$\mathbb{E}_{\mu}[\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}]=0$

implies that $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$ (7) extends $G_{t}^{\lambda\rho,\text{ES}}$ (6) without introducing bias.
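The zero-mean property of the control variate can be checked exactly for a single next state by enumerating the actions drawn from $\mu$. The sketch below is only an illustration: the function names and the numerical values of $Q$, $\pi$ and $\mu$ are hypothetical.

```python
def control_variate(q_row, pi, mu, action):
    """Q_bar - rho * Q(s', action) for one sampled next action."""
    q_bar = sum(p * q for p, q in zip(pi, q_row))
    rho = pi[action] / mu[action]  # importance sampling ratio
    return q_bar - rho * q_row[action]


def expectation_under_mu(q_row, pi, mu):
    """Exact expectation of the control variate when the action follows mu."""
    return sum(mu[a] * control_variate(q_row, pi, mu, a)
               for a in range(len(mu)))


# Hypothetical next state with three actions.
q_row = [0.5, -1.0, 2.0]   # current estimates Q(s', a)
pi = [0.7, 0.2, 0.1]       # target policy pi(a | s')
mu = [0.2, 0.3, 0.5]       # behavior policy mu(a | s')

cv_sample = control_variate(q_row, pi, mu, 0)  # a single sample is not zero
cv_mean = expectation_under_mu(q_row, pi, mu)  # but the mean under mu is
```

Each sampled value of the control variate is generally nonzero; only its average under $\mu$ vanishes, because $\sum_{a}\mu(a|s')\rho(a)Q(s',a)=\sum_{a}\pi(a|s')Q(s',a)=\bar{Q}$.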

Theorem 1 (Forward View of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$).

Let $\rho_{t:k}=\prod_{i=t}^{k}\rho_{i}$ denote the cumulative importance sampling ratio from time $t$ to $k$, with the convention $\rho_{t+1:t}=1$. The recursive $\lambda$-return in Eq.(7) is equivalent to the following forward view: let $\delta_{l}^{\text{ES}}$ be the TD error defined in (3), $G_{t}^{t}=Q_{t}$, and $G_{t}^{t+n}=R_{t+1}+\gamma(\rho_{t+1}G_{t+1}^{t+n}+\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1})$; then

$\widetilde{G}_{t}^{\lambda\rho,\text{ES}}=Q_{t}+\sum_{l=t}^{\infty}(\gamma\lambda)^{l-t}\delta_{l}^{\text{ES}}\rho_{t+1:l}.\quad(8)$

Proof.

See Appendix C. ∎
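The equivalence stated in Theorem 1 can be checked numerically on a short trajectory. The sketch below assumes a finite horizon with the bootstrap $\widetilde{G}_{h+1}=Q_{h+1}$ (so the tail of the sum vanishes) and a TD error with the discount applied to $\bar{Q}_{t+1}$; the function names and all numbers are hypothetical.

```python
def cv_return_recursive(rewards, q, q_bar, rho, gamma, lam):
    """Backward pass of the recursive lambda-return with control variate.

    rewards[t] holds R_{t+1}; q, q_bar and rho are indexed 0..h+1.
    Assumption: the recursion is closed with G_{h+1} = Q_{h+1}.
    """
    h = len(rewards) - 1
    g = q[h + 1]
    returns = [0.0] * (h + 1)
    for t in range(h, -1, -1):
        cv = q_bar[t + 1] - rho[t + 1] * q[t + 1]  # control variate
        g = rewards[t] + gamma * ((1 - lam) * q_bar[t + 1]
                                  + lam * (rho[t + 1] * g + cv))
        returns[t] = g
    return returns


def cv_return_forward(rewards, q, q_bar, rho, gamma, lam):
    """Forward view: Q_t + sum_l (gamma*lam)^(l-t) * delta_l * rho_{t+1:l}."""
    h = len(rewards) - 1
    returns = []
    for t in range(h + 1):
        total = q[t]
        weight = 1.0  # (gamma*lam)^(l-t) * rho_{t+1:l}, equal to 1 at l = t
        for l in range(t, h + 1):
            delta = rewards[l] + gamma * q_bar[l + 1] - q[l]
            total += weight * delta
            weight *= gamma * lam * rho[l + 1]
        returns.append(total)
    return returns


# Hypothetical trajectory of length 4 (indices 0..3, plus a bootstrap entry).
rewards = [1.0, 0.0, -1.0, 2.0]
q = [0.5, 1.0, -0.5, 2.0, 0.3]        # Q(S_t, A_t)
q_bar = [0.4, 0.8, -0.2, 1.5, 0.1]    # E_pi[Q(S_t, .)]
rho = [1.0, 0.5, 2.0, 1.2, 0.8]       # rho[0] is unused

rec = cv_return_recursive(rewards, q, q_bar, rho, 0.9, 0.7)
fwd = cv_return_forward(rewards, q, q_bar, rho, 0.9, 0.7)
one_step = cv_return_recursive(rewards, q, q_bar, rho, 0.9, 0.0)
```

Setting $\lambda=0$ removes all importance sampling ratios and leaves the one-step Expected Sarsa target $R_{t+1}+\gamma\bar{Q}_{t+1}$.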

Remark 1.

Eq.(8) illustrates that for a given finite horizon trajectory $\{S_{t},A_{t},R_{t+1}\}_{t=0}^{h}$, the total update (7) reaches

$\sum_{t=0}^{h}(\gamma\lambda)^{t}\delta_{t}^{\text{ES}}\rho_{1:t},$

which is the off-line update of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$.

Policy Evaluation

For policy evaluation, our goal is to estimate $q^{\pi}$ from the trajectory collection $\mathcal{T}=\{\tau_{k}\}_{k\in\mathbb{N}}$, where $\tau_{k}=\{S_{t},A_{t},R_{t+1}\}_{t\geq 0}\sim\mu$. Strictly speaking, $S_{t},A_{t}$, and $R_{t+1}$ depend on the index $k$; we omit $k$ to lighten the notation, without ambiguity.

The following $\lambda$-operator $\mathcal{B}^{\pi}_{\lambda}$ is a high-level view of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ (8), and it helps us introduce the policy evaluation algorithm. $\forall~q\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|},t\geq 0$,

$\mathcal{B}^{\pi}_{\lambda}q\overset{(a)}{=}q+(I-\lambda\gamma P^{\pi})^{-1}(\mathcal{B}^{\pi}q-q),$

where $\mathcal{B}^{\pi}$ is defined in Eq.(1). We provide the equivalence (a) in Appendix D.

Theorem 2 (Policy Evaluation).

For any initial $Q_{0}$, consider the trajectory collection $\mathcal{T}$ generated by $\mu$, and let $Q_{k}$ be generated from the $k$-th trajectory $\tau_{k}\in\mathcal{T}$, $k\geq 1$, as

$Q_{k+1}=\mathcal{B}^{\pi}_{\lambda}Q_{k}.$

After iterating over $k$ trajectories, the error of policy evaluation is upper bounded by

$\|Q_{k}-q^{\pi}\|\leq\Big(\frac{\gamma-\lambda\gamma}{1-\lambda\gamma}\Big)^{k}\|Q_{0}-q^{\pi}\|.\quad(13)$

Proof.

See Appendix E. ∎

Remark 2.

The forward view (off-line update) of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ (8) can be seen as a sample of the iteration $Q_{t+1}=\mathcal{B}^{\pi}_{\lambda}Q_{t}$. For any $\gamma\in(0,1)$ and $\lambda\in[0,1]$, we have $\frac{\gamma-\lambda\gamma}{1-\lambda\gamma}\in[0,1)$ (the value $0$ is attained at $\lambda=1$), thus Eq.(13) implies that (8) converges to $q^{\pi}$ at a linear rate.
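To make the linear rate in Eq.(13) concrete, the sketch below evaluates the contraction factor and the number of applications of the operator needed to shrink the initial error by a given tolerance. The helper names and the chosen values of $\gamma$, $\lambda$ and the tolerance are illustrative assumptions.

```python
import math


def contraction_factor(gamma, lam):
    """The factor (gamma - lam*gamma) / (1 - lam*gamma) from Eq.(13)."""
    return (gamma - lam * gamma) / (1.0 - lam * gamma)


def sweeps_needed(gamma, lam, tol):
    """Smallest k with factor**k <= tol, for a tolerance tol in (0, 1)."""
    c = contraction_factor(gamma, lam)
    if c == 0.0:  # lam = 1: a single application already gives zero error
        return 1
    return math.ceil(math.log(tol) / math.log(c))


gamma, tol = 0.9, 1e-3
factors = [contraction_factor(gamma, lam) for lam in (0.0, 0.5, 0.9, 1.0)]
sweeps = [sweeps_needed(gamma, lam, tol) for lam in (0.0, 0.5, 0.9, 1.0)]
```

For a fixed $\gamma$ the factor decreases from $\gamma$ at $\lambda=0$ to $0$ at $\lambda=1$, so larger $\lambda$ needs fewer iterations in this idealized, expectation-level view.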

Variance Analysis

Theorem 3 (Variance Analysis of $\mathtt{ES}(\lambda)$-$\mathtt{CV}$).

Consider a single trajectory $\tau_{k}$ with finite horizon $H+1$, let $S_{t}=s,A_{t}=a,S_{t+1}=s',A_{t+1}=a'$, and $\mathbb{V}\text{ar}\big[\widetilde{G}_{H+1}^{\lambda\rho,\text{ES}}\big]=0$. The variance of $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$ is given recursively as follows,

[TABLE]

where $\Delta_{t+1}=\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}-v^{\pi}(s^{{}^{\prime}})+\rho_{t+1}q^{\pi}(s^{{}^{\prime}},a^{{}^{\prime}})$ .

Proof.

See Appendix F. ∎

Now, let us illustrate the significance of Eq.(14).

(I) It identifies all the sources of randomness that lead to the variance. The first three terms reveal that the variance of $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$ is caused by the following factors, respectively: the error of one-step $\mathtt{Expected~Sarsa}$ for policy evaluation, the error between $\bar{Q}_{t+1}$ and the true value $v^{\pi}$, and the randomness of the state-action transition. The last term in (14) is the variance of future time steps.

(II) Notice that if the CV term $\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}$ (in $\Delta_{t+1}$) vanishes, i.e. $\Delta_{t+1}=-v^{\pi}(s')+\rho_{t+1}q^{\pi}(s',a')$, Eq.(14) reduces to the recursive variance of $G_{t}^{\lambda\rho,\text{ES}}$ (6). Thus, by Eq.(14), comparing the variance of $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$ with that of $G_{t}^{\lambda\rho,\text{ES}}$ amounts to comparing the variance of $\Delta_{t+1}$.

Furthermore, if a good estimator of $q^{\pi}$ is available, the following two events happen:

1. For $\mathtt{ES}(\lambda)$-$\mathtt{CV}$, the term $\Delta_{t+1}\approx 0$, since for a proper estimate of $q^{\pi}$ the following holds,

[TABLE]

2. In contrast, for $\mathtt{ES}(\lambda)$, $\Delta_{t+1}=-v^{\pi}(s')+\rho_{t+1}q^{\pi}(s',a')$, which is never $0$, no matter how good an estimate of $q^{\pi}$ we achieve.

Thus, if a good estimator of $q^{\pi}$ is available, we have

$\underbrace{\mathbb{V}\text{ar}[\Delta_{t+1}]}_{\text{for ES}(\lambda)\text{-CV iteration (7)}}\ll\underbrace{\mathbb{V}\text{ar}[-v^{\pi}(s')+\rho_{t+1}q^{\pi}(s',a')]}_{\text{for ES}(\lambda)\text{ iteration (6)}}.$

Thus $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$ enjoys a lower variance than $G_{t}^{\lambda\rho,\text{ES}}$.
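The comparison of $\mathbb{V}\text{ar}[\Delta_{t+1}]$ with and without the control variate can be illustrated exactly for one fixed next state $s'$, where the only randomness is the next action $a'\sim\mu$. This is a simplified sketch, not the full recursion (14); the function names, the policies and the value estimates are hypothetical.

```python
def exact_variance(values, probs):
    """Variance of a discrete random variable given its values and probabilities."""
    mean = sum(p * v for p, v in zip(probs, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(probs, values))


def delta_values(q_est, q_true, pi, mu, use_cv):
    """Delta_{t+1} for each possible next action a', at a fixed next state s'."""
    v_true = sum(p * q for p, q in zip(pi, q_true))   # v^pi(s')
    q_bar = sum(p * q for p, q in zip(pi, q_est))     # E_pi[Q(s', .)]
    values = []
    for a in range(len(mu)):
        rho = pi[a] / mu[a]
        cv = q_bar - rho * q_est[a] if use_cv else 0.0
        values.append(cv - v_true + rho * q_true[a])
    return values


pi = [0.9, 0.1]
mu = [0.5, 0.5]
q_true = [1.0, 3.0]   # assumed true values q^pi(s', a)
q_est = [1.1, 2.8]    # a reasonably good estimate Q(s', a)

var_plain = exact_variance(delta_values(q_est, q_true, pi, mu, False), mu)
var_cv = exact_variance(delta_values(q_est, q_true, pi, mu, True), mu)
var_cv_exact = exact_variance(delta_values(q_true, q_true, pi, mu, True), mu)
```

In this toy setting the variance without the control variate does not depend on the quality of the estimate, while with the control variate it shrinks as $Q$ approaches $q^{\pi}$ and is zero when the estimate is exact.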

Numerical Analysis

We use an experiment to verify that the CV is effective at reducing the variance of $\mathtt{ES}(\lambda)$ for the off-policy evaluation task. In this experiment, the target policy $\pi$ is the greedy policy, whose value is learned by $\mathtt{Q\text{-}learning}$ with an $\epsilon_{k}$-greedy policy, where $\epsilon_{k}$ decays as $\epsilon_{k+1}=0.95\epsilon_{k}$, $\epsilon_{1}=0.2$. After 150 episodes, $\epsilon_{150}\approx 0$, and the value of the target policy $\pi$ is around $-20$. We use a $0.2$-greedy policy as the behavior policy $\mu$. All algorithms use step-size $\alpha_{k}=0.5$ and $\lambda=0.95$.
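The claim $\epsilon_{150}\approx 0$ follows directly from the geometric decay. A small sketch of the schedule, with a hypothetical helper name:

```python
def epsilon_schedule(eps_1, decay, k):
    """epsilon_k under epsilon_{k+1} = decay * epsilon_k, starting from eps_1."""
    return eps_1 * decay ** (k - 1)


eps_1 = epsilon_schedule(0.2, 0.95, 1)
eps_150 = epsilon_schedule(0.2, 0.95, 150)  # 0.2 * 0.95**149, roughly 1e-4
```

After 150 episodes the exploration probability is therefore small enough that the $\epsilon_{k}$-greedy policy is practically greedy.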

Gradient Expected Sarsa( $\lambda$ )

In this section, we extend $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ to linear function approximation. First, we prove that the way of extending $\mathtt{ES}(\lambda)$-$\mathtt{CV}$ to function approximation given by (?) (section 12.9) is unstable. Then, we propose a convergent gradient $\mathtt{Expected~Sarsa}(\lambda)$.

The Bellman equation (1) cannot be solved directly by tabular methods when $\mathcal{S}$ is large. We often use a parametric function to approximate $q^{\pi}(s,a)\approx\phi^{\top}(s,a)\theta={Q}_{\theta}(s,a),$ where $\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{p}$ is a feature map. Then ${Q}_{\theta}$ can be written in matrix form as ${Q}_{\theta}=\Phi\theta\approx q^{\pi},$ where $\Phi$ is a $|\mathcal{S}||\mathcal{A}|\times p$ matrix whose rows are $\phi(s,a)$. We assume that the Markov chain induced by the behavior policy $\mu$ is ergodic (?), i.e. there exists a stationary distribution $\xi$ such that $\forall(S_{0},A_{0})\in\mathcal{S}\times\mathcal{A}$, $\frac{1}{n}\sum_{k=1}^{n}P(S_{k}=s,A_{k}=a|S_{0},A_{0})\overset{n\rightarrow\infty}{\rightarrow}\xi(s,a).$ We denote by $\Xi$ the $|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}||\mathcal{A}|$ diagonal matrix whose diagonal elements are $\xi(s,a)$.

Instability of ES( $\lambda$ ) with Function Approximation

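A typical update extending (8) has been presented in (?) (section 12.9),

$\theta_{t+1}=\theta_{t}+\alpha_{t}\Big(\sum_{l=t}^{\infty}(\gamma\lambda)^{l-t}\delta_{l,\theta}^{\text{ES}}\rho_{t+1:l}\Big)\phi_{t},\quad(15)$

where $\alpha_{t}$ is the step-size, $\delta_{l,\theta}^{\text{ES}}=R_{l+1}+\gamma\theta^{\top}_{l}\mathbb{E}_{\pi}[\phi(S_{l+1},\cdot)]-\theta^{\top}_{l}\phi_{l}$, and $\phi_{l}$ is short for $\phi(S_{l},A_{l})$. Once the system (15) has reached a stable state, for any $\theta_{t}$, the expected parameter can be written as

$\mathbb{E}[\theta_{t+1}|\theta_{t}]=\theta_{t}+\alpha_{t}(A\theta_{t}+b),\quad(16)$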

where

[TABLE]

If the system (16) converges, then $\theta_{t}$ converges to the TD fixed point $\theta^{*}$ that satisfies $A\theta^{*}+b=0.$

What condition guarantees the convergence of (15)/(16)? The instability of (15) for off-policy learning was first noticed by Sutton and Barto (?), but only as an intuitive guess inspired by previous works. We now provide a simple but rigorous theoretical analysis to illustrate the divergence of Eq.(15). It is known that for on-policy learning, $\mu=\pi$, $A$ is a negative definite matrix (?). Thus, for on-policy learning, (15) converges to $-A^{-1}b$. However, for off-policy learning, the steady state-action distribution does not match the transition probability, i.e. $P^{\pi}\xi\neq\xi$, so there is no guarantee that $A$ is negative definite (?). Thus (15) may diverge.

An Unstable Example We now use a typical example (?) to illustrate the instability of iteration (15). The state transitions of the example are presented in Figure 2. After some simple algebra (the details are provided in Appendix G), we have $A=\begin{pmatrix}\frac{6\gamma-\gamma\lambda-5}{2(1-\gamma\lambda)}&0\\ \frac{3\gamma}{2}&-\frac{5}{2}\end{pmatrix}$. For any $\theta_{0}=(\theta_{0,1},\theta_{0,2})^{\top}$ and a positive constant step-size $\alpha$, according to (16), we have

[TABLE]

For any $\lambda\in(0,1)$ and $\gamma\in(\frac{5}{6-\lambda},1)$, $\frac{6\gamma-\gamma\lambda-5}{2(1-\gamma\lambda)}$ is a positive scalar, so $A$ cannot be a negative definite matrix. Furthermore, according to (16),

[TABLE]
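The sign condition on the first diagonal entry of $A$ is easy to probe numerically. Because the upper-right entry of $A$ is zero, the first component of the expected update (16) evolves on its own. The sketch below assumes, purely for illustration, that the corresponding component of $b$ is zero; the function names and parameter values are hypothetical.

```python
def a11(gamma, lam):
    """First diagonal entry of A in the two-state example."""
    return (6 * gamma - gamma * lam - 5) / (2 * (1 - gamma * lam))


def iterate_first_component(theta0, gamma, lam, alpha, steps):
    """Iterate theta <- theta + alpha * A_11 * theta.

    Assumption for illustration only: the first component of b is zero.
    """
    theta = theta0
    growth = 1.0 + alpha * a11(gamma, lam)
    for _ in range(steps):
        theta *= growth
    return theta


lam = 0.5
threshold = 5 / (6 - lam)   # A_11 > 0 exactly when gamma exceeds this value
diverging = iterate_first_component(1.0, 0.99, lam, 0.1, 100)
shrinking = iterate_first_component(1.0, 0.80, lam, 0.1, 100)
```

With $\gamma$ above the threshold the iterate grows geometrically for any positive step-size, whereas below the threshold it contracts for small enough step-sizes.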

Convergent Algorithm

The above discussion of instability in off-policy learning shows that we should abandon the update (15). In this section, we propose a convergent gradient $\mathtt{ES}(\lambda)$ algorithm.

We solve the problem via the mean square projected Bellman error (MSPBE) (?),

[TABLE]

where $\Pi=\Phi(\Phi^{\top}\Xi\Phi)^{-1}\Phi^{\top}\Xi$ is an $|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}||\mathcal{A}|$ projection matrix. Furthermore, MSPBE$(\theta,\lambda)$ can be rewritten as,

$\text{MSPBE}(\theta,\lambda)=\frac{1}{2}\|A\theta+b\|^{2}_{M^{-1}},\quad(22)$

where $M=\mathbb{E}[\phi_{t}\phi_{t}^{\top}]=\Phi^{\top}\Xi\Phi$. The derivation of (22) is provided in Appendix H.

Computing the inverse matrix $M^{-1}$ costs at least $\mathcal{O}(p^{3})$ (?), where $p$ is the dimension of the feature space. Thus, it is too expensive to solve problem (22) directly by gradient updates. Besides, as pointed out in (?; ?), we cannot get an unbiased estimate of $\nabla_{\theta}\text{MSPBE}(\theta,\lambda)=A^{\top}M^{-1}(A\theta+b)$. First, since the gradient involves a product of expectations, an unbiased estimate cannot be obtained from a single sample; it requires sampling twice, which is known as the double sampling problem. Second, $M^{-1}=\mathbb{E}[\phi_{t}\phi_{t}^{\top}]^{-1}$ cannot be estimated from a single sample either, which is the second bottleneck in applying stochastic gradient methods to problem (22).

A practical way is to convert (22) into a convex-concave saddle-point problem (?). For $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, its convex conjugate (?) function $f^{*}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is defined as

$f^{*}(y)=\sup_{x\in\mathbb{R}^{d}}\big(y^{\top}x-f(x)\big).$

By $(\frac{1}{2}\|x\|^{2}_{M})^{*}=\frac{1}{2}\|y\|^{2}_{M^{-1}}$, we have $\frac{1}{2}\|y\|^{2}_{M^{-1}}=\max_{\omega}(y^{\top}\omega-\frac{1}{2}\|\omega\|^{2}_{M}).$ Thus, (22) is equivalent to the following convex-concave saddle-point problem,

$\min_{\theta}\max_{\omega}\Big\{(A\theta+b)^{\top}\omega-\frac{1}{2}\|\omega\|^{2}_{M}\Big\}.\quad(23)$

It is easy to see that if $(\theta^{*},\omega^{*})$ is the solution of problem (23), then $\theta^{*}=\arg\min_{\theta}\text{MSPBE}(\theta,\lambda)$. In fact, let $\omega^{*}=\arg\max_{\omega}\{(A\theta+b)^{\top}\omega-\frac{1}{2}\|\omega\|_{M}^{2}\}$; then $\omega^{*}=M^{-1}(A\theta+b)$. Substituting $\omega^{*}$ into (23) reduces it to $\min_{\theta}\frac{1}{2}\|A\theta+b\|^{2}_{M^{-1}}$, which shows that the solution of (22) is contained in that of (23). Gradient updates are a natural way to solve problem (23) (ascending in $\omega$ and descending in $\theta$), as follows,

$\omega_{t+1}=\omega_{t}+\beta_{t}(A\theta_{t}+b-M\omega_{t}),\quad(24)$

$\theta_{t+1}=\theta_{t}-\alpha_{t}A^{\top}\omega_{t},\quad(25)$

where $\alpha_{t},\beta_{t}$ are step-sizes, $t\geq 0$.
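As a sanity check of the ascent in $\omega$ and descent in $\theta$ on the saddle-point objective $(A\theta+b)^{\top}\omega-\frac{1}{2}\|\omega\|_{M}^{2}$, the sketch below runs the deterministic gradient iteration on a scalar instance. The scalar values of $A$, $b$, $M$ and the step-sizes are hypothetical and chosen only so that the iteration is easy to follow; the saddle point is then $\theta^{*}=-b/A$ and $\omega^{*}=0$.

```python
def primal_dual(a, b, m, alpha, beta, steps, theta=0.0, omega=0.0):
    """Simultaneous gradient descent in theta and ascent in omega
    on the scalar objective (a*theta + b)*omega - 0.5*m*omega**2."""
    for _ in range(steps):
        grad_omega = a * theta + b - m * omega   # partial derivative w.r.t. omega
        grad_theta = a * omega                   # partial derivative w.r.t. theta
        theta, omega = theta - alpha * grad_theta, omega + beta * grad_omega
    return theta, omega


# Hypothetical scalar problem: A = -1, b = 2, M = 1, so theta* = 2 and omega* = 0.
theta_final, omega_final = primal_dual(a=-1.0, b=2.0, m=1.0,
                                       alpha=0.1, beta=0.1, steps=1000)
```

Here $\sqrt{\alpha\beta}\,|A|=0.1<1$, and the iterates spiral into the saddle point because the linear map governing the error has complex eigenvalues of modulus $\sqrt{0.91}<1$.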

Stochastic On-line Implementation However, $A,b$, and $M$ are expectations, and in model-free RL the transition probabilities are not available. A practical way is to find unbiased estimators of them. Let $e_{0}=0,\rho_{t}=\frac{\pi(A_{t}|S_{t})}{\mu(A_{t}|S_{t})},e_{t}=\lambda\gamma\rho_{t}e_{t-1}+\phi_{t},\hat{b}_{t}=R_{t+1}e_{t},\hat{A}_{t}=e_{t}(\gamma\mathbb{E}_{\pi}[\phi(S_{t+1},\cdot)]-\phi_{t})^{\top},\hat{M}_{t}=\phi_{t}\phi^{\top}_{t}$. By Theorem 9 in (?), we have

$\mathbb{E}[\hat{A}_{t}]=A,\quad\mathbb{E}[\hat{b}_{t}]=b,\quad\mathbb{E}[\hat{M}_{t}]=M.\quad(26)$

Replacing the expectations in (24) and (25) by the corresponding unbiased estimates, we define the stochastic on-line implementation of (24) and (25) as follows,

$\omega_{t+1}=\omega_{t}+\beta_{t}(\hat{A}_{t}\theta_{t}+\hat{b}_{t}-\hat{M}_{t}\omega_{t}),\quad\theta_{t+1}=\theta_{t}-\alpha_{t}\hat{A}_{t}^{\top}\omega_{t}.\quad(27)$

More details are summarized in Algorithm 1.
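One step of the stochastic on-line implementation can be sketched without forming any matrix, since $\hat{A}_{t}\theta_{t}+\hat{b}_{t}=e_{t}\delta_{t}$, $\hat{M}_{t}\omega_{t}=\phi_{t}(\phi_{t}^{\top}\omega_{t})$ and $\hat{A}_{t}^{\top}\omega_{t}=(\gamma\mathbb{E}_{\pi}[\phi(S_{t+1},\cdot)]-\phi_{t})(e_{t}^{\top}\omega_{t})$. The code below is a sketch under stated assumptions, not a transcription of Algorithm 1: it assumes the stochastic update is the gradient ascent-descent step on (23) with $A$, $b$, $M$ replaced by the estimators defined above, and the function name `ges_step`, its argument names and the toy features are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def ges_step(theta, omega, e_prev, phi, phi_bar_next, reward, rho,
             gamma, lam, alpha, beta):
    """One stochastic primal-dual step; returns (theta, omega, trace)."""
    # Eligibility trace: e_t = lam * gamma * rho_t * e_{t-1} + phi_t
    e = [lam * gamma * rho * ep + p for ep, p in zip(e_prev, phi)]
    # d = gamma * E_pi[phi(S_{t+1}, .)] - phi_t, so that A_hat = e d^T
    d = [gamma * pb - p for pb, p in zip(phi_bar_next, phi)]
    # A_hat theta + b_hat = e * delta, with delta the expected TD error
    delta = reward + dot(d, theta)
    phi_omega = dot(phi, omega)   # M_hat omega = phi * (phi^T omega)
    e_omega = dot(e, omega)       # A_hat^T omega = d * (e^T omega)
    new_omega = [w + beta * (ei * delta - p * phi_omega)
                 for w, ei, p in zip(omega, e, phi)]
    new_theta = [th - alpha * di * e_omega for th, di in zip(theta, d)]
    return new_theta, new_omega, e


# Two hypothetical transitions with 2-dimensional features.
theta, omega, trace = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
theta, omega, trace = ges_step(theta, omega, trace, [1.0, 0.0], [0.0, 1.0],
                               reward=1.0, rho=1.0, gamma=0.9, lam=0.5,
                               alpha=0.1, beta=0.1)
first_omega = list(omega)
theta, omega, trace = ges_step(theta, omega, trace, [1.0, 0.0], [0.0, 1.0],
                               reward=1.0, rho=2.0, gamma=0.9, lam=0.5,
                               alpha=0.1, beta=0.1)
```

Each step costs $\mathcal{O}(p)$ in the feature dimension, which is the practical reason for working with the saddle-point form instead of inverting $M$.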

Convergence Analysis

We measure the convergence rate of problem (23) by primal-dual gap error (?). Let

[TABLE]

the primal-dual gap error at each solution $(\omega,\theta)$ is

[TABLE]

Theorem 4 (Convergence of Algorithm 1).

Consider the sequence $\{(\theta_{t},\omega_{t})\}_{t=1}^{T}$ generated by (27), where the step-sizes $\alpha,\beta$ are positive constants. Let $(\theta^{*},\omega^{*})$ be the optimal solution of (23), $\bar{\theta}_{T}=\frac{1}{T}(\sum_{t=1}^{T}\theta_{t})$, $\bar{\omega}_{T}=\frac{1}{T}(\sum_{t=1}^{T}\omega_{t})$, and choose the step-sizes $\alpha,\beta$ to satisfy $1-\sqrt{\alpha\beta}\|A\|_{*}>0$, where $\|A\|_{*}=\sup_{\|x\|=1}\|Ax\|$ is the operator norm. If the parameter $(\theta,\omega)$ lies in a bounded set $D_{\theta}\times D_{\omega}$, i.e. diam $D_{\theta}=\sup\{\|\theta_{1}-\theta_{2}\|;\theta_{1},\theta_{2}\in D_{\theta}\}<\infty$ and diam $D_{\omega}<\infty$, then $\mathbb{E}[\epsilon_{\Psi}(\bar{\theta}_{T},\bar{\omega}_{T})]$ is upper bounded by:

[TABLE]

Proof.

See Appendix I. ∎

Remark 3.

Theorem 4 illustrates that (I) when $\alpha=\beta=\mathcal{O}(\frac{1}{\sqrt{T}})$, the overall convergence rate of $\mathbb{E}[\epsilon_{\Psi}(\bar{\theta}_{T},\bar{\omega}_{T})]$ is $\mathcal{O}(\frac{1}{\sqrt{T}})$, which matches the worst-case rate of black-box oriented sub-gradient methods (?); (II) when $\alpha=\beta=\mathcal{O}(1)$, i.e. a positive constant, $\mathbb{E}[\epsilon_{\Psi}(\bar{\theta}_{T},\bar{\omega}_{T})]=\mathcal{O}(\dfrac{1}{T}).$
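The step-size condition $1-\sqrt{\alpha\beta}\|A\|_{*}>0$ of Theorem 4 is straightforward to check once the operator norm is known. The sketch below computes the operator (spectral) norm of a $2\times 2$ matrix in closed form and tests candidate constant step-sizes; the helper names and the example matrices are hypothetical.

```python
import math


def spectral_norm_2x2(a):
    """Operator norm sup_{|x|=1} |Ax| of a 2x2 matrix given as nested lists,
    i.e. the square root of the largest eigenvalue of A^T A."""
    (a11, a12), (a21, a22) = a
    p = a11 * a11 + a21 * a21   # (A^T A)[0][0]
    q = a11 * a12 + a21 * a22   # (A^T A)[0][1]
    r = a12 * a12 + a22 * a22   # (A^T A)[1][1]
    lam_max = (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q * q)
    return math.sqrt(lam_max)


def step_sizes_admissible(alpha, beta, op_norm):
    """Check the condition 1 - sqrt(alpha * beta) * ||A|| > 0."""
    return 1 - math.sqrt(alpha * beta) * op_norm > 0


norm_diag = spectral_norm_2x2([[3.0, 0.0], [0.0, 4.0]])
norm_full = spectral_norm_2x2([[1.0, 2.0], [3.0, 4.0]])
ok_small = step_sizes_admissible(0.1, 0.1, norm_diag)
ok_large = step_sizes_admissible(0.5, 0.5, norm_diag)
```

For equal step-sizes the condition reads $\alpha<1/\|A\|_{*}$, so it bounds the constant step-size from above without requiring any decay.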

Related Works and Comparison

Liu et al.(?) firstly derives $\mathtt{GTD}$ via convex-concave saddle-point formulation, and they prove the convergence rate reaches $\mathbb{E}[\epsilon_{\Psi}(\tilde{\theta}_{T},\tilde{\omega}_{T})]=\mathcal{O}(\frac{1}{\sqrt{T}})$ , where $\tilde{\theta}_{T}$ is Polyak-average: $\tilde{\theta}_{T}=\dfrac{\sum_{t=1}^{T}\alpha_{t}\theta_{t}}{\sum_{t=1}^{T}\alpha_{t}}$ , $\tilde{\omega}_{T}=\dfrac{\sum_{t=1}^{T}\alpha_{t}\omega_{t}}{\sum_{t=1}^{T}\alpha_{t}}$ . Their $\mathtt{GTD}$ requires each $\theta_{t},\omega_{t}$ is projected into the space $D_{\theta},D_{\omega}$ . Later, Wang et al.(?) extends the work of Liu et al.(?), they suppose the data is generated from Markov processes rather than I.I.D assumption. Wang et al.(?) prove the convergence rate $\mathbb{E}[\epsilon_{\Psi}(\tilde{\theta}_{T},\tilde{\omega}_{T})]=\mathcal{O}(\frac{\sum_{t=1}^{T}\alpha^{2}_{t}}{\sum_{t=1}^{T}\alpha_{t}})$ , the best convergence rate reaches $\mathcal{O}(\frac{1}{\sqrt{T}})$ , where the step-size satisfies $\sum_{t=1}^{\infty}\alpha_{t}=\infty$ , $\frac{\sum_{t=1}^{T}\alpha^{2}_{t}}{\sum_{t=1}^{T}\alpha_{t}}\leq\infty$ and $(\tilde{\theta}_{T},\tilde{\omega}_{T})$ is also Polyak-average, the same as (?). Besides, the $\mathtt{GTD}$ of Wang et al.(?) also require projecting the parameter into the space $D_{\theta},D_{\omega}$ .

Both Polyak-averaging and projection make the implementation of gradient TD learning more difficult. Compared with (?; ?), our $\mathtt{GES(\lambda)}$ removes Polyak-averaging and projection while reaching a faster convergence rate.

Recently, (?) proves that the $\mathtt{GTD(0)}$ family (?; ?) converges at $\mathcal{O}((\frac{1}{T})^{\frac{1-\kappa}{3}})$, where $\kappa\in(0,1)$, but never reaches $\mathcal{O}(\frac{1}{T})$. Nathaniel and Prashanth (?) prove that $\mathtt{TD(0)}$ (?) converges at $\mathcal{O}(\frac{1}{\sqrt{T}})$ with step-size $\alpha_{t}=\mathcal{O}(\frac{1}{t^{\eta}})$, where $\eta\in(0,1)$. Then, Dalal et al. (?) further explore the properties of $\mathtt{TD(0)}$ and prove that the convergence rate achieves $\mathcal{O}(e^{-\frac{\sigma}{2}T^{1-\eta}}+\frac{1}{T^{\eta}})$, where $\eta\in(0,1)$ and $\sigma$ is the minimum eigenvalue of the matrix $A^{\top}+A$, but never reaches $\mathcal{O}(\frac{1}{T})$.

Compared with all the above works, we improve the optimal convergence rate to $\mathcal{O}(\dfrac{1}{T})$ with a more relaxed step-size than theirs. Besides, although $\mathtt{GTB}(\lambda)$ / $\mathtt{GRetrace}(\lambda)$ (?) reaches the same convergence rate as ours, their result depends on a decaying step-size.

More details of the convergence rate of gradient temporal difference learning are summarized in Table 1.

Experiments

In this section, we employ three typical domains to test the capacity of $\mathtt{GES}(\lambda)$ for off-policy evaluation: Mountaincar, Baird Star (?), and Two-state MDP (?). We compare $\mathtt{GES}(\lambda)$ with three state-of-the-art algorithms: $\mathtt{GQ}(\lambda)$ (?), $\mathtt{ABQ}(\zeta)$ (?), and $\mathtt{GTB}(\lambda)$ (?). We choose these three methods as baselines because they all learn from the expected TD-error $\delta_{t}^{\text{ES}}$, the same as $\mathtt{GES}(\lambda)$. Due to space limitations, we present some details of the experiments in Appendix J.

The Effect of Step-size

In this section, we verify the convergence result presented in Theorem 4/Remark 3. We use the empirical

[TABLE]

to evaluate the performance of all the algorithms, where we evaluate $\hat{A}$, $\hat{b}$, and $\hat{M}$ according to their unbiased estimates by the Monte Carlo method with 5000 episodes. In particular, for Mountaincar, to collect the samples, we run $\mathtt{Sarsa}$ with $p=128$ features to obtain a stable policy. Then, we use this policy to collect trajectories that comprise the samples.

Figure 3 compares the empirical MSPBE performance under a constant step-size and under the decaying step-size $\frac{1}{\sqrt{t}}$. The result in Figure 3 shows that $\mathtt{GES}(\lambda)$ with a proper constant step-size converges significantly faster than learning with step-size $\frac{1}{\sqrt{t}}$, which supports our theoretical analysis in Remark 3.

Comparison of Empirical MSPBE

The MSPBE distribution is computed over the combination of step-sizes, $(\alpha_{k},\frac{\beta_{k}}{\alpha_{k}})\in[0.1\times 2^{j}|j=-10,-9,\cdots,-1,0]^{2}$, and we set $\lambda=0.99$, and $\zeta=0.95$ for $\mathtt{ABQ}(\zeta)$. All results shown in Figure 4 are averaged over 100 runs.
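The step-size grid described above can be enumerated as follows. This is a sketch using only the Python standard library; it assumes that $\beta_{k}$ is recovered as $\alpha_{k}$ times the ratio $\frac{\beta_{k}}{\alpha_{k}}$, and the variable names are ours.

```python
import itertools

# Candidate values 0.1 * 2^j for j = -10, -9, ..., -1, 0 (11 values).
values = [0.1 * 2.0 ** j for j in range(-10, 1)]

# Each grid point fixes alpha and the ratio beta / alpha; beta follows.
grid = [
    (alpha, alpha * ratio)  # (alpha_k, beta_k)
    for alpha, ratio in itertools.product(values, values)
]

n_values = len(values)
n_combinations = len(grid)
largest_alpha = max(alpha for alpha, _ in grid)
```

The sweep therefore covers 11 values per axis, i.e. 121 $(\alpha_{k},\beta_{k})$ pairs per algorithm and domain.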

The result in Figure 4 shows that our $\mathtt{GES}(\lambda)$ learns significantly faster and with better performance than $\mathtt{GQ}(\lambda)$, $\mathtt{ABQ}(\zeta)$, and $\mathtt{GTB}(\lambda)$ in all domains. Besides, $\mathtt{GES}(\lambda)$ converges with a lower variance. We also notice that although Touati et al. (?) claim that their $\mathtt{GTB}(\lambda)$ reaches the same convergence rate as our $\mathtt{GES}(\lambda)$, the result in Figure 4 shows that our $\mathtt{GES}(\lambda)$ outperforms their $\mathtt{GTB}(\lambda)$ significantly.

Comparison of Empirical MSE

We use the following empirical MSE according to (?),

[TABLE]

where $q^{\pi}$ is estimated by simulating the target policy and averaging the discounted cumulative rewards over trajectories. The combination of step-sizes for the MSE is the same as for the empirical MSPBE above. All results shown in Figure 5 are averaged over 100 runs, and we set $\lambda=0.99$, and $\zeta=0.95$ for $\mathtt{ABQ}(\zeta)$.

The result in Figure 5 shows that $\mathtt{GES}(\lambda)$ converges significantly faster than all three baselines, and with lower variance, in the Mountaincar domain. For the Two-state MDP and Baird domains, $\mathtt{GES}(\lambda)$ also achieves better performance. This conclusion further verifies the effectiveness of the proposed $\mathtt{GES}(\lambda)$.

Conclusion

In this paper, we introduce the control variate technique to $\mathtt{Expected~Sarsa}$($\lambda$) and propose the $\mathtt{ES}(\lambda)\text{-}\mathtt{CV}$ algorithm. We analyze all the sources of randomness that lead to the variance of $\mathtt{ES}(\lambda)\text{-}\mathtt{CV}$. We prove that if a good estimator of the value function is available, $\mathtt{ES}(\lambda)\text{-}\mathtt{CV}$ enjoys a lower variance than Expected Sarsa($\lambda$) without control variate. Then, we extend $\mathtt{ES}(\lambda)\text{-}\mathtt{CV}$ to be a convergent algorithm with function approximation and propose the $\mathtt{GES}(\lambda)$ algorithm. We prove that the convergence rate of $\mathtt{GES}(\lambda)$ achieves $\mathcal{O}(1/T)$, which matches or outperforms several state-of-the-art gradient-based algorithms, while using a more relaxed step-size. Finally, we use numerical experiments to demonstrate the effectiveness of the proposed algorithm. Results show that the proposed algorithm converges faster and with lower variance than three typical algorithms: $\mathtt{GQ}$($\lambda$), $\mathtt{GTB}$($\lambda$), and $\mathtt{ABQ}$($\zeta$).

Appendix A: $\lambda$-Return of Sarsa for Off-policy Learning

The discussion of off-policy learning requires some background on importance sampling. We therefore review the basic results on importance sampling (IS) and per-decision importance sampling (PDIS) (?).

Off-Policy Learning via Importance Sampling

Usually, we require that every action taken by $\pi$ is also taken by $\mu$ , which is often called coverage (?) in reinforcement learning.

Assumption 1 (Coverage).

$\forall~(s,a)\in\mathcal{S}\times\mathcal{A}$, we require that $\pi(a|s)>0\Rightarrow\mu(a|s)>0$.

The difficulty of off-policy learning is rooted in the discrepancy between the target policy $\pi$ and the behavior policy $\mu$: we want to learn the target policy, but we only get data generated by the behavior policy. One technique to handle this discrepancy is importance sampling (IS) (?). Let $\tau_{t}^{h}=\{S_{t},A_{t},R_{t+1}\}_{t\geq 0}^{h}$ be a trajectory with finite horizon $h<\infty$. Let $\rho_{t:k}=\prod_{i=t}^{k}\rho_{i}$ denote the cumulative importance sampling ratio, where $\rho_{i}=\frac{\pi(A_{i}|S_{i})}{\mu(A_{i}|S_{i})}$ and $k\leq h$. Let $G_{t}^{h}=\sum_{k=0}^{h-t-1}\gamma^{k}R_{k+t+1}$; under Assumption 1 the IS estimator $G_{t}^{\text{IS}}=\rho_{t:h-1}G^{h}_{t}$ is an unbiased estimate of $q^{\pi}$. However, it is known that the IS estimator suffers from the large variance of the product $\rho_{t:h-1}$ (?). Per-decision importance sampling (PDIS) (?), $G_{t}^{\text{PDIS}}=\sum_{k=0}^{h-t-1}\gamma^{k}\rho_{t:t+k}R_{t+k+1}$, is a practical variance reduction method that does not introduce bias, i.e., $\mathbb{E}_{\mu}[G_{t}^{\text{PDIS}}|S_{t}=s,A_{t}=a]=q^{\pi}(s,a)$.

[TABLE]

For the equation $\mathbb{E}_{\mu}[G_{t}^{\text{IS}}]=\mathbb{E}_{\mu}[G_{t}^{\text{PDIS}}]$, see (?) or Section 5.9 in (?).
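The two estimators differ only in which ratios multiply each reward. The sketch below, assuming NumPy, computes both for a single trajectory starting at $t=0$; the function names and the toy numbers are hypothetical, and `rewards[k]` stands for $R_{k+1}$ while `rhos[k]` stands for $\rho_{k}$.

```python
import numpy as np

def is_return(rewards, rhos, gamma):
    """G^IS = rho_{0:h-1} * sum_k gamma^k R_{k+1}."""
    rewards = np.asarray(rewards, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.prod(rhos) * np.sum(discounts * rewards))

def pdis_return(rewards, rhos, gamma):
    """G^PDIS = sum_k gamma^k rho_{0:k} R_{k+1}."""
    rewards = np.asarray(rewards, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    cumulative_rhos = np.cumprod(rhos)  # rho_{0:k} for k = 0, ..., h-1
    return float(np.sum(discounts * cumulative_rhos * rewards))

# Toy trajectory of horizon 3 (values made up for illustration).
rewards = [1.0, 0.0, 2.0]
rhos = [2.0, 0.5, 1.0]
gamma = 0.5

g_is = is_return(rewards, rhos, gamma)      # 1.0 * (1 + 0 + 0.5) = 1.5
g_pdis = pdis_return(rewards, rhos, gamma)  # 2*1 + 0 + 0.25*1*2 = 2.5
```

The two values differ on a single trajectory; the claim in the text is only that their expectations under $\mu$ agree. When every ratio equals 1 (the on-policy case), both functions return the ordinary discounted return.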

Lemma 1 (Section 3.10, (?); Section 5.9, (?)).

Let $\tau_{t}^{h}=\{S_{k},A_{k},R_{k+1}\}_{k=t}^{h}$ be the trajectory generated by behavior policy $\mu$ , for a given policy $\pi$ and under Assumption 1, the following holds,

[TABLE]

Lemma 1 implies that for any time $t+k~(k\geq 0)$, the importance sampling factors after $t+k$ have no effect in the expectation, thus the following holds: for all $k\geq 0$,

[TABLE]

$\lambda$ -Return of Sarsa

The $\lambda$-return (?) is an average of all the $n$-step returns, each weighted proportionally to $\lambda^{n-1}$, $\lambda\in[0,1]$. For example, let $G_{t}^{t+n}=\sum_{i=0}^{n-1}\gamma^{i}R_{t+i+1}+\gamma^{n}Q_{t+n}$ be the $n$-step return; then the standard forward view of Sarsa$(\lambda)$ is $G_{t}^{\lambda,\text{S}}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{t+n}$, which is equivalent to the following recursive version

[TABLE]

We only discuss the case of off-policy learning; on-policy learning is the special case of off-policy learning with $\rho_{t}=1$. One version of the $\lambda$-return of off-policy Sarsa$(\lambda)$ via importance sampling is defined by the following recursive iteration (Section 12.8, (?)):

[TABLE]

The next Proposition 2 gives a forward view of Eq.(30), and $G_{t}^{\lambda\rho,\text{S}}$ is an unbiased estimate of $q^{\pi}$ .

Proposition 2.

Let $\mu$ be the behavior policy and $\pi$ be the target policy. Let $G_{t}^{t}=Q_{t}$, $G_{t}^{t+n}=\rho_{t}(R_{t+1}+\gamma G_{t+1}^{t+n})$, and $G_{t}^{\lambda\rho}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{t+n}$. Then $G_{t}^{\lambda\rho}$ is equivalent to $G_{t}^{\lambda\rho,\text{S}}$ defined in Eq.(30). Furthermore, $\mathbb{E}_{\mu}[G_{t}^{\lambda\rho}|(S_{t},A_{t})=(s,a)]=q^{\pi}(s,a)$.

Proof.

We restate the complete calculation of the off-policy $\lambda$-return $G_{t}^{\lambda\rho}$ as follows

[TABLE]

The last Eq.(33) implies that from the definition of standard $\lambda$ -return Eq.(31) and Eq.(32), we can get the recursive form of Eq.(30).

Expanding Eq.(31), we get the complete $n$ -step return as follows

[TABLE]

By Eq.(28) and Eq.(29), we have

[TABLE]

thus, $\mathbb{E}_{\mu}[G_{t}^{\lambda\rho}|(S_{t},A_{t})=(s,a)]=\mathbb{E}_{\mu}[(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{t+n}|(S_{t},A_{t})=(s,a)]=q^{\pi}(s,a).$ ∎

Appendix B: Proof of Eq.(5) and Proposition 1

Eq.(5): Recursive $\lambda$ -Return of Expected Sarsa for On-policy Case

In this section, we prove (I) the forward view of Eq.(5); (II) Eq.(5) is an unbiased estimate of $q^{\pi}$ .

Let $G_{t}^{\lambda,\text{ES}}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{t+n},$ where $G_{t}^{t+n}=\sum_{i=0}^{n-1}\gamma^{i}R_{t+i+1}+\gamma^{n}\bar{Q}_{t+n}$ is the $n$-step return of Expected Sarsa and $\bar{Q}_{t+n}=\mathbb{E}_{\pi}[Q(S_{t+n},\cdot)]$. Then $G_{t}^{\lambda,\text{ES}}$ can be written recursively as: $G_{t}^{\lambda,\text{ES}}=R_{t+1}+\gamma[(1-\lambda)\bar{Q}_{t+1}+\lambda G_{t+1}^{\lambda,\text{ES}}].$ Besides, $\mathbb{E}_{\pi}[G_{t}^{\lambda,\text{ES}}|(S_{t},A_{t})=(s,a)]=q^{\pi}(s,a).$

Proof.

By the definition of $n$ -step return of Expected Sarsa: $G_{t}^{t+n}=\sum_{i=0}^{n-1}\gamma^{i}R_{t+i+1}+\gamma^{n}\bar{Q}_{t+n}$ , then $G_{t}^{t+n}$ can be written as the following recursive form:

[TABLE]

Now, we turn to analyze $G_{t}^{\lambda,\text{ES}}$:

[TABLE]

which is the result in Eq.(5).

For on-policy learning, the following is obvious

[TABLE]

Similarly to Eq.(35), we have

[TABLE]

which implies $G_{t}^{\lambda,\text{ES}}$ is an unbiased estimate of $q^{\pi}$ . ∎
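The equivalence between the forward view and the recursive form of Eq.(5) can also be checked numerically on a finite episode. The sketch below uses only plain Python and makes the following assumptions, none of which are stated in the paper: the episode terminates after $T$ rewards, the value at the terminal state is 0, so every $n$-step return with $n\geq T-t$ equals the full return and collects the remaining weight $\lambda^{T-t-1}$; `rewards[i]` stands for $R_{i+1}$ and `qbar[k]` for $\bar{Q}_{k}$; all numbers are made up.

```python
def n_step_return(rewards, qbar, t, n, gamma):
    """n-step Expected Sarsa return from time t (bootstraps with qbar)."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    if t + n < T:  # bootstrap only from non-terminal states
        g += gamma ** n * qbar[t + n]
    return g

def forward_lambda_return(rewards, qbar, t, gamma, lam):
    """(1 - lam) * sum_n lam^(n-1) * G_t^{t+n}, truncated at termination."""
    horizon = len(rewards) - t
    g = 0.0
    for n in range(1, horizon):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, qbar, t, n, gamma)
    # All returns with n >= horizon coincide; their total weight is lam^(horizon-1).
    g += lam ** (horizon - 1) * n_step_return(rewards, qbar, t, horizon, gamma)
    return g

def recursive_lambda_return(rewards, qbar, t, gamma, lam):
    """G_t = R_{t+1} + gamma * [(1 - lam) * qbar_{t+1} + lam * G_{t+1}]."""
    T = len(rewards)
    g = 0.0  # return after termination
    for k in range(T - 1, t - 1, -1):
        qbar_next = qbar[k + 1] if k + 1 < T else 0.0
        g = rewards[k] + gamma * ((1 - lam) * qbar_next + lam * g)
    return g

rewards = [1.0, -0.5, 2.0, 0.3]
qbar = [0.2, 0.7, -0.1, 0.4]
gamma, lam = 0.9, 0.8

forward = [forward_lambda_return(rewards, qbar, t, gamma, lam) for t in range(4)]
recursive = [recursive_lambda_return(rewards, qbar, t, gamma, lam) for t in range(4)]
```

The two lists agree up to floating-point error, which mirrors the algebraic argument in the proof; the recursive form needs a single backward pass, whereas the forward view recomputes every $n$-step return.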

Proof of Proposition 1

Proposition 1. Let $\mu$ and $\pi$ be the behavior and target policy, respectively. Consider the $\lambda$-return of Sarsa and Eq.(6); then $\mathbb{E}_{\mu}[G_{t}^{\lambda\rho,\text{ES}}|(S_{t},A_{t})=(s,a)]=\mathbb{E}_{\pi}[G_{t}^{\lambda,\text{S}}|(S_{t},A_{t})=(s,a)]=q^{\pi}(s,a).$

Proof.

We expand $\mathbb{E}_{\mu}[G_{t}^{\lambda\rho,\text{ES}}|(S_{t},A_{t})=(s,a)]$ as follows

[TABLE]

where Eq.(41) holds by the following facts: recall $\bar{Q}_{t+1}=\sum_{a\in\mathcal{A}}\pi(a|S_{t+1})Q_{t+1}(S_{t+1},a)$ , thus

[TABLE]

If we continue to expand Eq.(42), then we have

[TABLE]

∎

Appendix C: Proof of Theorem 1

Theorem 1 (Forward View and Variance Analysis of Expected Sarsa$(\lambda)$ with Control Variate). Let $\mu$ and $\pi$ denote the behavior and target policy, respectively. The $\lambda$-return with control variate defined in Eq.(7) is equivalent to the following forward view: let $G_{t}^{t}=Q_{t}$,

[TABLE]

Proof.

First, we prove that Eq.(43) and Eq.(44) are equivalent to Eq.(7). Let us expand $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$ (in Eq.(44)),

[TABLE]

the last Eq.(46) implies

[TABLE]

which is Eq.(7). ∎

Appendix D: Proof of Eq.(11)

The Equivalence (a) for Eq.(11)

Proof.

[TABLE]

Eq.(48) is a standard result in RL; for the details of $\mathbb{E}_{\mu}[\sum_{l=t}^{\infty}(\lambda\gamma)^{l-t}\delta^{\text{ES}}_{l}\rho_{t+1:l}]=\mathbb{E}_{\pi}[\sum_{l=t}^{\infty}(\lambda\gamma)^{l-t}\delta^{\text{ES}}_{l}]$, please refer to (?) or Section 6.3.9 in (?). ∎

Appendix E: Proof of Theorem 2

Theorem 2 (Policy Evaluation). For any initial $Q_{0}$, consider the sequential trajectory collection $\mathcal{T}$, where the following $Q_{k}$ is learned according to the $k$-th trajectory $\tau_{k}$, $k\geq 1$,

[TABLE]

By iterating over $k$ trajectories, the error of policy evaluation is upper bounded by

[TABLE]

Proof.

(Proof of Theorem 2) By Eq.(11), the following equation holds (?; ?),

[TABLE]

It is known that the Bellman operator $\mathcal{B}^{\pi}$ is a $\gamma$-contraction (?),

[TABLE]

Thus we have

[TABLE]

Since $0<\dfrac{(1-\lambda)\gamma}{1-\lambda\gamma}<1$, Eq.(50) implies that $\mathcal{B}^{\pi}_{\lambda}$ is a $\dfrac{(1-\lambda)\gamma}{1-\lambda\gamma}$-contraction. By the Banach fixed point theorem (?), the sequence $\{Q_{k}\}_{k\geq 0}$ generated by $Q_{k+1}=\mathcal{B}^{\pi}_{\lambda}Q_{k}$ converges to the fixed point of $\mathcal{B}^{\pi}_{\lambda}$.
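The bound on the contraction factor can be verified in one line. The following short derivation assumes $\gamma\in(0,1)$ and $\lambda\in[0,1)$; these ranges are our assumption, since they are not restated in this appendix.

$$
(1-\lambda)\gamma=\gamma-\lambda\gamma<1-\lambda\gamma\iff\gamma<1,\qquad\text{hence}\qquad 0<\frac{(1-\lambda)\gamma}{1-\lambda\gamma}<1.
$$

Moreover, since $1-\lambda\leq 1-\lambda\gamma$, the factor is at most $\gamma$, with equality only at $\lambda=0$. Under these assumptions, $\mathcal{B}^{\pi}_{\lambda}$ contracts at least as fast as the one-step operator $\mathcal{B}^{\pi}$.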

By Eq.(11), $q^{\pi}$ is the unique fixed point of $\mathcal{B}^{\pi}_{\lambda}$ . Thus, $Q_{k+1}$ converges to $q^{\pi}$ .

Now, we turn to consider the convergence rate. According to (50), it is easy to see $\forall k\in\mathbb{N}$ , $\|Q_{k+1}-Q_{k}\|\leq\dfrac{(1-\lambda)\gamma}{1-\lambda\gamma}\|Q_{k}-Q_{k-1}\|.$ Then, $\forall k,n\in\mathbb{N}$ ,

[TABLE]

Letting $n\rightarrow\infty$, we have

[TABLE]

∎

Appendix F: Proof of Theorem 3

Theorem 3. $\widetilde{G}_{t}^{\lambda\rho,\text{ES}}$ is an unbiased estimator of $q^{\pi}$, whose variance is given recursively as follows,

[TABLE]

where $t\geq 0$ and $\Delta_{t+1}=\bar{Q}_{t+1}-\rho_{t+1}Q_{t+1}-v^{\pi}(s')+\rho_{t+1}q^{\pi}(s',a')$.

Lemma 2.

The expectation of the cross-term between the TD error at $t$ and the difference between the return and value at $t+1$ is zero: for any $q(s,a)=\mathbb{E}[G_{t+1}|S_{t}=s,A_{t}=a]$ , i.e., satisfying the Bellman equation, for any bounded function $b:\mathcal{S}\times\mathcal{A}\times\mathcal{R}\times\mathcal{S}\rightarrow\mathbb{R}$ ,

[TABLE]

A similar result for the state value function appears in (?), and Lemma 2 extends it to the state-action value function. Thus, we omit its proof; for the details, please refer to (?).

Remark 4.

If $G_{t+1}$ is replaced by Expected Sarsa estimator $R_{t+1}+\gamma\bar{Q}_{t+1}$ , Eq.(51) holds.

Proof.

(Proof of Theorem 3)

[TABLE]

Eq.(52) holds due to Remark 4 and Lemma 1 in (?). By the definition of variance, Eq.(52) is equivalent to Eq.(14), which is the result we want to prove. ∎

Appendix G: Two-State MDP Example

[TABLE]

then, we have

[TABLE]

Appendix H: Proof of Eq.(22)

For a given policy $\pi$ and $Q_{\theta}=\Phi\theta$, by the definition of the MSPBE objective function, we have

[TABLE]

where $A=\Phi^{\top}\Xi(I-\lambda\gamma P^{\pi})^{-1}(\gamma P^{\pi}-I)\Phi$ and $b=\Phi^{\top}\Xi(I-\lambda\gamma P^{\pi})^{-1}r.$
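For a small problem, $A$ and $b$ can be formed explicitly. The sketch below assumes NumPy and a hypothetical three-element chain in which each index stands for one state-action pair; the function name `build_A_b` and all numbers are ours. Note that $b$ is computed with $\Phi^{\top}$ so that it has the same dimension as $\theta$, which is our reading of the definition.

```python
import numpy as np

def build_A_b(Phi, xi, P, r, gamma, lam):
    """Form A = Phi^T Xi (I - lam*gamma*P)^{-1} (gamma*P - I) Phi
    and  b = Phi^T Xi (I - lam*gamma*P)^{-1} r."""
    n = P.shape[0]
    identity = np.eye(n)
    Xi = np.diag(xi)  # diagonal matrix of the sampling distribution
    resolvent = np.linalg.inv(identity - lam * gamma * P)
    A = Phi.T @ Xi @ resolvent @ (gamma * P - identity) @ Phi
    b = Phi.T @ Xi @ resolvent @ r
    return A, b

# Hypothetical transition matrix, distribution, and rewards.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
xi = np.array([0.2, 0.3, 0.5])
r = np.array([1.0, 0.0, -1.0])
gamma, lam = 0.9, 0.5

# Tabular features (Phi = I): solving A theta + b = 0 recovers q^pi.
Phi = np.eye(3)
A, b = build_A_b(Phi, xi, P, r, gamma, lam)
theta = np.linalg.solve(A, -b)
q_pi = np.linalg.solve(np.eye(3) - gamma * P, r)
```

In the tabular case, $A\theta+b=\Xi(I-\lambda\gamma P^{\pi})^{-1}[(\gamma P^{\pi}-I)\theta+r]$, so the root of $A\theta+b=0$ is $(I-\gamma P^{\pi})^{-1}r$ for every $\lambda$, which gives a quick sanity check of the two definitions.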

Appendix I: Proof of Theorem 4

Theorem 4. Consider the sequence $\{(\theta_{t},\omega_{t})\}_{t=1}^{T}$ generated by (27), where the step-sizes $\alpha,\beta$ are positive constants. Let $\bar{\theta}_{T}=\frac{1}{T}\sum_{t=1}^{T}\theta_{t}$, $\bar{\omega}_{T}=\frac{1}{T}\sum_{t=1}^{T}\omega_{t}$, and choose the step-sizes $\alpha,\beta$ to satisfy $1-\sqrt{\alpha\beta}\|A\|_{*}>0$, where $\|A\|_{*}=\sup_{\|x\|=1}\|Ax\|$ is the operator norm. If the parameter $(\theta,\omega)$ lies in a bounded set $D_{\theta}\times D_{\omega}$, i.e., $\text{diam}\,D_{\theta}=\sup\{\|\theta_{1}-\theta_{2}\|;\theta_{1},\theta_{2}\in D_{\theta}\}<\infty$ and $\text{diam}\,D_{\omega}<\infty$, then $\mathbb{E}[\epsilon_{\Psi}(\bar{\theta}_{T},\bar{\omega}_{T})]$ is upper bounded by:

[TABLE]

The proof of Theorem 4 uses an inequality (Eq.(55)), which we present in the following Proposition 3.

Proposition 3.

Consider the expectation version of the update in Eq.(25),

[TABLE]

Let $F(\omega)=\frac{1}{2}\|\omega\|^{2}_{M}-b^{\top}\omega$; then for any $(\theta,\omega)\in D_{\theta}\times D_{\omega}$, the following holds

[TABLE]

Proof.

(Proof of Proposition 3) Let sub-gradients of $f$ at $x$ be denoted as $\partial f(x)$ , $\partial f(x)=\{g|f(x)-f(y)\leq g^{T}(x-y),\forall y\in\textbf{dom}(f)\}$ . By the definition of sub-gradient , we have $\frac{\omega_{t}-\omega_{t+1}}{\beta}+A{\theta}_{t}\in\partial F(\omega_{t+1}).$ Since $F$ is convex, then for any $(\theta,\omega)\in D_{\theta}\times D_{\omega}$ the following holds

[TABLE]

By the law of cosines: $2\langle a-b,c-b\rangle=\|a-b\|^{2}+\|b-c\|^{2}-\|a-c\|^{2}$ , we have

[TABLE]

summing them implies the following inequality,

[TABLE]

which is what we want to prove. ∎
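The law-of-cosines identity used in the proof of Proposition 3 follows from expanding a squared norm. A one-line check, for arbitrary vectors $a,b,c$ in an inner-product space:

$$
\|a-c\|^{2}=\|(a-b)-(c-b)\|^{2}=\|a-b\|^{2}+\|c-b\|^{2}-2\langle a-b,c-b\rangle,
$$

which rearranges to $2\langle a-b,c-b\rangle=\|a-b\|^{2}+\|b-c\|^{2}-\|a-c\|^{2}$.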

Proof.

(Proof of Theorem 4) Let $\bar{\theta}_{t}=2\theta_{t}-\theta_{t-1}$ and $\epsilon=\sqrt{\frac{\beta}{\alpha}}$. Then, for any $(\theta,\omega)\in D_{\theta}\times D_{\omega}$:

[TABLE]

By the inequality in Proposition 3, we have

[TABLE]

Summing Eq.(56) over $t=0,\dots,T-1$ gives

[TABLE]

By the Cauchy-Schwarz inequality and the elementary bound $xy\leq\frac{1}{2}(x^{2}+y^{2})$, i.e., $|\langle\mathbf{u},\mathbf{v}\rangle|\leq\|\mathbf{u}\|\|\mathbf{v}\|\leq\dfrac{1}{2}(\|\mathbf{u}\|^{2}+\|\mathbf{v}\|^{2})$, we have

[TABLE]

then the following holds, for any $(\theta,\omega)\in D_{\theta}\times D_{\omega}$ :

[TABLE]

Let $\bar{\theta}_{T}=\dfrac{\sum_{t=0}^{T-1}\theta_{t}}{T}$, $\bar{\omega}_{T}=\dfrac{\sum_{t=0}^{T-1}\omega_{t}}{T}$, and choose the step-sizes $\alpha,\beta$ to satisfy $1-\sqrt{\alpha\beta}\|A\|_{*}>0$. By the convexity of $F(\omega)$ and $G(\theta)$, we deduce from (57):

[TABLE]

By Eq.(58), we have

[TABLE]

∎

Appendix J: Details of Experiments

MountainCar. Since the state space of the Mountaincar domain is continuous, we use the open tile coding software http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tilecoding.html to extract state features.

In this experiment, we set the number of tilings to 4, and there are no white noise features. The performance is an average over 5 runs, and each run contains 5000 episodes. We set $\lambda=0.99$, $\gamma=0.99$. The MSPBE/MSE distribution is computed over the combination of step-sizes, $(\alpha_{k},\frac{\beta_{k}}{\alpha_{k}})\in[0.1\times 2^{j}|j=-10,-9,\cdots,-1,0]^{2}$. Following suggestions from Section 10.1 in (?), we set all the initial state-action values to 0, which is optimistic and therefore encourages extensive exploration.

Baird Example. The Baird example considers the episodic seven-state, two-action MDP. The $\mathtt{dashed}$ action takes the system to one of the six upper states with equal probability, whereas the $\mathtt{solid}$ action takes the system to the seventh state. The behavior policy $b$ selects the $\mathtt{dashed}$ and $\mathtt{solid}$ actions with probabilities $\frac{6}{7}$ and $\frac{1}{7}$, so that the next-state distribution under it is uniform (the same for all nonterminal states), which is also the starting distribution for each episode. The target policy $\pi$ always takes the $\mathtt{solid}$ action, and so the on-policy distribution (for $\pi$) is concentrated in the seventh state. The reward is zero on all transitions. The discount rate is $\gamma=0.99$. The features $\phi(\cdot,{\mathtt{dashed}})$ and $\phi(\cdot,{\mathtt{solid}})$ are defined as follows,

[TABLE]
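The claim that the behavior policy induces a uniform next-state distribution, and the resulting importance sampling ratios $\rho=\frac{\pi(a|s)}{b(a|s)}$, can be checked with exact arithmetic. The sketch below uses only the Python standard library; the state indexing (0 to 5 for the upper states, 6 for the seventh state) and all names are our own conventions.

```python
from fractions import Fraction

N_STATES = 7  # six upper states (0..5) and the seventh state (6)

behavior = {"dashed": Fraction(6, 7), "solid": Fraction(1, 7)}
target = {"dashed": Fraction(0), "solid": Fraction(1)}

def next_state_distribution(policy):
    """Next-state distribution, identical for every current state."""
    dist = [Fraction(0)] * N_STATES
    for s in range(6):
        # dashed: one of the six upper states, chosen uniformly
        dist[s] += policy["dashed"] * Fraction(1, 6)
    # solid: always the seventh state
    dist[6] += policy["solid"]
    return dist

behavior_dist = next_state_distribution(behavior)
target_dist = next_state_distribution(target)

# Per-step importance sampling ratios pi(a|s) / b(a|s).
rho = {a: target[a] / behavior[a] for a in behavior}
```

Under the behavior policy every state has probability $\frac{1}{7}$, while the ratio is 7 for the solid action and 0 for the dashed action, so products of ratios along a trajectory are either 0 or powers of 7. This is one way to see why the example is a demanding test for off-policy methods.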

Bibliography

[Tamar, Di Castro, and Mannor 2016] Tamar, A.; Di Castro, D.; and Mannor, S. 2016. Learning the variance of the reward-to-go. The Journal of Machine Learning Research 17(13):1–36.
[White and White 2016] White, A., and White, M. 2016. Investigating practical linear temporal difference learning. In International Conference on Autonomous Agents & Multiagent Systems, 494–502.
[Baird 1995] Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier. 30–37.
[Balamurugan and Bach 2016] Balamurugan, P., and Bach, F. 2016. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, 1416–1424.
[Bertsekas 2009] Bertsekas, D. P. 2009. Convex Optimization Theory. Athena Scientific, Belmont.
[Bertsekas 2012] Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA.
[Dalal et al. 2018a] Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018a. Finite sample analyses for TD(0) with function approximation. In AAAI 2018.
[Dalal et al. 2018b] Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018b. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Annual Conference on Learning Theory (COLT).
