# Expected Sarsa($\lambda$) with Control Variate for Variance Reduction

**Authors:** Long Yang, Yu Zhang, Jun Wen, Qian Zheng, Pengfei Li, Gang Pan

arXiv: 1906.11058 · 2019-09-09

## TL;DR

This paper applies control variates to Expected Sarsa(λ) to reduce the variance of off-policy evaluation. The resulting tabular algorithm, ES(λ)-CV, has provably lower variance than Expected Sarsa(λ) when a suitable value-function estimator is available. Its extension to linear function approximation, GES(λ), converges at a rate of O(1/T).

## Contribution

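The paper proposes the tabular ES(λ)-CV algorithm, which adds a control variate to Expected Sarsa(λ) for variance reduction. It then extends ES(λ)-CV to GES(λ), a convergent algorithm with linear function approximation derived from a convex-concave saddle-point formulation.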

## Key findings

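- ES(λ)-CV has provably lower variance than Expected Sarsa(λ), provided a proper estimator of the value function is available.
- GES(λ) achieves a convergence rate of O(1/T), which matches or outperforms many gradient-based algorithms under a more relaxed condition.
- Numerical experiments show better performance and lower variance than the gradient-based TD learning algorithms GQ(λ), GTB(λ) and ABQ(ζ).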
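
The variance claim in the first finding can be illustrated with a generic one-step control variate for importance sampling. This is a minimal sketch and not the paper's ES(λ)-CV update: the function name `is_estimates`, the three-action toy policies `pi` and `mu`, and the values `q_true` and `q_hat` are all assumptions made for illustration. It shows that subtracting a term with known mean leaves the expectation unchanged while shrinking the variance when `q_hat` is close to `q_true`.

```python
import numpy as np


def is_estimates(q_true, q_hat, pi, mu, n_samples, rng):
    """Estimate sum_a pi[a] * q_true[a] from actions drawn under mu.

    Returns per-sample values of two unbiased estimators:
    plain importance sampling, and importance sampling with a
    control variate built from the approximate values q_hat.
    """
    actions = rng.choice(len(mu), size=n_samples, p=mu)
    rho = pi[actions] / mu[actions]  # importance sampling ratios

    plain = rho * q_true[actions]

    # Under mu, rho * q_hat[a] has the known mean sum_a pi[a] * q_hat[a].
    # Subtracting (rho * q_hat[a] - that mean) keeps the expectation fixed.
    baseline = np.dot(pi, q_hat)
    with_cv = rho * (q_true[actions] - q_hat[actions]) + baseline
    return plain, with_cv


# Hypothetical toy problem: three actions, target policy pi, behavior policy mu.
q_true = np.array([1.0, 2.0, 4.0])
q_hat = q_true + np.array([0.1, -0.1, 0.05])  # assumed "good" value estimate
pi = np.array([0.1, 0.2, 0.7])
mu = np.array([0.5, 0.3, 0.2])

rng = np.random.default_rng(0)
plain, with_cv = is_estimates(q_true, q_hat, pi, mu, 20_000, rng)

print("target value      :", np.dot(pi, q_true))
print("plain IS  mean/var:", plain.mean(), plain.var())
print("with CV   mean/var:", with_cv.mean(), with_cv.var())
```

Both estimators target the same quantity, but the control variate version only has to average the residual `q_true - q_hat`, so its spread depends on how accurate `q_hat` is. This mirrors the condition in the finding above that the variance reduction relies on a proper value-function estimator.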

## Abstract

Off-policy learning is powerful for reinforcement learning. However, the high variance of off-policy evaluation is a critical challenge, which causes off-policy learning to fall into uncontrolled instability. In this paper, to reduce the variance, we introduce the control variate technique to $\mathtt{Expected}$ $\mathtt{Sarsa}$($\lambda$) and propose a tabular $\mathtt{ES}$($\lambda$)-$\mathtt{CV}$ algorithm. We prove that if a proper estimator of the value function is obtained, the proposed $\mathtt{ES}$($\lambda$)-$\mathtt{CV}$ enjoys a lower variance than $\mathtt{Expected}$ $\mathtt{Sarsa}$($\lambda$). Furthermore, to extend $\mathtt{ES}$($\lambda$)-$\mathtt{CV}$ to a convergent algorithm with linear function approximation, we propose the $\mathtt{GES}$($\lambda$) algorithm under the convex-concave saddle-point formulation. We prove that the convergence rate of $\mathtt{GES}$($\lambda$) achieves $\mathcal{O}(1/T)$, which matches or outperforms many state-of-the-art gradient-based algorithms while requiring a more relaxed condition. Numerical experiments show that the proposed algorithm performs better, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: $\mathtt{GQ}$($\lambda$), $\mathtt{GTB}$($\lambda$) and $\mathtt{ABQ}$($\zeta$).

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.11058/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/1906.11058/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/1906.11058/full.md

---
Source: https://tomesphere.com/paper/1906.11058