Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning

Riashat Islam; Raihan Seraj; Samin Yeasar Arnob; Doina Precup

arXiv:1912.05109·cs.LG·December 12, 2019

TL;DR

This paper introduces a doubly robust off-policy critic evaluation method for actor-critic algorithms, improving stability and performance in continuous control and stochastic reward scenarios.

Contribution

It extends doubly robust estimators to actor-critic algorithms, enhancing off-policy critic evaluation and robustness in reinforcement learning.

Findings

1. Significantly improves performance in continuous control tasks.
2. Enhances robustness under stochastic and corrupted reward signals.
3. Reduces variance and instability in off-policy critic evaluation.

Abstract

We study the problem of off-policy critic evaluation in several variants of value-based off-policy actor-critic algorithms. Off-policy actor-critic algorithms require an off-policy critic evaluation step to estimate the value of the new policy after every policy gradient update. Despite the enormous success of off-policy policy gradients on control tasks, existing general methods suffer from high variance and instability, partly because the policy improvement depends on the gradient of the estimated value function. In this work, we present a new approach to off-policy policy evaluation in actor-critic algorithms, based on doubly robust estimators. We extend the doubly robust estimator from off-policy policy evaluation (OPE) to actor-critic algorithms that consist of a reward estimator performance model. We find that doubly robust estimation of the critic can significantly improve performance in continuous…
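The abstract builds on doubly robust estimators from OPE. As a rough illustration, the sketch below implements the standard sequential doubly robust recursion from the OPE literature, V_DR(t) = v_hat(s_t) + rho_t * (r_t + gamma * V_DR(t+1) - q_hat(s_t, a_t)), where rho_t is the ratio of target to behavior action probabilities. It is not the paper's actor-critic variant, which this listing does not spell out. The function name, the argument layout, and the callables `pi`, `mu`, `q_hat`, and `v_hat` are all hypothetical names chosen for this example.

```python
def doubly_robust_return(trajectory, pi, mu, q_hat, v_hat, gamma=0.99):
    """Doubly robust value estimate for one trajectory (illustrative sketch).

    trajectory: list of (state, action, reward) tuples from the behavior policy.
    pi, mu:     callables (action, state) -> probability under the target and
                behavior policies (hypothetical signatures).
    q_hat:      callable (state, action) -> approximate action value.
    v_hat:      callable (state) -> approximate state value under pi.
    """
    v_dr = 0.0  # estimate beyond the final step is taken to be zero
    # Work backwards so each step can reuse the estimate for the next step.
    for state, action, reward in reversed(trajectory):
        rho = pi(action, state) / mu(action, state)  # per-step importance ratio
        # Model-based baseline plus an importance-weighted correction term.
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr
```

With a zero model (`q_hat` and `v_hat` returning 0) the recursion reduces to per-step importance sampling, and with an exact model the correction term has zero mean, which is the usual motivation for calling the estimator doubly robust.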

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Cardiovascular Function and Risk Factors