Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning

Hanhan Zhou; Tian Lan; Vaneet Aggarwal

arXiv:2308.14897·cs.LG·August 30, 2023·2 cites

Open Access

TL;DR

This paper introduces DPE, a novel offline reinforcement learning algorithm that combines sequence modeling with double policy estimation to reduce variance and improve performance on benchmark tasks.

Contribution

The paper proposes DPE, a new method integrating sequence modeling and double policy estimation with proven variance reduction, advancing offline RL performance.

Findings

01 DPE outperforms state-of-the-art baselines on several OpenAI Gym tasks.

02 The method demonstrates statistically proven variance reduction.

03 DPE effectively combines sequence modeling with double policy estimation for offline RL.

Abstract

Offline reinforcement learning aims to learn a policy from datasets of previously gathered environment-action interaction records, without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as the Decision Transformer. While these sequence-based methods achieve competitive results over return-to-go methods, especially on tasks that require longer episodes or have sparse rewards, they do not use importance sampling to correct the policy bias that arises with off-policy data, mainly because the behavior policy is unavailable and the evaluation policies are deterministic. To this end, we propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation (DPE) in a unified framework…
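
The importance-sampling correction mentioned in the abstract can be illustrated with a small, generic sketch. This is not the DPE algorithm; it only shows, for a tabular one-step problem, how a behavior policy that is absent from the dataset might be estimated from logged state-action counts and then used to reweight rewards toward an evaluation policy. All names (`estimate_policy`, `importance_weighted_return`), the toy policies, and the reward table are assumptions made for illustration.

```python
import numpy as np

def estimate_policy(states, actions, n_states, n_actions, smoothing=1.0):
    """Count-based estimate of pi(a | s) with additive smoothing (hypothetical helper)."""
    counts = np.full((n_states, n_actions), smoothing, dtype=float)
    np.add.at(counts, (states, actions), 1.0)  # accumulate visits per (state, action)
    return counts / counts.sum(axis=1, keepdims=True)

def importance_weighted_return(states, actions, rewards, pi_eval, pi_behavior):
    """Ordinary importance-sampling estimate of the evaluation policy's value."""
    weights = pi_eval[states, actions] / pi_behavior[states, actions]
    return float(np.mean(weights * rewards))

# Toy logged dataset: the true behavior policy is used only to generate data.
rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 2, 2, 5000
true_behavior = np.array([[0.7, 0.3], [0.4, 0.6]])
pi_eval = np.array([[0.1, 0.9], [0.9, 0.1]])       # assumed evaluation policy
reward_table = np.array([[0.0, 1.0], [1.0, 0.0]])  # assumed deterministic rewards

states = rng.integers(0, n_states, size=n_samples)
actions = np.array([rng.choice(n_actions, p=true_behavior[s]) for s in states])
rewards = reward_table[states, actions]

# Estimate the unknown behavior policy from the data, then reweight.
pi_b_hat = estimate_policy(states, actions, n_states, n_actions)
is_estimate = importance_weighted_return(states, actions, rewards, pi_eval, pi_b_hat)

# Ground truth under uniformly sampled states, for comparison.
true_value = float(np.mean((pi_eval * reward_table).sum(axis=1)))
print(f"IS estimate: {is_estimate:.3f}, true value: {true_value:.3f}")
```

In this sketch the importance weights use the estimated behavior policy rather than the true one, since the true one is assumed to be unavailable in the offline dataset. Sequence-modeled, multi-step settings would need per-trajectory weights and a different policy estimator.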

Taxonomy

Topics: Reinforcement Learning in Robotics