Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions

Brett Daley; Christopher Amato

arXiv:2112.12281 · cs.LG · December 24, 2021 · 1 citation

TL;DR

This paper introduces a new multistep operator for off-policy reinforcement learning that accounts for past decisions, providing convergence guarantees and improving learning efficiency over existing methods.

Contribution

The paper proposes a novel multistep operator for off-policy learning whose trace coefficients may depend on past decisions. It proves that the operator converges and shows that it applies to existing algorithms, with the aim of better credit assignment and faster learning.

Findings

01. Proves convergence of the new operator for policy evaluation and control.

02. Establishes convergence guarantees for algorithms like Truncated IS and Retrace.

03. Guides development of algorithms considering multiple past decisions.

Abstract

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action. Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating ("cutting") the ratios ("traces") to counteract the excessive variance of the IS estimator. Unfortunately, cutting traces on a per-decision basis is not necessarily efficient; once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and…
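
The difference between cutting traces per decision and accounting for past decisions can be illustrated with a small sketch. It is not the paper's operator or code. It assumes a Retrace-style per-decision cut, `lam * min(1, rho)`, and compares it with a hypothetical trajectory-level rule that truncates the cumulative IS product as a whole. The function names, the `lam` parameter, and the sample ratios are illustrative assumptions.

```python
from itertools import accumulate
from operator import mul


def per_decision_traces(rhos, lam=1.0):
    """Cumulative traces when each IS ratio is cut separately (Retrace-style).

    Each step contributes lam * min(1, rho_t); the trace is the running
    product, so a cut made at one step is carried into all later steps.
    """
    cuts = [lam * min(1.0, r) for r in rhos]
    return list(accumulate(cuts, mul))


def trajectory_truncated_traces(rhos, lam=1.0):
    """Illustrative past-dependent rule (an assumption for this sketch).

    The full IS product up to step t is truncated at 1, so a later large
    ratio can offset an earlier small one.
    """
    products = accumulate(rhos, mul)
    return [lam ** (t + 1) * min(1.0, p) for t, p in enumerate(products)]


# Hypothetical IS ratios pi(a|s) / mu(a|s) along one trajectory.
rhos = [0.5, 2.0, 1.0]

per_decision = per_decision_traces(rhos)
past_dependent = trajectory_truncated_traces(rhos)

print(per_decision)    # [0.5, 0.5, 0.5]
print(past_dependent)  # [0.5, 1.0, 1.0]
```

In this example the per-decision trace stays at 0.5 after the first step, even though the second ratio would have restored the full product to 1. The trajectory-level rule recovers, which is the kind of reversal that per-decision cutting rules out.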

Taxonomy

Topics: Advanced Bandit Algorithms Research · Age of Information Optimization · Reinforcement Learning in Robotics

Methods: Experience Replay · Retrace