Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach
Baturay Saglam, Dogan C. Cicek, Furkan B. Mutlu, Suleyman S. Kozat

TL;DR
This paper introduces a new one-step correction method for off-policy actor-critic algorithms that reduces bias and improves data efficiency in continuous control tasks, especially with deterministic policies.
Contribution
It proposes a novel policy similarity measure for single-step off-policy correction applicable to deterministic neural policies, addressing limitations of existing importance sampling techniques.
Findings
Achieves higher returns in fewer steps than existing methods.
Provides theoretical guarantees for safe off-policy learning.
Improves performance on continuous control benchmarks.
Abstract
Compared to on-policy counterparts, off-policy model-free deep reinforcement learning can improve data efficiency by repeatedly using the previously gathered data. However, off-policy learning becomes challenging when the discrepancy between the underlying distributions of the agent's policy and collected data increases. Although the well-studied importance sampling and off-policy policy gradient techniques were proposed to compensate for this discrepancy, they usually require a collection of long trajectories and induce additional problems such as vanishing/exploding gradients or discarding many useful experiences, which eventually increases the computational complexity. Moreover, their generalization to either continuous action domains or policies approximated by deterministic deep neural networks is strictly limited. To overcome these limitations, we introduce a novel policy…
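To make the vanishing/exploding issue mentioned in the abstract concrete, the sketch below simulates trajectory-level importance weights. It is a toy illustration only, not the paper's correction method. The two one-dimensional Gaussian policies, their means, and the function names `trajectory_log_weight` and `summarize` are all assumptions chosen for the example.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def trajectory_log_weight(horizon, rng, target_mean=0.5, behavior_mean=0.0, std=1.0):
    """Log of the product of per-step ratios pi(a_t) / b(a_t).

    Actions are drawn from the (hypothetical) behavior policy b; the
    (hypothetical) target policy pi has a slightly shifted mean.
    """
    log_w = 0.0
    for _ in range(horizon):
        a = rng.gauss(behavior_mean, std)
        log_w += (gaussian_logpdf(a, target_mean, std)
                  - gaussian_logpdf(a, behavior_mean, std))
    return log_w


def summarize(horizon, n_traj=2000, seed=0):
    """Return (median, max - min) of log-weights over n_traj simulated trajectories."""
    rng = random.Random(seed)
    logs = sorted(trajectory_log_weight(horizon, rng) for _ in range(n_traj))
    return logs[len(logs) // 2], logs[-1] - logs[0]


if __name__ == "__main__":
    for horizon in (1, 10, 100):
        median, spread = summarize(horizon)
        print(f"T={horizon:4d}  median log-weight={median:8.2f}  "
              f"spread of log-weights={spread:8.2f}")
```

In this toy setup each per-step ratio has expectation 1 under the behavior policy, yet the typical (median) trajectory weight shrinks toward zero as the horizon grows while the spread between the smallest and largest weights widens. That is one way to see why multi-step importance sampling tends to produce vanishing or exploding terms on long trajectories, and why a single-step correction is attractive.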
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Q-Learning
