Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift
Riashat Islam, Komal K. Teru, Deepak Sharma, Joelle Pineau

TL;DR
This paper identifies state distribution shift as a key challenge in off-policy deep reinforcement learning and proposes a novel method to constrain this shift, leading to improved performance in continuous control tasks.
Contribution
The paper systematically analyzes state distribution mismatch in off-policy learning and introduces a new constrained policy gradient method to minimize this shift.
Findings
Minimizing state distribution mismatch improves off-policy algorithm performance.
The proposed method outperforms baseline algorithms on continuous control tasks.
Constraining state distribution shift enhances stability and effectiveness of policy updates.
Abstract
Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data without online interactions with the environment, due to the phenomenon known as extrapolation error. This is often due to past data available in the replay buffer that may be quite different from the data distribution under the current policy. We argue that most off-policy learning methods fundamentally suffer from a state distribution shift due to the mismatch between the state visitation distribution of the data collected by the behavior and target policies. This data distribution shift between current and past samples can significantly impact the performance of most modern off-policy based policy optimization algorithms. In this work, we first do a systematic analysis of state distribution mismatch in off-policy learning, and then develop a novel…
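To make the notion of state distribution shift concrete, the following is a minimal Python sketch, not the paper's method. It assumes a hypothetical 5-state chain MDP with deterministic left/right moves, a behavior policy that moves right with probability 0.5, and a target policy that moves right with probability 0.9. It computes each policy's discounted state visitation distribution exactly and measures the mismatch with a KL divergence. The choice of KL, the environment, and every function name here are illustrative assumptions; the abstract above is truncated before it states how the authors quantify or constrain the shift.

```python
import math


def step_distribution(dist, p_right):
    """Push a state distribution one step through the (hypothetical) chain MDP."""
    n = len(dist)
    nxt = [0.0] * n
    for s, mass in enumerate(dist):
        right = min(s + 1, n - 1)  # moving right at the last state stays put
        left = max(s - 1, 0)       # moving left at the first state stays put
        nxt[right] += mass * p_right
        nxt[left] += mass * (1.0 - p_right)
    return nxt


def discounted_visitation(p_right, n_states=5, gamma=0.9, horizon=500):
    """Discounted state visitation d(s) = (1 - gamma) * sum_t gamma^t P(s_t = s).

    The episode starts in state 0; the infinite sum is truncated at `horizon`
    and renormalized, which is an approximation.
    """
    dist = [0.0] * n_states
    dist[0] = 1.0
    visit = [0.0] * n_states
    weight = 1.0 - gamma
    for _ in range(horizon):
        for s in range(n_states):
            visit[s] += weight * dist[s]
        dist = step_distribution(dist, p_right)
        weight *= gamma
    total = sum(visit)
    return [v / total for v in visit]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Assumed policies: behavior is a random walk, target strongly prefers "right".
d_behavior = discounted_visitation(p_right=0.5)
d_target = discounted_visitation(p_right=0.9)

# One possible (assumed) measure of the state distribution shift.
shift = kl_divergence(d_target, d_behavior)
print("behavior visitation:", [round(x, 3) for x in d_behavior])
print("target visitation:  ", [round(x, 3) for x in d_target])
print("KL(target || behavior):", round(shift, 3))
```

In this toy setting the target policy places much more visitation mass on the rightmost state than the behavior policy does, so data collected by the behavior policy covers the states the target policy cares about only sparsely. That is the kind of mismatch the abstract describes.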
Taxonomy
Topics: Reinforcement Learning in Robotics · Optimization and Search Problems · Smart Grid Energy Management
