Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias
Max Sobol Mark, Archit Sharma, Fahim Tajwar, Rafael Rafailov, Sergey Levine, Chelsea Finn

TL;DR
This paper introduces offline retraining, a method that decouples exploration and evaluation policies in online RL to reduce bias and improve performance, leveraging offline RL techniques during fine-tuning.
Contribution
The paper proposes offline retraining within the Offline-to-Online-to-Offline (OOO) framework, which uses separate policies for exploration and evaluation to improve online RL performance.
Findings
Improves average performance of offline-to-online RL methods by 14-26%.
Achieves state-of-the-art results on several environments in the D4RL benchmarks.
Improves online RL performance by 165% on two OpenAI Gym environments.
Abstract
It is desirable for policies to optimistically explore new states and behaviors during online reinforcement learning (RL) or fine-tuning, especially when prior offline data does not provide enough state coverage. However, exploration bonuses can bias the learned policy, and our experiments find that naive, yet standard use of such bonuses can fail to recover a performant policy. Concurrently, pessimistic training in offline RL has enabled recovery of performant policies from static datasets. Can we leverage offline RL to recover better policies from online interaction? We make a simple observation that a policy can be trained from scratch on all interaction data with pessimistic objectives, thereby decoupling the policies used for data collection and for evaluation. Specifically, we propose offline retraining, a policy extraction step at the end of online fine-tuning in our…
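As a rough illustration of the decoupling described in the abstract, the following is a minimal tabular sketch in Python. It is not the authors' implementation: the chain environment, the count-based exploration bonus, the count-based pessimism penalty, and the function names `step`, `collect_with_bonus`, and `offline_retrain` are all assumptions made for illustration. The exploration policy learns from bonus-augmented rewards, while the evaluation policy is fit from scratch on the logged transitions using only the true rewards and a pessimistic target.

```python
import numpy as np

def step(s, a, n_states):
    """Toy chain (assumed): action 1 moves right, action 0 moves left.
    Reward is 1 when the next state is the last state, else 0."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == n_states - 1 else 0.0), s_next

def collect_with_bonus(n_states=6, episodes=50, horizon=12,
                       bonus_scale=1.0, gamma=0.9):
    """Online phase: Q-learning whose targets include a count-based bonus.
    The logged data keeps the TRUE reward, without the bonus."""
    q = np.zeros((n_states, 2))
    counts = np.zeros((n_states, 2))
    data = []
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            bonus = bonus_scale / np.sqrt(counts[s] + 1.0)
            a = int(np.argmax(q[s] + bonus))  # optimistic action choice
            r, s_next = step(s, a, n_states)
            counts[s, a] += 1
            data.append((s, a, r, s_next))
            # The bonus enters the target, so q is biased toward novelty.
            target = r + bonus[a] + gamma * q[s_next].max()
            q[s, a] += 0.5 * (target - q[s, a])
            s = s_next
    return data

def offline_retrain(data, n_states, gamma=0.9, penalty=0.1, iters=200):
    """Offline phase: fit a fresh Q-table on all logged transitions with a
    pessimistic target (true reward minus a count-based penalty)."""
    counts = np.zeros((n_states, 2))
    r_sum = np.zeros((n_states, 2))
    next_counts = np.zeros((n_states, 2, n_states))
    for s, a, r, s_next in data:
        counts[s, a] += 1
        r_sum[s, a] += r
        next_counts[s, a, s_next] += 1
    seen = counts > 0
    safe = np.maximum(counts, 1.0)
    r_hat = r_sum / safe                   # empirical mean reward
    p_hat = next_counts / safe[..., None]  # empirical transition model
    floor = -penalty / (1.0 - gamma)       # value assigned to unseen pairs
    q = np.zeros((n_states, 2))
    for _ in range(iters):
        v = q.max(axis=1)
        q_new = r_hat - penalty / np.sqrt(safe) + gamma * (p_hat @ v)
        q = np.where(seen, q_new, floor)
    return q.argmax(axis=1)                # greedy evaluation policy

data = collect_with_bonus()
eval_policy = offline_retrain(data, n_states=6)
```

Here `eval_policy` never sees the exploration bonus; it depends on the exploration policy only through which transitions were logged.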
Taxonomy
Topics: Reinforcement Learning in Robotics · Digital Games and Media
