Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning
Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Amos Storkey

TL;DR
This paper introduces PTGOOD, a planning-based exploration method for offline-to-online reinforcement learning that effectively explores out-of-distribution states, leading to improved policy performance without reward modification.
Contribution
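The paper proposes PTGOOD, a non-myopic planning algorithm that steers online exploration toward high-reward, out-of-distribution regions. Because it plans over multiple steps and leaves the reward function unmodified, it avoids the problems associated with reward modification and myopic exploration.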
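As a rough illustration of what planning toward high-reward, out-of-distribution regions can look like, the Python sketch below scores random candidate action sequences under a dynamics model by predicted reward plus a novelty bonus, then returns the first action of the best sequence. It is not the authors' implementation: the random-shooting planner, the nearest-neighbor novelty proxy, and every name in it (`novelty`, `plan_first_action`, `toy_model`, `toy_reward`, `beta`) are assumptions made for this example.

```python
import numpy as np


def novelty(state, dataset_states):
    """Hypothetical OOD proxy: distance to the nearest offline-dataset state."""
    return float(np.min(np.linalg.norm(dataset_states - state, axis=1)))


def plan_first_action(state, model, reward_fn, dataset_states,
                      horizon=5, n_candidates=64, beta=1.0, rng=None):
    """Random-shooting planner (an assumed stand-in, not the paper's planner).

    Each candidate action sequence is rolled out through `model`; its score is
    the sum of predicted rewards plus `beta` times the novelty of visited states.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    action_dim = state.shape[0]  # toy assumption: actions match the state dimension
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))

    best_score, best_action = -np.inf, None
    for seq in candidates:
        s, score = state.copy(), 0.0
        for a in seq:
            s_next = model(s, a)
            score += reward_fn(s, a, s_next) + beta * novelty(s_next, dataset_states)
            s = s_next
        if score > best_score:
            best_score, best_action = score, seq[0]
    return best_action, best_score


# Toy problem (all hypothetical): point mass in 2-D, goal at (1, 1),
# offline data concentrated at the origin.
GOAL = np.array([1.0, 1.0])


def toy_model(s, a):
    return s + 0.1 * a


def toy_reward(s, a, s_next):
    return -float(np.linalg.norm(s_next - GOAL))


offline_states = np.zeros((10, 2))
action, score = plan_first_action(np.zeros(2), toy_model, toy_reward, offline_states)
```

In this sketch the bonus only ranks candidate plans during data collection; the reward passed to the learner is left unchanged, and scoring whole multi-step rollouts rather than single actions is what makes the choice non-myopic.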
Findings
PTGOOD outperforms baseline methods in continuous control tasks.
It significantly improves online fine-tuning returns.
It prevents suboptimal policy convergence in several environments.
Abstract
Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data-collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add…
Taxonomy
Topics: Reinforcement Learning in Robotics · Context-Aware Activity Recognition Systems · Age of Information Optimization
