TL;DR
The paper introduces an Adaptive Replay Buffer (ARB) for Offline-to-Online Reinforcement Learning that dynamically prioritizes data based on 'on-policyness' to improve stability and performance.
Contribution
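ARB is a simple, learning-free replay buffer that weights each transition's sampling probability by how closely its trajectory aligns with the current policy, improving O2O RL without a fixed data-mixing ratio or an additional learned component.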
Findings
ARB improves early stability in O2O RL.
ARB enhances final performance across benchmarks.
ARB is easy to implement and integrates into existing O2O RL algorithms.
Abstract
Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively…
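The sampling idea described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the abstract does not give the exact definition of 'on-policyness', so the score used here (the geometric-mean likelihood of a trajectory's actions under the current policy) is an assumption, as are the function names `transition_weights` and `sample_batch` and the `temperature` parameter.

```python
import numpy as np


def transition_weights(traj_log_probs, temperature=1.0):
    """Turn per-trajectory action log-probabilities into per-transition sampling weights.

    traj_log_probs: list of 1-D arrays; entry i holds log pi(a_t | s_t) under the
    CURRENT policy for every transition of trajectory i.
    ASSUMPTION: on-policyness is scored as the mean log-probability of a
    trajectory, so the unnormalized weight is the geometric-mean likelihood.
    """
    scores = np.array([np.mean(lp) for lp in traj_log_probs])
    # Subtract the max before exponentiating for numerical stability;
    # the constant cancels after normalization.
    traj_w = np.exp((scores - scores.max()) / temperature)
    # Every transition inherits the weight of the trajectory it belongs to.
    lengths = [len(lp) for lp in traj_log_probs]
    weights = np.repeat(traj_w, lengths)
    return weights / weights.sum()


def sample_batch(weights, batch_size, rng):
    """Draw transition indices with probability proportional to their weight."""
    return rng.choice(len(weights), size=batch_size, p=weights)


# Hypothetical data: one near-on-policy trajectory and one stale offline trajectory.
traj_log_probs = [
    np.log(np.array([0.9, 0.8])),       # actions the current policy finds likely
    np.log(np.array([0.1, 0.2, 0.1])),  # actions the current policy finds unlikely
]
weights = transition_weights(traj_log_probs)
batch_idx = sample_batch(weights, batch_size=4, rng=np.random.default_rng(0))
```

In this sketch the weights would be recomputed as the policy changes, so offline trajectories that the current policy still explains well keep being sampled while stale ones fade, without a fixed offline/online mixing ratio.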
