Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

Chihyeon Song; Jaewoo Lee; Jinkyoo Park

arXiv:2512.10510 · cs.LG · April 9, 2026

TL;DR

The paper introduces an Adaptive Replay Buffer (ARB) for Offline-to-Online Reinforcement Learning (O2O RL) that dynamically prioritizes replayed data by its 'on-policyness', i.e., how closely it matches the current policy's behavior, to improve both learning stability and final performance.

Contribution

ARB is a simple, learning-free method that adaptively reweights replay sampling according to how closely each trajectory aligns with the current policy, improving O2O RL without extra learned components or complex procedures.

Findings

1. ARB improves early-stage learning stability in O2O RL.
2. ARB improves final performance across the evaluated benchmarks.
3. ARB is easy to implement and integrates into existing O2O RL algorithms.

Abstract

Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively…
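The abstract describes the mechanism only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation (the song970407/ARB repository holds that). It assumes that a trajectory's on-policyness is the geometric-mean likelihood exp(mean_t log π(a_t | s_t)) of its actions under the current policy, that offline and online trajectories share one pool and are scored the same way, and that every transition inherits its trajectory's weight. The 1-D Gaussian policy and all function names (gaussian_log_prob, arb_transition_probs, sample_arb, sample_fixed_ratio) are hypothetical. For contrast, it also shows the fixed data-mixing ratio sampler the abstract names as the standard approach.

```python
import numpy as np


def gaussian_log_prob(states, actions, theta, std=0.5):
    """Hypothetical 1-D Gaussian policy: log pi(a | s) with mean theta * s."""
    mean = theta * states
    return -0.5 * ((actions - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


def arb_transition_probs(trajs, log_prob_fn):
    """Per-transition sampling probabilities proportional to on-policyness.

    Assumed metric: on-policyness of a trajectory is exp(mean_t log pi(a_t | s_t)),
    the geometric-mean likelihood of its actions under the current policy.
    """
    scores = np.array([np.mean(log_prob_fn(t["obs"], t["act"])) for t in trajs])
    # Subtract the max before exponentiating: avoids underflow, keeps ratios.
    traj_weights = np.exp(scores - scores.max())
    # Every transition inherits the weight of the trajectory it belongs to.
    weights = np.concatenate(
        [np.full(len(t["obs"]), w) for t, w in zip(trajs, traj_weights)]
    )
    return weights / weights.sum()


def sample_arb(trajs, log_prob_fn, batch_size, rng):
    """Draw flat transition indices with probability proportional to ARB weights."""
    probs = arb_transition_probs(trajs, log_prob_fn)
    return rng.choice(len(probs), size=batch_size, p=probs)


def sample_fixed_ratio(n_offline, n_online, batch_size, offline_ratio, rng):
    """Baseline: a fixed fraction of every batch is drawn from the offline data.

    Offline transitions occupy indices [0, n_offline); online ones follow.
    """
    n_off = int(round(offline_ratio * batch_size))
    off_idx = rng.integers(0, n_offline, size=n_off)
    on_idx = n_offline + rng.integers(0, n_online, size=batch_size - n_off)
    return np.concatenate([off_idx, on_idx])


# Usage: offline actions come from a behavior policy with theta = 0,
# online actions from the current policy with theta = 1.
rng = np.random.default_rng(0)
offline = {"obs": np.array([1.0, 1.0]), "act": np.array([0.0, 0.0])}
online = {"obs": np.array([1.0, 1.0]), "act": np.array([1.0, 1.0])}
current_policy = lambda s, a: gaussian_log_prob(s, a, theta=1.0)

probs = arb_transition_probs([offline, online], current_policy)
arb_batch = sample_arb([offline, online], current_policy, batch_size=8, rng=rng)
fixed_batch = sample_fixed_ratio(2, 2, batch_size=8, offline_ratio=0.5, rng=rng)
print(probs, arb_batch, fixed_batch)
```

With the current policy centered at theta = 1, each online transition receives about e^2 ≈ 7.4 times the sampling probability of an offline one, while the fixed-ratio sampler always spends half of every batch on offline data regardless of how the policy has moved. Because the scores depend on the current policy, a practical version would have to refresh them as training proceeds; the abstract does not say how often the paper does this.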

Code & Models

Repository: song970407/ARB (GitHub)