TL;DR
This paper introduces DARE, a novel framework for offline-to-online reinforcement learning that dynamically relaxes constraints based on behavioral consistency, improving fine-tuning stability and performance.
Contribution
To the authors' knowledge, DARE is the first method to condition constraint relaxation on behavioral consistency, enabling flexible, sample-level adaptation in offline-to-online RL.
Findings
DARE improves fine-tuning stability on D4RL benchmarks.
DARE achieves superior final performance over strong baselines.
Exchanging samples based on behavior sharpens the distinction between offline and online data.
Abstract
Offline-to-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves during fine-tuning, rendering data origin a misleading basis for constraint handling and thereby leading to objective-data mismatch. We therefore propose Dynamic Alignment for RElaxation (DARE), a distribution-aware framework for sample-level constraint relaxation based on the behavioral consistency with a behavior model. To our knowledge, DARE is the first to condition constraint relaxation on behavioral consistency via a posterior-induced exchange mechanism, moving beyond a binary offline/online data distinction. Importantly, DARE requires only per-sample behavioral alignment, enabling instantiation on top of many offline algorithms with flexible choices of…
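To make the idea of sample-level constraint relaxation concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract is truncated and does not specify the posterior-induced exchange mechanism, so everything below is an assumption made for illustration. The function names `behavior_posterior` and `relaxed_loss`, the two-component Bayes rule, the `prior` parameter, and the linear weighting scheme are all hypothetical. The sketch only shows how a per-sample consistency score with a behavior model could scale a conservatism penalty, instead of a binary offline/online flag.

```python
import numpy as np


def behavior_posterior(logp_behavior, logp_policy, prior=0.5):
    """Hypothetical per-sample posterior that a sample is behavior-consistent.

    Assumes a two-component mixture: the action came either from the
    behavior model (with probability `prior`) or from the current policy.
    Bayes' rule gives posterior = sigmoid(log-odds), computed in log space.
    """
    log_b = np.log(prior) + np.asarray(logp_behavior, dtype=float)
    log_p = np.log1p(-prior) + np.asarray(logp_policy, dtype=float)
    # Clip the negative log-odds so np.exp cannot overflow.
    neg_log_odds = np.clip(log_p - log_b, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(neg_log_odds))


def relaxed_loss(rl_loss, constraint_penalty, posterior, max_weight=1.0):
    """Hypothetical objective: full constraint on behavior-consistent samples,
    relaxed constraint on samples that have drifted from the behavior model."""
    weight = max_weight * np.asarray(posterior, dtype=float)
    total = np.asarray(rl_loss, dtype=float) + weight * np.asarray(
        constraint_penalty, dtype=float
    )
    return float(np.mean(total))


if __name__ == "__main__":
    # Two samples: one equally likely under both models, one far more
    # likely under the behavior model than under the current policy.
    post = behavior_posterior([0.0, 0.0], [0.0, -100.0])
    print(post)
    print(relaxed_loss([1.0, 1.0], [2.0, 2.0], post))
```

In this sketch, a sample keeps its constraint in proportion to how well the behavior model explains it, regardless of whether it was collected offline or online. Any offline algorithm that exposes a per-sample penalty term could, under these assumptions, be wrapped the same way.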
