Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning
Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Gaetan Lin, Hao Chen, Liwei Wu, Ning Jia, Shiji Song, Gao Huang

TL;DR
FamO2O is a framework for offline-to-online reinforcement learning that balances policy improvement against policy constraints separately for each state, so that training can exploit the variation in data quality across states.
Contribution
The paper proposes a family of policies together with a balance model that selects a state-adaptive improvement-constraint balance, improving offline-to-online RL performance.
Findings
Achieves state-of-the-art results on the D4RL benchmark.
Shows statistically significant improvements over existing methods.
Proves theoretically that state-adaptive balances are necessary.
Abstract
Offline-to-online reinforcement learning (RL) is a training paradigm that combines pre-training on a pre-collected dataset with fine-tuning in an online environment. However, the incorporation of online fine-tuning can intensify the well-known distributional shift problem. Existing solutions tackle this problem by imposing a policy constraint on the policy improvement objective in both offline and online learning. They typically advocate a single balance between policy improvement and constraints across diverse data collections. This one-size-fits-all manner may not optimally leverage each collected sample due to the significant variation in data quality across different states. To this end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O utilizes a…
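The idea of a per-state improvement-constraint balance can be illustrated with a minimal sketch. It assumes an advantage-weighted (AWAC/IQL-style) policy update, which the abstract does not specify; the function names, the sigmoid-shaped `toy_balance_model`, and the bounds `low` and `high` are hypothetical and are not taken from the paper.

```python
import numpy as np

def weighted_imitation_loss(log_probs, advantages, balance, max_weight=100.0):
    """Advantage-weighted negative log-likelihood with a per-sample balance.

    balance = 0      -> every weight is 1 (pure imitation, strongest constraint)
    larger balance   -> high-advantage actions dominate (more policy improvement)
    """
    log_probs = np.asarray(log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # A scalar balance reproduces the usual single, one-size-fits-all setting.
    balance = np.broadcast_to(np.asarray(balance, dtype=float), advantages.shape)
    # Clip the exponential weights for numerical stability.
    weights = np.minimum(np.exp(balance * advantages), max_weight)
    return float(-(weights * log_probs).mean())

def toy_balance_model(states, w, low=0.1, high=3.0):
    """Hypothetical balance model: maps each state to a coefficient in (low, high)."""
    z = np.asarray(states, dtype=float) @ np.asarray(w, dtype=float)
    return low + (high - low) / (1.0 + np.exp(-z))

# Usage: one balance coefficient per state instead of a single global value.
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 3))
balance = toy_balance_model(states, w=np.array([0.5, -0.2, 0.1]))
loss = weighted_imitation_loss(
    log_probs=rng.normal(size=4) - 1.0,
    advantages=rng.normal(size=4),
    balance=balance,
)
```

Passing a scalar for `balance` recovers a single fixed trade-off, which makes it easy to compare the fixed and state-adaptive variants in the same code path.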
Taxonomy
Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing · Advanced Bandit Algorithms Research
