Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning
Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang, Miroslav Pajic, Yuichi Motai

TL;DR
This paper introduces an adaptive method for selecting and fine-tuning policies in offline-to-online reinforcement learning, efficiently utilizing limited online interactions to improve policy performance.
Contribution
It proposes a novel adaptive approach that combines offline evaluation and online fine-tuning with an upper-confidence-bound strategy, addressing both the unreliability of offline value estimates and the limited online interaction budget.
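As a rough illustration of what an upper-confidence-bound strategy over candidate policies can look like, the sketch below applies the textbook UCB1 rule to split a fixed interaction budget among several candidates. It is not the paper's algorithm: the names `ucb_select` and `run_budget`, the exploration constant `c`, and the assumption that episode returns are normalized to [0, 1] are all hypothetical. Fine-tuning is omitted. Plain UCB1 also assumes stationary returns, which fine-tuned policies would violate.

```python
import math
from typing import Callable, List, Sequence


def ucb_select(counts: Sequence[int], means: Sequence[float], t: int, c: float = 2.0) -> int:
    """Return the index of the candidate with the highest UCB1 score."""
    # Try every untried candidate once before comparing scores.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # Score = empirical mean return + exploration bonus that shrinks with more trials.
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_budget(rollouts: List[Callable[[], float]], budget: int) -> List[int]:
    """Spend `budget` episodes across candidates; return episodes used per candidate.

    Each entry of `rollouts` is a hypothetical callable that runs one online
    episode with that candidate policy and returns its (normalized) return.
    A real pipeline would also fine-tune the chosen policy here; that step is
    left out of this sketch.
    """
    counts = [0] * len(rollouts)
    means = [0.0] * len(rollouts)
    for t in range(1, budget + 1):
        i = ucb_select(counts, means, t)
        ret = rollouts[i]()
        counts[i] += 1
        # Incremental update of the running mean return for candidate i.
        means[i] += (ret - means[i]) / counts[i]
    return counts


if __name__ == "__main__":
    # Three stand-in candidates with fixed returns, purely for demonstration.
    fake_rollouts = [lambda: 0.1, lambda: 0.9, lambda: 0.5]
    print(run_budget(fake_rollouts, budget=200))
```

With this rule, most of the budget goes to the candidate that looks best so far, while weaker candidates still receive occasional episodes in case their early estimates were misleading.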
Findings
The method outperforms baseline approaches on various benchmarks.
Adaptive selection improves policy performance with limited online interactions.
The approach effectively balances exploration and exploitation in policy fine-tuning.
Abstract
In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline, candidate policies trained with offline RL are evaluated via either off-policy evaluation (OPE) or online evaluation (OE). The policy with the highest estimated value is then deployed and continually fine-tuned. However, this setup has two main issues. First, OPE can be unreliable, making it risky to deploy a policy based solely on those estimates, whereas OE may identify a viable policy only after substantial online interaction, which could otherwise have been spent on fine-tuning. Second, and more importantly, it is often not possible to determine a priori whether a pretrained policy will improve with post-deployment fine-tuning, especially in non-stationary…
