Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL
Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang

TL;DR
This paper introduces a general offline-to-online reinforcement learning method that re-evaluates and calibrates critics to handle dataset-environment mismatches, enabling stable and efficient online fine-tuning from any offline policy.
Contribution
It proposes a novel approach to handle evaluation and improvement mismatches in O2O RL, allowing for general application across various offline and online methods.
Findings
Achieves stable performance improvement on multiple tasks
Outperforms state-of-the-art O2O RL methods
Effectively handles dataset-environment mismatches
Abstract
Offline-to-online (O2O) reinforcement learning (RL) provides an effective means of leveraging an offline pre-trained policy as an initialization to improve performance rapidly with limited online interactions. Recent studies often design fine-tuning strategies for a specific offline RL method and cannot perform general O2O learning from any offline method. To address this problem, we show that evaluation and improvement mismatches exist between the offline dataset and the online environment, and that they hinder the direct application of pre-trained policies to online fine-tuning. In this paper, we propose to handle these two mismatches simultaneously, aiming to achieve general O2O learning from any offline method to any online method. Before online fine-tuning, we re-evaluate the pessimistic critic trained on the offline dataset in an optimistic way and then calibrate the…
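The abstract is cut off before it describes the procedure, so the following is only a minimal sketch of the general idea of re-evaluating a pessimistic critic without its pessimism term. It is not the authors' algorithm. It uses plain tabular TD(0) policy evaluation on logged transitions, and the names `evaluate_policy`, `penalty`, the uniform reward penalty standing in for offline pessimism, and the two-state toy dataset are all assumptions made for illustration.

```python
from collections import defaultdict


def evaluate_policy(transitions, policy, gamma=0.9, penalty=0.0,
                    sweeps=200, lr=0.5):
    """Tabular TD(0) evaluation of a fixed policy on logged transitions.

    penalty > 0 mimics a pessimistic offline critic (a hypothetical uniform
    penalty, chosen only for illustration); penalty == 0 is the plain
    Bellman backup, used here as the "optimistic" re-evaluation.
    """
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # Bootstrap from the action the fixed policy takes in s_next.
            bootstrap = 0.0 if done else gamma * q[(s_next, policy[s_next])]
            target = r - penalty + bootstrap
            q[(s, a)] += lr * (target - q[(s, a)])
    return q


# Hypothetical two-state chain: state 0 leads to state 1, which terminates.
transitions = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 1, True),
]
policy = {0: 1, 1: 1}  # the frozen "offline" policy being evaluated

q_pessimistic = evaluate_policy(transitions, policy, penalty=0.5)
q_optimistic = evaluate_policy(transitions, policy, penalty=0.0)

for key in sorted(q_optimistic):
    print(key, round(q_pessimistic[key], 3), round(q_optimistic[key], 3))
```

In this toy chain the unpenalized critic assigns higher values than the penalized one to every logged state-action pair. That gap is the kind of evaluation mismatch the abstract refers to, although the paper's own re-evaluation and calibration steps may differ substantially from this sketch.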
Taxonomy
Topics: Embedded Systems Design Techniques · Iterative Learning Control Systems · VLSI and Analog Circuit Testing
