Unsupervised-to-Online Reinforcement Learning
Junsu Kim, Seohong Park, Sergey Levine

TL;DR
This paper introduces U2O RL, a novel framework that replaces domain-specific offline RL with unsupervised offline pre-training, enabling better generalization, improved stability, and reusability across multiple downstream tasks in reinforcement learning.
Contribution
The paper proposes a new unsupervised-to-online RL framework that enhances transferability and performance over traditional offline-to-online RL, with a practical recipe for implementation.
Findings
U2O RL matches or outperforms previous offline-to-online RL methods.
U2O RL enables reuse of a single pre-trained model across multiple tasks.
U2O RL improves stability and representation learning in RL environments.
Abstract
Offline-to-online reinforcement learning (RL), a framework that trains a policy with offline RL and then further fine-tunes it with online RL, has been considered a promising recipe for data-driven decision-making. While sensible, this framework has drawbacks: it requires domain-specific offline RL pre-training for each task, and is often brittle in practice. In this work, we propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL, as a better alternative to offline-to-online RL. U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations, which often result in even better performance and stability than supervised offline-to-online RL. To instantiate U2O RL in practice, we propose a general recipe for U2O RL to bridge task-agnostic unsupervised…
Peer Reviews
Decision·Submitted to ICLR 2025
- The paper is well-written, and the proposed U2O RL framework is explained clearly.
- Research into specific paradigms within the pretrain-finetune framework for RL is valuable, and this paper contributes to that discussion.
- While this paper proposes the U2O RL framework, it does not introduce any novel methods. Both the unsupervised offline RL pre-training and the online fine-tuning stages rely on existing algorithms. Additionally, the reward scaling adjustment in the bridging stage has already been employed in prior reward design approaches. Proposing a new “framework” is reasonable, but I believe the paper needs to provide more substantial evidence on why this U2O RL framework is more effective than the traditi
1. The paper is clearly written and well motivated; having a single reusable model for fine-tuning makes sense and can improve RL pre-training.
2. The paper looks at a particular method, HILP, and provides a reward scale matching scheme to enable fine-tuning. This turns out to be quite important for efficient fine-tuning.
3. The paper considers a wide variety of tasks to demonstrate potential improvements over offline-to-online fine-tuning.
1. Insufficient empirical evidence to claim U2O > O2O: in Figure 3, the results do not appear to be significant in 10/14 environments. How can we claim that U2O is a better strategy? Furthermore, insufficient details are provided about the O2O and off-policy RL baselines, e.g., do they use the same network sizes and discount factor? It is clear in Table 1 that the baselines and U2O use different network sizes and discount factor as the prior entries are based on discount of 0.99 and use a network size
Overall, I found the paper well-written and easy to follow. The problem that the authors are working on, pre-training in RL, is important and definitely of interest to the community. For the most part, I found the experiments insightful and thorough. I particularly like the feature dot product analysis; this is a nice addition to the paper.
The biggest issue with this paper is that the abstract claims "we empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches ...". However, Figure 12 suggests that O2O outperforms U2O when the offline data only contains expert data. As such, this claim that U2O either outperforms or matches O2O is false. It is dependent on the type of offline data. The authors clearly address this in the conclusion but as a reader, I
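The reviews above refer to a reward scale matching scheme in the bridging stage, without giving its formula. As a rough illustration only, the sketch below assumes one plausible form: an affine rescaling of downstream task rewards so that their mean and standard deviation match those of the intrinsic rewards seen during pre-training. The function name `match_reward_scale`, the `eps` parameter, and the synthetic reward values are hypothetical and are not taken from the paper.

```python
import random
import statistics


def match_reward_scale(task_rewards, intrinsic_rewards, eps=1e-8):
    """Affinely rescale task rewards to the mean/std of intrinsic rewards.

    Assumed form of "reward scale matching"; the paper's exact scheme
    may differ.
    """
    t_mean = statistics.fmean(task_rewards)
    t_std = statistics.pstdev(task_rewards)
    i_mean = statistics.fmean(intrinsic_rewards)
    i_std = statistics.pstdev(intrinsic_rewards)
    # eps guards against division by zero when all task rewards are equal.
    scale = i_std / (t_std + eps)
    return [(r - t_mean) * scale + i_mean for r in task_rewards]


# Synthetic example: small-magnitude intrinsic rewards vs. sparse 0/100 task rewards.
rng = random.Random(0)
intrinsic = [rng.gauss(-0.5, 0.1) for _ in range(1000)]
task = [rng.choice([0.0, 100.0]) for _ in range(1000)]
scaled = match_reward_scale(task, intrinsic)
```

The intuition behind such a step is that value functions learned under the intrinsic reward are tuned to its magnitude, so feeding in task rewards on a very different scale at the start of fine-tuning could destabilize the pre-trained critic.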
Taxonomy
Topics: Smart Grid Energy Management · Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control
