A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning

Siyuan Guo; Yanchao Sun; Jifeng Hu; Sili Huang; Hechang Chen; Haiyin Piao; Lichao Sun; Yi Chang

arXiv:2306.07541·cs.LG·January 19, 2026·2 cites

Open Access · 3 Reviews

TL;DR

This paper introduces SUNG, a unified uncertainty-guided framework that enhances offline-to-online reinforcement learning by using uncertainty estimation for exploration and adaptive exploitation, achieving state-of-the-art finetuning results.

Contribution

SUNG unifies exploration and exploitation strategies in offline-to-online RL using a VAE-based uncertainty estimator, improving finetuning performance across diverse environments.

Findings

01. Achieves state-of-the-art online finetuning performance.

02. Effectively balances exploration and exploitation via uncertainty.

03. Demonstrates robustness across various datasets and environments.

Abstract

Offline reinforcement learning (RL) offers a promising way to learn an agent from data alone. However, constrained by the limited quality of the offline dataset, its performance is often sub-optimal. It is therefore desirable to further finetune the agent via extra online interactions before deployment. Unfortunately, offline-to-online RL is difficult because of two main challenges: constrained exploratory behavior and state-action distribution shift. In view of this, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solution to both challenges through uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy to select informative actions with both high…
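As a rough illustration of the optimistic exploration idea in the abstract, and of the bi-level action selection that the reviews refer to, the sketch below keeps a small finalist set of candidate actions and then picks the most uncertain one. It assumes that the first level ranks candidates by estimated Q-value, which the truncated abstract here does not confirm. The function name `select_optimistic_action`, the parameter `k`, and the precomputed `uncertainties` array (standing in for the output of the VAE-based estimator) are hypothetical and not taken from the paper's code.

```python
import numpy as np


def select_optimistic_action(candidates, q_values, uncertainties, k):
    """Hypothetical bi-level selection over candidate actions.

    Level 1 (assumed criterion): keep the k candidates with the largest Q-value.
    Level 2: among those finalists, return the one with the highest uncertainty.
    """
    candidates = np.asarray(candidates)
    q_values = np.asarray(q_values, dtype=float)
    uncertainties = np.asarray(uncertainties, dtype=float)
    # Clamp k so that at least one and at most all candidates are finalists.
    k = max(1, min(k, len(q_values)))
    # argsort is ascending, so the last k indices have the largest Q-values.
    finalists = np.argsort(q_values)[-k:]
    # Pick the finalist whose uncertainty estimate is largest.
    best = finalists[np.argmax(uncertainties[finalists])]
    return candidates[best], int(best)


# Toy example: four 1-D candidate actions sampled near the current policy.
candidates = np.array([[0.0], [0.1], [0.2], [0.3]])
q_values = [1.0, 3.0, 2.5, 0.5]
uncertainties = [0.9, 0.1, 0.6, 0.95]
action, index = select_optimistic_action(candidates, q_values, uncertainties, k=2)
print(index, action)  # 2 [0.2]
```

With `k=1` the rule reduces to greedy selection by value, and with `k` equal to the number of candidates it reduces to pure uncertainty seeking, so `k` acts as the knob between the two behaviors in this sketch.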

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

Offline to Online learning is still a relatively new discipline and the authors appear to have found a simple yet effective method to outperform prior works. Selecting actions optimistically in the face of uncertainty seems like a good exploration strategy for O2O, since it's been proven to work in prior works on other exploration tasks. Especially the fact that the method is compatible with many offline RL algorithms that can be used under the hood as base algorithm appears to be a practical ad

Weaknesses

I find the formulation of the SUNG framework a bit counterintuitive: the authors mention that they want high-uncertainty actions, yet at the same time they only sample "near-on-policy actions for exploration", which appears contradictory. Further, during the optimization / policy improvement part (green arrows in fig 1), the same percentage p of the batch is always labeled as OOD, which is not consistent, since the absolute uncertainty value at which a sample could be labeled OOD can var
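The second concern can be made concrete with a minimal sketch that contrasts labeling a fixed fraction p of each batch as OOD with labeling by an absolute uncertainty threshold. The function names, the toy uncertainty values, and the threshold of 1.0 are all hypothetical; this is not the paper's implementation, only an illustration of why the two rules can disagree.

```python
import math

import numpy as np


def label_ood_top_fraction(uncertainties, p):
    """Mark the ceil(p * n) most uncertain samples of a batch as OOD."""
    u = np.asarray(uncertainties, dtype=float)
    n_ood = math.ceil(p * len(u))
    mask = np.zeros(len(u), dtype=bool)
    if n_ood > 0:
        # argsort is ascending, so the last n_ood indices are the most uncertain.
        mask[np.argsort(u)[-n_ood:]] = True
    return mask


def label_ood_threshold(uncertainties, threshold):
    """Mark every sample whose uncertainty exceeds an absolute threshold as OOD."""
    return np.asarray(uncertainties, dtype=float) > threshold


# Two toy batches: one with uniformly low uncertainty, one with uniformly high.
low_batch = [0.10, 0.20, 0.15, 0.12]
high_batch = [2.0, 3.0, 2.5, 2.2]

top_low = label_ood_top_fraction(low_batch, p=0.25)
top_high = label_ood_top_fraction(high_batch, p=0.25)
print(top_low.sum(), top_high.sum())  # 1 1

thr_low = label_ood_threshold(low_batch, threshold=1.0)
thr_high = label_ood_threshold(high_batch, threshold=1.0)
print(thr_low.sum(), thr_high.sum())  # 0 4
```

The fixed-fraction rule flags one sample in each batch regardless of how uncertain the batch is overall, while the absolute threshold flags none of the low-uncertainty batch and all of the high-uncertainty one.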

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. This paper is well-written and easy to follow. 2. The problem studied in this paper is important and has attracted increasing attention. 3. The experiment is thorough, and the authors compared SUNG against a large pool of recent methods.

Weaknesses

This paper incrementally adds many existing techniques, making evaluating its contribution difficult. For example, the utilization of VAE for uncertainty quantification cannot distinguish SUNG from MANY offline-to-online or offline RL methods [1,2]. The bi-level action selection is a relatively heuristic strategy; the authors did not provide any theoretical analysis/insight into why it is effective, especially for the claim "we establish the ranking criteria for the finalist action set as uncert

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

Research into the domain of online finetuning holds significant importance within the field of offline learning. The experimental evaluation suggests that there is potential for improvement in the finetuning performance when the proposed approach is combined with various offline RL methods across a range of environments and datasets from the D4RL benchmark. These findings indicate the adaptability and practicality of the suggested technique in different settings. The paper demonstrates a high

Weaknesses

The primary concern raised with regard to this paper pertains to its novelty. The concept of leveraging uncertainty in the context of offline learning is a well-established one. From the perspective of reviewers, the key innovation in this article lies in the utilization of a VAE for quantifying uncertainty, which does not represent a notable departure from conventional methods. While this paper introduces a straightforward empirical method, it is notable for its absence of a comprehensive theo

Taxonomy

Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing