Unsupervised Data Generation for Offline Reinforcement Learning: A Perspective from Model

Shuncheng He; Hongchang Zhang; Jianzhun Shao; Yuhang Jiang; Xiangyang Ji

arXiv:2506.19643 · cs.LG · June 25, 2025

4 Reviews

TL;DR

This paper explores how unsupervised data generation can improve offline reinforcement learning by bridging the gap between behavioral and optimal policies, supported by theoretical analysis and empirical validation.

Contribution

It introduces a theoretical framework linking data distribution differences to performance gaps and proposes UDG, an unsupervised data generation method, to enhance offline RL in task-agnostic settings.

Findings

1. UDG outperforms supervised data generation on unknown tasks.

2. Theoretical analysis links distribution divergence to performance gaps.

3. Empirical results validate the effectiveness of UDG in offline RL.

Abstract

Offline reinforcement learning (RL) has recently gained growing interest from RL researchers. However, the performance of offline RL suffers from the out-of-distribution problem, which can be corrected by feedback in online RL. Previous offline RL research focuses on restricting the offline algorithm in in-distribution even in-sample action sampling. In contrast, fewer work pays attention to the influence of the batch data. In this paper, we first build a theoretical bridge between the batch data and the performance of offline RL algorithms, from the perspective of model-based offline RL optimization. We conclude that, under mild assumptions, the distance between the state-action pair distribution generated by the behavioural policy and the distribution generated by the optimal policy accounts for the performance gap between the policy learned by model-based offline RL and the…
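To make the central quantity concrete, here is a minimal sketch of an empirical distance between two sets of samples. It assumes the distance is a 1-Wasserstein distance (the form Reviewer 01 describes) and, for simplicity, that each state-action pair has been reduced to a single scalar feature. The function name `wasserstein_1d` and the Gaussian toy data are illustrative assumptions, not the paper's implementation.

```python
import random


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal coupling pairs the sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)
# Hypothetical scalar features of state-action pairs (toy data, not from the paper).
optimal = [rng.gauss(1.0, 0.2) for _ in range(2000)]
near_behaviour = [rng.gauss(0.9, 0.2) for _ in range(2000)]
far_behaviour = [rng.gauss(-1.0, 0.2) for _ in range(2000)]

d_near = wasserstein_1d(near_behaviour, optimal)
d_far = wasserstein_1d(far_behaviour, optimal)
print(f"near: {d_near:.3f}, far: {d_far:.3f}")
```

Under the abstract's claim, batch data whose distribution lies closer to that of the optimal policy (the smaller of the two distances here) would be expected to leave a smaller performance gap.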

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

1. While the observation that the optimality gap depends on the distance between the optimal policy and the sampling policy is somewhat well-known in offline RL research (e.g., [1][2][3], to name a few), the authors draw an explicit connection between the Wasserstein distance of visitation measures and the optimality gap, which is novel and interesting to me. [1] Y. Duan, Z. Jia, M. Wang, Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation. (2020) [2] T. Xie et al., Poli

Weaknesses

1. The approximations in the theory sections need more careful handling. In particular, on page 5 of the paper, the authors assume that (1) the empirical visitation $\hat{\rho}^{\pi}_{T} \approx \rho^{\pi}_{T} \approx \rho^{\pi}_{\hat{T}}$. In addition, the authors further assume that (2) the visitation $\rho^{\pi^*}_{\hat{T}} \approx \hat{\rho}^{\pi^\beta}_{T}$. I have a few questions regarding the two hypotheses. * While I agree that (1) holds if the dataset is sufficiently large, I wonder what would be the challenge of offline RL in su

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

* Paper is on a relevant topic to Offline RL, which is understanding the importance of the data itself on the performance of algorithms.

Weaknesses

The paper has a significant number of shortcomings. * The paper has a number of language issues. For instance, the title itself is grammatically incorrect; the subtitle should be something like "A model-based perspective" or "A perspective from model-based Reinforcement Learning". In the abstract, "Previous offline RL research focuses on restricting the offline algorithm in in-distribution even in-sample action sampling. In contrast, fewer work pays attention to the influence of the batch data"

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

According to the authors' claim, this paper is the first work to establish a theoretical link between behavior batch data and the performance of offline RL algorithms in Lipschitz continuous environments. The authors introduce a framework, UDG, for unsupervised offline RL. Experimental results show that on robotic locomotion tasks like Ant-Angle and Cheetah-Jump, UDG outperforms traditional offline RL that uses either random or supervised data.

Weaknesses

The motivation of this article is insufficient. One fundamental notion we perceive in offline reinforcement learning is the high cost associated with data collection. However, the paper assumes that data can be freely collected in a virtual environment, which seems contradictory to the original intention behind the design of offline reinforcement learning. This indicates a potential misalignment in perspective. Furthermore, the core idea presented in the study appears to lack originality, raisin

Reviewer 04 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

The topic of generating data for offline RL in an unsupervised fashion seems appealing. The paper provides a theoretical analysis of the problem of unsupervised data generation and instantiates its theoretical findings as a practical minimax algorithm. Empirical experiments show the effectiveness of unsupervised data generation.

Weaknesses

The evaluation is limited. The experiments are only carried out on two environments, Cheetah and Ant-Angle, while even MuJoCo alone has over 10 tasks where the authors can validate their method. For example, environments from https://arxiv.org/pdf/2110.15191.pdf may be adopted. Comparison with prior literature seems inadequate. As discussed in the "Unsupervised RL" section of related works, there is extensive literature for unsupervised RL driven by "curiosity, maximum coverage of states, and d
