ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts

Jing-Cheng Pang; Kaiyuan Li; Yidi Wang; Si-Hang Yang; Shengyi Jiang; Yang Yu

arXiv:2505.10010·cs.LG·May 16, 2025

TL;DR

ImagineBench is a new benchmark designed to evaluate offline reinforcement learning algorithms that utilize both real environment data and large language model-generated imaginary rollouts across diverse tasks, highlighting current limitations and future opportunities.

Contribution

This paper introduces ImagineBench, the first comprehensive benchmark for evaluating offline RL with LLM-generated imaginary rollouts across multiple domains and task complexities.

Findings

01. Existing offline RL algorithms perform poorly on unseen tasks with imaginary rollouts.
02. Performance on hard tasks is significantly lower with imaginary rollouts compared to real data.
03. The benchmark reveals the need for improved algorithms to better leverage LLM-generated experience.

Abstract

A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can mitigate this limitation by generating synthetic experience (noted as imaginary rollouts) for mastering novel tasks, progress in this emerging field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ImagineBench, the first comprehensive benchmark for evaluating offline RL algorithms that leverage both real rollouts and LLM-imaginary rollouts. The key features of ImagineBench include: (1) datasets comprising environment-collected and LLM-imaginary rollouts; (2) diverse domains of environments covering locomotion, robotic manipulation, and navigation tasks; and (3) natural language task instructions with varying complexity levels to facilitate…
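As an illustration of what training on "both real rollouts and LLM-imaginary rollouts" can look like in practice, the following is a minimal sketch, not the ImagineBench API. It assumes each dataset is a dictionary of NumPy arrays with hypothetical keys `observations`, `actions`, and `rewards`; the helper names `merge_rollouts`, `sample_batch`, and `toy_rollouts`, as well as the `imaginary_fraction` parameter, are invented here for illustration. The toy data is random and stands in for real benchmark data.

```python
import numpy as np


def merge_rollouts(real, imaginary):
    """Concatenate two transition dicts and tag every row with its source."""
    merged = {key: np.concatenate([real[key], imaginary[key]], axis=0) for key in real}
    n_real = len(real["observations"])
    n_imag = len(imaginary["observations"])
    # False marks environment-collected transitions, True marks LLM-generated ones.
    merged["is_imaginary"] = np.concatenate(
        [np.zeros(n_real, dtype=bool), np.ones(n_imag, dtype=bool)]
    )
    return merged


def sample_batch(dataset, batch_size, imaginary_fraction, rng):
    """Sample a batch containing a fixed share of imaginary transitions."""
    imag_idx = np.flatnonzero(dataset["is_imaginary"])
    real_idx = np.flatnonzero(~dataset["is_imaginary"])
    n_imag = int(round(batch_size * imaginary_fraction))
    n_real = batch_size - n_imag
    idx = np.concatenate([
        rng.choice(real_idx, size=n_real, replace=True),
        rng.choice(imag_idx, size=n_imag, replace=True),
    ])
    return {key: value[idx] for key, value in dataset.items()}


def toy_rollouts(n, obs_dim, act_dim, rng):
    """Random placeholder transitions (hypothetical stand-in for real data)."""
    return {
        "observations": rng.normal(size=(n, obs_dim)),
        "actions": rng.normal(size=(n, act_dim)),
        "rewards": rng.normal(size=(n,)),
    }


rng = np.random.default_rng(0)
real = toy_rollouts(6, obs_dim=4, act_dim=2, rng=rng)
imaginary = toy_rollouts(10, obs_dim=4, act_dim=2, rng=rng)
dataset = merge_rollouts(real, imaginary)
batch = sample_batch(dataset, batch_size=8, imaginary_fraction=0.25, rng=rng)
```

Keeping an explicit `is_imaginary` flag is one possible design choice: it lets an offline RL algorithm weight or filter the two sources differently instead of treating synthetic and real transitions as interchangeable.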

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper provides extensive experiments demonstrating the strengths and limitations of existing offline RL methods when trained on synthetic data.

Weaknesses

1) Typo in the title: it should be "Rollouts" instead of "Rollout." It is written correctly on OpenReview but has a typo in the PDF. Together with the white-space issues throughout the paper (e.g., before section titles or after Figure 7), it feels like the paper was over length and the authors cut it far too aggressively.
2) Line 74: What is the difference between novel and unseen?
3) The related work section should also discuss methods that learn/use dynamics models (or as they are

Reviewer 02 · Rating 2 · Confidence 3

Strengths

**Clarity**
* The paper is well written and uses clear language.
* All components of the benchmark are sufficiently well described for me to understand the tasks and objectives of the benchmark.

**Novelty**
* I’m not aware of any benchmark for studying the effect of LLM generated rollouts in RL.

**Experimental evaluation**
* There is a quite extensive set of experiments in this paper which is quite laudable.

Weaknesses

**Clarity**
* The figures are very difficult to read.
* The analysis of the generated data should come before any RL experiments are executed. Otherwise, it is unclear why the RL agents perform the way they do.

**Related Work**
* The related work seems to largely focus on LLM generated data and it is unclear to me why generating data is specific to LLMs. To me, data generation with diffusion is just as relevant to the topic and there is plenty of work on this out there.

**Motivation**
* It is u

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The paper presents *ImagineBench*, a large-scale benchmark designed to evaluate reinforcement learning (RL) algorithms trained with both real and LLM-generated (“imaginary”) rollouts. The motivation is relevant, as the community lacks standardized evaluation for LLM-driven synthetic data in RL. The framework integrates multiple existing environments (Meta-World, LIBERO, MuJoCo, etc.) under a unified interface and provides detailed documentation, hierarchical task levels, and open-source code for

Weaknesses

The proposed benchmark has several weaknesses:
1. The benchmark appears to simply integrate existing simulation benchmarks (Meta-World, LIBERO, MuJoCo, etc.). No new simulation scenarios are introduced. Large-scale integration behind a universal interface has already been done by previous works such as RoboVerse.
2. The benchmark tasks that the authors selected focus only on relatively simple manipulation tasks whose completion requires only one primitive or locomotion t

Reviewer 04 · Rating 6 · Confidence 3

Strengths

1. Timely and relevant contribution: The paper addresses an emerging topic at the intersection of LLMs and reinforcement learning by providing a standardized benchmark for RL from imaginary LLM rollouts.
2. Comprehensive design: ImagineBench covers multiple environments (Meta-World, LIBERO, BabyAI, CLEVR-Robot, MuJoCo) and diverse goals with different levels of complexity.
3. Empirical evaluation: The study benchmarks a set of offline RL methods, providing valuable reference results for fut

Weaknesses

1. Methodological transparency in LLM fine-tuning: The paper briefly mentions adding layers to handle environmental data but lacks details on architecture modifications, representation of continuous variables, or training objectives.
2. Methodological transparency in Section 5: Some aspects of the evaluation in Section 5 are unclear (see Questions for more details).
3. Table 2 only reports aggregate success, transition, and legality scores for one environment. A more systematic analysis betw

Code & Models

Repositories

lamda-rl/imaginebench
PyTorch · Official

Datasets

NJU-RLer/ImagineBench
dataset · 20 dl

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling