WorldGym: World Model as An Environment for Policy Evaluation

Julian Quevedo; Ansh Kumar Sharma; Yixiang Sun; Varad Suryavanshi; Percy Liang; Sherry Yang

arXiv:2506.00613·cs.RO·October 1, 2025

3 Reviews

TL;DR

WorldGym is a novel environment built on a world model that enables efficient, realistic, and safe evaluation of robot control policies by simulating real-world interactions with minimal input data.

Contribution

We introduce WorldGym, a world-model-based environment for policy evaluation that correlates well with real-world success and enables efficient testing with minimal data.

Findings

01

Policy success rates in WorldGym correlate with real-world success.

02

WorldGym preserves relative policy rankings across different models.

03

It enables evaluation of generalization to new tasks with only initial frames.
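The first two findings correspond to two different statistics: a linear correlation between simulated and real success rates, and a rank correlation that checks whether policies are ordered the same way. The sketch below computes both in plain Python. The function names and the success rates at the bottom are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: how linearly two lists of numbers move together."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical success rates for four policies (illustrative only).
real_success = [0.20, 0.45, 0.70, 0.85]
sim_success = [0.10, 0.40, 0.55, 0.90]

print("Pearson :", pearson(real_success, sim_success))
print("Spearman:", spearman(real_success, sim_success))
```

A Spearman value close to 1 means the simulated evaluation orders the policies the same way the real-world evaluation does, even when the absolute success rates differ.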

Abstract

Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model which serves as a proxy to real world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model highly correlate with real-world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Due to requiring only a single start frame as input, the world…
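The evaluation loop described in the abstract (roll a policy forward inside the world model from one start frame, score the resulting video, and average over many rollouts) can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: every name here is hypothetical, a "frame" is reduced to a single float (distance to a goal), the world model is a noisy additive update, and the reward function is a threshold check standing in for the vision-language model grader.

```python
import random
from typing import Callable, List

Frame = float   # assumption: stand-in for an image observation
Action = float  # assumption: stand-in for a robot action


def rollout(world_model: Callable, policy: Callable, reward_fn: Callable,
            first_frame: Frame, instruction: str, horizon: int,
            rng: random.Random) -> float:
    """Autoregressively generate frames from one start frame, then score them."""
    frames: List[Frame] = [first_frame]
    for _ in range(horizon):
        action = policy(frames[-1], instruction)          # policy sees the latest frame
        frames.append(world_model(frames, action, rng))   # model predicts the next frame
    return reward_fn(frames, instruction)                 # 1.0 for success, 0.0 otherwise


def estimate_success_rate(world_model: Callable, policy: Callable,
                          reward_fn: Callable, first_frame: Frame,
                          instruction: str, horizon: int = 10,
                          num_rollouts: int = 50, seed: int = 0) -> float:
    """Monte Carlo estimate: mean reward over independent rollouts."""
    rng = random.Random(seed)
    returns = [rollout(world_model, policy, reward_fn, first_frame,
                       instruction, horizon, rng)
               for _ in range(num_rollouts)]
    return sum(returns) / num_rollouts


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_world_model(frames: List[Frame], action: Action, rng: random.Random) -> Frame:
    return frames[-1] + action + rng.gauss(0.0, 0.02)     # stochastic next "frame"


def toy_reward(frames: List[Frame], instruction: str) -> float:
    return 1.0 if abs(frames[-1]) < 0.1 else 0.0          # placeholder for a VLM judge


def reaching_policy(frame: Frame, instruction: str) -> Action:
    return -0.5 * frame                                   # moves halfway to the goal


def idle_policy(frame: Frame, instruction: str) -> Action:
    return 0.0                                            # never moves


good = estimate_success_rate(toy_world_model, reaching_policy, toy_reward,
                             1.0, "reach the goal")
bad = estimate_success_rate(toy_world_model, idle_policy, toy_reward,
                            1.0, "reach the goal")
print("reaching policy:", good, "| idle policy:", bad)
```

Because the world model is stochastic, a single rollout is a noisy success signal; averaging over several rollouts per start frame gives the success-rate estimate that is then compared across policies.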

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- WorldGym addresses a genuine need for safe, reproducible, and cost-effective policy testing before real-world deployment.
- WorldGym shows impressive correlation between simulated and real-world success rates, backed by strong empirical validation.
- The single world model generalizes across diverse tasks and environments.
- The paper is well-written, the problem motivation is compelling, and the approach is clearly described.

Weaknesses

- The paper does not provide quantitative metrics on world model prediction quality.
- The world model still produces physically implausible predictions, as shown in Figure 14. The paper also does not propose any method to help the video model learn real-world physics.
- The paper does not compare with baselines such as building digital twins for policy evaluation.
- The paper does not report the computational requirements or the inference speed of the world model.
- The paper does not show qualitative resu

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Originality: reframes policy evaluation as “rollout in one learned world” rather than per-task simulators; leverages the one-world prior and diverse training data.
2. Practicality: one real frame + actions, no hand-coded simulators; horizon–chunk alignment is a clean trick that supports mixed policies while saving compute.
3. Clarity: the OPE formulation and rollout protocol are easy to follow; model/policy interfaces are explicit.
4. Significance: high sim-to-real correlation and preserved

Weaknesses

1. VLM reward calibration is under-analyzed: the proposed VLM grader is central, but the paper does not show reliability audits (human agreement, prompt/temperature sensitivity, temporal credit). The authors should add thorough calibration and robustness studies.
2. Dynamics fidelity over long horizons is not quantified: the work shows plausibility, but does not report compounding-error metrics (e.g., FVD/LPIPS vs time, controllability under action perturbations). The authors should measure erro

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper introduces a clear and innovative use of video diffusion world models for policy evaluation instead of training, which reframes how offline policy analysis can be done without physical robots. The experiments show consistent correlation between simulated and real results, maintain relative performance rankings across models, and include OOD tests that reveal model weaknesses. Technically, the framework is efficient, combining causal temporal attention and adaptive horizon prediction to

Weaknesses

I believe a more convincing demonstration of WorldGym’s effectiveness would be to run reinforcement learning with WorldGym as the environment and then test the resulting policy in simulation or the real world, but the paper does not do that.

Taxonomy

Methods: Linear Layer · Adam · Dense Connections · Softmax · Diffusion · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding