WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Fangqi Zhu; Zhengyang Yan; Zicong Hong; Quanxin Shou; Xiao Ma; Song Guo

arXiv:2511.09515·cs.RO·November 13, 2025

Open Access · 3 Reviews

TL;DR

WMPO introduces a pixel-based world model for vision-language-action reinforcement learning, enabling efficient, on-policy training without real-world interactions, leading to improved performance and self-correction in robotic tasks.

Contribution

The paper presents WMPO, a novel framework that uses pixel-based world models for on-policy RL in VLA models, reducing sample complexity and enhancing robustness.

Findings

01. WMPO significantly improves sample efficiency in robotic learning.

02. WMPO achieves stronger performance than off-policy methods.

03. WMPO demonstrates emergent self-correction and lifelong learning capabilities.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained with web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO that provides stronger performance than the often-used off-policy methods. Extensive experiments in both…
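The abstract names GRPO as the on-policy update, and one review describes the recipe as world-model rollouts scored by an outcome classifier. As a rough illustration only, the group-relative advantage step of GRPO can be sketched as follows. The function name, the binary reward values, the use of the population standard deviation, and the `eps` constant are all assumptions made for this sketch and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts (GRPO-style).

    `rewards` holds one scalar per rollout sampled from the same start
    state. Hypothetical helper, not the paper's implementation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary outcomes (1.0 = success) for four imagined rollouts
outcomes = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(outcomes)
```

Successful rollouts receive positive advantages and failed ones negative advantages, while a group in which every rollout has the same outcome yields all-zero advantages and so contributes no learning signal under this normalization.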

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The method section of the paper is concise and informative.

2. I believe that there are sufficient ablations being done on the method, and I like that the authors have demonstrated good robustness of the method when dealing with OOD settings and desirable scalability.

Weaknesses

1. I believe that the paper did not address how this method can be extended into a generalist setting. The scope of the environment is also rather limited, albeit there are adequate ablations being conducted.

2. In addition, the paper used OpenVLA-OFT as the base policy. This again limits how much promise the method can bring to generalist policies. If the authors can provide additional ablations without using OpenVLA-OFT, I believe this can strengthen the paper.

3. I believe that the paper did

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Clear, modular recipe: world-model rollouts + outcome classifier + GRPO, with practical choices (pixel-space decoding to match VLA features; noisy-frame conditioning; frame-level action injection). The write-up is concrete and reproducible.

Consistent empirical gains over strong baselines across four MimicGen tasks and two rollout budgets; improvements grow with budget (data-efficiency + scaling).

Behavioral insights: convincing qualitative evidence of self-correction and reduced “getting

Weaknesses

Heavy compute / practicality: training uses 32× H100 for world-model/WMPO phases (plus 8× H100 for SFT). The paper would benefit from wall-clock, throughput, and ablations on smaller budgets/hardware.

Model-world fidelity & safety: while qualitative results are strong, there’s limited quantitative assessment of rollout fidelity (e.g., per-step action-conditioned metrics), failure taxonomy, or safety constraints—especially since outcome-only rewards can reward shortcuts.

Reviewer 03 · Rating 4 · Confidence 2

Strengths

1. The paper addresses a critical bottleneck in VLA RL: sample inefficiency and brittleness of imitation learning. While prior works (e.g., RT-2, OpenVLA) have shown impressive generalization, they remain confined to IL and struggle to recover from failures. WMPO’s goal — learning to self-correct via on-policy RL in a world model — is both ambitious and well-justified.

2. The authors’ fine-tuning of the world model on policy-generated trajectories is a principled way to close the distribution

Weaknesses

1. Overclaiming of “On-Policy” Scalability Without Real Costs: While WMPO avoids real-world rollouts during optimization, it still requires 128–1280 real trajectories to fine-tune the world model and initialize policy behavior alignment. This is not zero-shot or low-data RL — it is offline world-model RL with modest real data. Recent works like IRASim [1] and World4RL [2] also use diffusion world models but start from far fewer real trajectories (e.g., 50–100). The paper does not compare

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning