Mask World Model: Predicting What Matters for Robust Robot Policy Learning

Yunfan Lou; Xiaowei Chi; Xiaojie Zhang; Zezhong Qian; Chengxuan Li; Rongyu Zhang; Yaoxu Lyu; Guoyu Song; Chuyao Fu; Haoxuan Xu; Pengwei Wang; Shanghang Zhang

arXiv:2604.19683·cs.RO·April 23, 2026

TL;DR

The paper introduces the Mask World Model (MWM), a novel approach that predicts semantic masks instead of pixels to improve the robustness and generalization of robot policies, outperforming RGB-based models.

Contribution

The paper proposes MWM, which leverages video diffusion architectures to focus on essential physical dynamics, reducing overfitting to irrelevant visual details for better robot control.
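To make the intuition concrete, the toy sketch below contrasts an RGB observation with a semantic-mask observation of the same scene under a change in lighting and sensor noise. It is not the authors' pipeline: the renderer `render_rgb`, the helper `one_hot_mask`, the three-class scene, and the assumption that masks come from a perfect segmenter are all hypothetical and chosen only for illustration.

```python
import numpy as np

def render_rgb(labels, palette, brightness=1.0, noise_std=0.0, seed=0):
    """Toy renderer (hypothetical): colour each label, then apply nuisance factors."""
    rng = np.random.default_rng(seed)
    rgb = palette[labels].astype(np.float64) * brightness
    rgb += rng.normal(0.0, noise_std, size=rgb.shape)
    return np.clip(rgb, 0.0, 1.0)

def one_hot_mask(labels, num_classes):
    """Semantic-mask representation: one binary channel per class."""
    return np.eye(num_classes, dtype=np.float64)[labels]

# 8x8 scene with background (0), an object (1), and a gripper (2).
labels = np.zeros((8, 8), dtype=np.int64)
labels[2:5, 2:5] = 1
labels[5:7, 5:7] = 2
palette = np.array([[0.2, 0.2, 0.2], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])

# Same geometry, different illumination and sensor noise.
rgb_a = render_rgb(labels, palette, brightness=1.0)
rgb_b = render_rgb(labels, palette, brightness=0.6, noise_std=0.05, seed=1)

# Assumption: an oracle segmenter recovers the label map in both conditions.
mask_a = one_hot_mask(labels, 3)
mask_b = one_hot_mask(labels, 3)

rgb_gap = float(np.abs(rgb_a - rgb_b).mean())     # nonzero: pixels changed
mask_gap = float(np.abs(mask_a - mask_b).mean())  # zero: geometry unchanged
print(f"RGB gap: {rgb_gap:.3f}, mask gap: {mask_gap:.3f}")
```

In this idealized setting the mask representation is unchanged by the nuisance factors, while the pixel representation is not. In practice masks would come from a learned segmenter that can itself make errors under such shifts, so the invariance holds only approximately.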

Findings

01. MWM outperforms RGB-based models on the LIBERO and RLBench benchmarks.

02. MWM demonstrates superior robustness to visual noise and texture loss.

03. Real-world experiments confirm improved generalization and resilience.

Abstract

World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust…
