Mirage2Matter: A Physically Grounded Gaussian World Model from Video

Zhengqing Gao; Ziwen Li; Xin Wang; Jiaxin Huang; Zhenyang Ren; Mingkai Shao; Hanlue Zhang; Tianyu Huang; Yongkang Cheng; Yandong Guo; Runqi Lin; Yuanyuan Wang; Tongliang Liu; Kun Zhang; Mingming Gong

arXiv:2602.00096·cs.CV·February 3, 2026

TL;DR

This paper introduces a scalable, photorealistic world-modeling framework built on multi-view videos and 3D Gaussian Splatting, enabling simulation for embodied AI without expensive sensors, precise robot calibration, or depth measurements.

Contribution

The paper presents a simulation framework that reconstructs real environments from video into high-fidelity, physically grounded scene models, making embodied AI training data more scalable and practical to generate.

Findings

1. Vision-Language-Action (VLA) models trained on the simulated data achieve strong zero-shot performance.

2. The framework achieves high-fidelity scene reconstruction from multi-view videos.

3. Training on simulated data matches or surpasses training on real-world data in downstream tasks.

Abstract

The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition