World Guidance: World Modeling in Condition Space for Action Generation

Yue Su; Sijin Chen; Haixin Shi; Mingyu Liu; Zhengshen Zhang; Ningyuan Huang; Weiheng Zhong; Zhengbang Zhu; Yuxiao Liu; Xihui Liu

arXiv:2602.22010 · cs.RO · February 26, 2026

TL;DR

This paper introduces WoG, a framework that models future observations in a compact condition space to improve fine-grained action generation and generalization in Vision-Language-Action models, validated through extensive experiments.

Contribution

WoG proposes a novel approach to map future observations into a compact condition space, enhancing action inference and world modeling capabilities.

Findings

01 Outperforms existing future prediction methods in simulation and real-world tasks.

02 Achieves better generalization and fine-grained action generation.

03 Learns effectively from large-scale human manipulation videos.

Abstract

Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits…
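
The joint training idea in the abstract (predicting compressed conditions alongside future actions) can be illustrated with a toy objective. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so the mean-squared-error terms, the weighted sum, and every name below (`mse`, `joint_loss`, `condition_weight`) are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred_actions: Sequence[float],
    target_actions: Sequence[float],
    pred_condition: Sequence[float],
    target_condition: Sequence[float],
    condition_weight: float = 1.0,  # hypothetical trade-off weight, not from the paper
) -> float:
    """Toy joint objective: action error plus weighted condition error.

    Assumption: target_condition stands in for a compact encoding of future
    observations; how that encoding is produced is not modeled here.
    """
    action_term = mse(pred_actions, target_actions)
    condition_term = mse(pred_condition, target_condition)
    return action_term + condition_weight * condition_term


# Usage: two action dimensions, one condition dimension.
loss = joint_loss([0.0, 0.0], [1.0, 1.0], [0.0], [2.0], condition_weight=0.5)
```

In this sketch the condition term acts as an auxiliary prediction target next to the action target; whether the actual method weights, schedules, or structures the two terms this way is not stated in the abstract.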

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning