TL;DR
This paper introduces World Action Models (WAMs), a unified framework combining predictive environment modeling with action generation for embodied AI, and systematically surveys the current landscape.
Contribution
It formally defines WAMs, organizes existing methods into a taxonomy, and analyzes the data ecosystems and evaluation protocols in this emerging field.
Findings
WAMs unify environment prediction with action generation.
Existing methods are categorized into Cascaded and Joint WAMs.
Evaluation protocols focus on visual fidelity, physical commonsense, and action plausibility.
Abstract
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models (predictive models of environment dynamics) into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early…
