Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan; Zibin Dong; Yicheng Liu; Hang Zhao

arXiv:2603.16666 · cs.CV · March 24, 2026

Open Access · 2 Models · 2 Datasets

TL;DR

Fast-WAM shows that explicit future imagination at test time is not necessary for strong embodied control: keeping video co-training during training while skipping future prediction at inference yields faster inference and competitive performance.

Contribution

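The paper introduces Fast-WAM, a WAM architecture that retains video co-training during training but removes future prediction at test time. This design separates the effect of video modeling during training from that of explicit future generation during inference, and shows that the former maintains performance while significantly reducing inference latency.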

Findings

01. Fast-WAM achieves over 4x faster inference than WAMs that follow the imagine-then-execute paradigm.

02. Video co-training during training matters more for action performance than future prediction at test time.

03. Fast-WAM performs competitively on simulation and real-world benchmarks.

Abstract

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM…
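
The split the abstract describes, video co-training during training with no future prediction at test time, can be illustrated with a toy sketch. This is not the authors' architecture: the linear backbone, the two heads, the mean-squared-error losses, and every name below (`encode`, `training_loss`, `act`, `video_weight`, the dimensions) are assumptions made for illustration, and a single regression head stands in for the iterative video denoising the abstract mentions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the toy example small.
OBS_DIM, LATENT_DIM, ACTION_DIM, FUTURE_DIM = 8, 16, 4, 8

# Assumed layout: one shared backbone feeding an action head and a video head.
params = {
    "backbone": rng.normal(scale=0.1, size=(OBS_DIM, LATENT_DIM)),
    "action_head": rng.normal(scale=0.1, size=(LATENT_DIM, ACTION_DIM)),
    "video_head": rng.normal(scale=0.1, size=(LATENT_DIM, FUTURE_DIM)),
}


def encode(params, obs):
    """Shared representation used by both heads."""
    return np.tanh(obs @ params["backbone"])


def training_loss(params, obs, action, future_obs, video_weight=1.0):
    """Co-training objective: action loss plus a weighted video-modeling loss.

    The video term is a plain regression stand-in (an assumption), so the
    backbone receives a training signal from future observations.
    """
    z = encode(params, obs)
    action_loss = np.mean((z @ params["action_head"] - action) ** 2)
    video_loss = np.mean((z @ params["video_head"] - future_obs) ** 2)
    return action_loss + video_weight * video_loss


def act(params, obs):
    """Test-time path: backbone and action head only; the video head never runs."""
    return encode(params, obs) @ params["action_head"]


# At deployment the video head can be dropped entirely without changing actions.
deploy_params = {k: v for k, v in params.items() if k != "video_head"}
batch_obs = rng.normal(size=(2, OBS_DIM))
actions = act(deploy_params, batch_obs)
```

In this sketch the video head influences the backbone only through the training loss, so the inference cost is that of the backbone and the action head alone. How much of the reported speedup comes from this kind of split in the real model is something only the paper's experiments can establish.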

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics