One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jing, De Ma, Gang Pan, Bin Liu

TL;DR
This paper introduces OneWM-VLA, a method that compresses visual input into a single token per frame, maintaining long-horizon performance while reducing visual bandwidth in world models for VLA policies.
Contribution
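It parameterizes the world module with a single semantic token per view per frame, obtained through Adaptive Attention Pooling, and trains the latent stream and the action trajectory under one flow-matching objective instead of linking them through a separate decoder.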
Findings
Per-frame visual bandwidth can be reduced to a single token without performance loss.
OneWM-VLA improves success rates on MetaWorld, LIBERO-Long, and Fold Cloth tasks.
The method achieves significant performance gains with fewer visual inputs.
Abstract
Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent-action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth…
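The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single learned query, the projection matrices `w_k` and `w_v`, the linear noise-to-data path, and all dimensions (196 patches, width 64, 8 frames, 16 action steps of size 7) are assumptions chosen for illustration. The first function pools a view's patch tokens into one token with attention weights; the second evaluates a flow-matching regression loss on a joint target formed by concatenating the per-frame tokens with an action trajectory.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patch_tokens, query, w_k, w_v):
    """Pool (num_patches, dim) patch tokens into a single (dim,) token.

    Assumed form: one learned query attends over all patches of a view.
    """
    keys = patch_tokens @ w_k                         # (num_patches, dim)
    values = patch_tokens @ w_v                       # (num_patches, dim)
    scores = keys @ query / np.sqrt(query.shape[0])   # (num_patches,)
    weights = softmax(scores)                         # sums to 1 over patches
    return weights @ values, weights

def flow_matching_loss(velocity_fn, x1, rng):
    """Flow-matching loss on a linear path from noise x0 (t=0) to data x1 (t=1).

    The path and time convention are assumptions, not taken from the paper.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform()
    xt = (1.0 - t) * x0 + t * x1     # point on the straight-line path
    target = x1 - x0                 # constant velocity of that path
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
dim, num_patches, num_frames = 64, 196, 8

# One pooled token per frame (hypothetical random weights stand in for learned ones).
query = rng.standard_normal(dim)
w_k = rng.standard_normal((dim, dim)) / np.sqrt(dim)
w_v = rng.standard_normal((dim, dim)) / np.sqrt(dim)
frame_tokens = np.stack([
    attention_pool(rng.standard_normal((num_patches, dim)), query, w_k, w_v)[0]
    for _ in range(num_frames)
])                                                    # (num_frames, dim)

# Joint target: latent stream and action trajectory under one objective.
actions = rng.standard_normal((16, 7))
joint_target = np.concatenate([frame_tokens.ravel(), actions.ravel()])

# Placeholder velocity model that always predicts zero.
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), joint_target, rng)
```

In this sketch the world module would see `num_frames` tokens per view rather than `num_frames * num_patches`, which is the bandwidth reduction the abstract describes; a real velocity model would replace the zero-valued placeholder.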
