One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Zuojin Tang; Shengchao Yuan; Xiaoxin Bai; Zhiyuan Jing; De Ma; Gang Pan; Bin Liu

arXiv:2605.07931·cs.CV·May 15, 2026

TL;DR

This paper introduces OneWM-VLA, which compresses each view into a single semantic token per frame before it reaches the world module of a VLA policy, reducing per-frame visual bandwidth while maintaining long-horizon performance.

Contribution

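It proposes a way to parameterize the world module on top of a pretrained VLA with minimal visual information: Adaptive Attention Pooling reduces each view to one semantic token per frame, and the resulting latent stream and the action trajectory are produced under a single flow-matching objective rather than through a separate decoder.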
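
To make the "single token per frame" idea concrete, here is a minimal sketch of attention pooling with one query vector, written in plain Python. It is not the paper's Adaptive Attention Pooling module, whose details are not given on this page: the function names, the absence of key/value projections, and the sizes (196 patch tokens of width 8) are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patch_tokens, query):
    """Pool N patch tokens (each of width D) into a single D-wide token.

    One query vector scores every patch token of a frame; the output is
    the attention-weighted average of the tokens. Keys and values are the
    raw tokens here (no learned projections), which is a simplification.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    scores = [scale * sum(q * x for q, x in zip(query, tok)) for tok in patch_tokens]
    weights = softmax(scores)
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, patch_tokens)) for d in range(dim)
    ]
    return pooled, weights


# Hypothetical sizes: 196 patch tokens of width 8 for one view of one frame.
rng = random.Random(0)
patch_tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(196)]
query = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stands in for a learned parameter

frame_token, weights = attention_pool(patch_tokens, query)
print(len(patch_tokens), "patch tokens ->", 1, "token of width", len(frame_token))
```

In this sketch the world module would receive only `frame_token` for each frame, so the per-frame input shrinks from 196 vectors to one while the vector width stays the same.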

Findings

01. Per-frame visual bandwidth can be reduced to a single token without performance loss.

02. OneWM-VLA improves success rates on MetaWorld, LIBERO-Long, and Fold Cloth tasks.

03. The method achieves significant performance gains with fewer visual inputs.

Abstract

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth…
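
The abstract's "single flow-matching objective" over the latent stream and the action trajectory can be illustrated with a small sketch. It assumes the common linear-path form of flow matching (straight line from noise to target, constant velocity `target - noise`); the paper's exact path, conditioning, and network are not stated here. The names `interpolate`, `flow_matching_loss`, `oracle_velocity`, and `zero_velocity`, as well as all sizes, are hypothetical.

```python
import random


def interpolate(noise, target, t):
    """Point on the straight path from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def flow_matching_loss(velocity_fn, target, noise, t):
    """Mean squared error between predicted and true path velocity.

    For the linear path above, the true velocity is (target - noise).
    """
    x_t = interpolate(noise, target, t)
    predicted = velocity_fn(x_t, t)
    true_velocity = [x - n for n, x in zip(noise, target)]
    return sum((p - v) ** 2 for p, v in zip(predicted, true_velocity)) / len(target)


rng = random.Random(0)
# Hypothetical sizes: 4 future frame tokens of width 8, and 4 actions of width 7.
latent_stream = [rng.gauss(0.0, 1.0) for _ in range(4 * 8)]
action_trajectory = [rng.gauss(0.0, 1.0) for _ in range(4 * 7)]

# Concatenate both outputs into one target so a single loss covers them,
# instead of decoding actions from latents with a separate head.
joint_target = latent_stream + action_trajectory
noise = [rng.gauss(0.0, 1.0) for _ in joint_target]
t = rng.random()


def oracle_velocity(x_t, t):
    """Placeholder 'perfect' model: returns the true velocity."""
    return [x - n for n, x in zip(noise, joint_target)]


def zero_velocity(x_t, t):
    """Placeholder untrained model: predicts no motion."""
    return [0.0 for _ in x_t]


print("oracle loss:", flow_matching_loss(oracle_velocity, joint_target, noise, t))
print("zero-model loss:", flow_matching_loss(zero_velocity, joint_target, noise, t))
```

The point of the sketch is the concatenated target: the same velocity prediction and the same loss term account for both the latent tokens and the actions. A real model would replace the placeholder velocity functions with a trained network conditioned on the observation and instruction.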
