The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Takuya Shiba

TL;DR
This paper introduces the 'Compression Gap' principle: the scaling behavior of a vision-language-action model is governed by its tightest information bottleneck. When actions are discretized, that bottleneck is the action codebook rather than the vision encoder, so encoder upgrades yield little performance improvement.
Contribution
It presents an information-theoretic framework for understanding scaling limits in visuomotor pipelines, emphasizing bottlenecks such as the fixed-capacity codebook used in discrete action representations.
Findings
Encoder upgrades improve Diffusion Policy performance significantly.
When actions are discretized (e.g., OAT), codebook capacity limits the impact of encoder improvements.
Relaxing codebook size partially recovers encoder sensitivity, confirming the bottleneck hypothesis.
Abstract
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three…
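The bottleneck argument in the abstract can be illustrated with a small back-of-the-envelope sketch. This is not the paper's formalization: it only assumes the standard counting bound that a sequence of n tokens drawn from a codebook of K entries can carry at most n * log2(K) bits, and that information cannot increase downstream of a bottleneck. The function names and every number used below (encoder bits, codebook sizes, tokens per action) are hypothetical and chosen purely for illustration.

```python
import math
from typing import Optional


def codebook_capacity_bits(codebook_size: int, tokens_per_action: int = 1) -> float:
    """Upper bound, in bits, on what `tokens_per_action` discrete tokens
    drawn from a codebook of `codebook_size` entries can carry."""
    if codebook_size < 1 or tokens_per_action < 1:
        raise ValueError("codebook_size and tokens_per_action must be >= 1")
    return tokens_per_action * math.log2(codebook_size)


def pipeline_bound_bits(
    encoder_bits: float,
    codebook_size: Optional[int] = None,
    tokens_per_action: int = 1,
) -> float:
    """Tightest bottleneck in a toy encoder -> (optional codebook) -> action chain.

    `encoder_bits` is a hypothetical stand-in for how much task-relevant
    information the vision encoder passes on. With no codebook (continuous
    actions) the encoder is the only constraint in this toy model.
    """
    if codebook_size is None:
        return encoder_bits
    return min(encoder_bits, codebook_capacity_bits(codebook_size, tokens_per_action))


if __name__ == "__main__":
    # Hypothetical numbers: a weaker and a stronger encoder.
    for encoder_bits in (50.0, 80.0):
        continuous = pipeline_bound_bits(encoder_bits)
        small_cb = pipeline_bound_bits(encoder_bits, codebook_size=256, tokens_per_action=4)
        large_cb = pipeline_bound_bits(encoder_bits, codebook_size=1024, tokens_per_action=4)
        print(f"encoder={encoder_bits:.0f}  continuous={continuous:.0f}  "
              f"K=256: {small_cb:.0f}  K=1024: {large_cb:.0f}")
```

In this toy setting the continuous pipeline's bound rises with the encoder, the small codebook's bound stays fixed no matter which encoder is used, and enlarging the codebook raises the ceiling only partially. That mirrors the qualitative pattern described in the findings, without claiming anything about the paper's actual measurements.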
