The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Takuya Shiba

arXiv:2604.03191·cs.RO·April 6, 2026

TL;DR

This paper introduces the 'Compression Gap' principle: in a vision-language-action pipeline, scaling gains are governed by the tightest information bottleneck. When actions are discretized through a fixed-capacity codebook, that codebook becomes the bottleneck, so upgrading the vision encoder yields little improvement.

Contribution

It presents an information-theoretic framework for understanding scaling limits in visuomotor pipelines, emphasizing how bottlenecks such as the codebook in a discrete action representation cap the benefit of upstream improvements.
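
One way to make the bottleneck argument concrete is the data processing inequality. The sketch below assumes the pipeline can be modeled as a Markov chain from observation O to encoder features Z to discrete action tokens T to executed action A. The symbols, and the choice of n tokens drawn from a codebook of size K, are illustrative assumptions rather than notation taken from the paper.

```latex
O \to Z \to T \to A
\quad\Longrightarrow\quad
I(O; A) \;\le\; \min\{\, I(O; Z),\; I(Z; T) \,\},
\qquad
I(Z; T) \;\le\; H(T) \;\le\; n \log_2 K .
```

Under these assumptions, a richer encoder raises I(O; Z) but leaves the bound n log_2 K untouched, so the information reaching the action cannot grow once the codebook term is the smaller of the two.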

Findings

01

With continuous actions (Diffusion Policy), upgrading the vision encoder directly improves performance.

02

When actions are discretized (e.g., OAT), the fixed-capacity codebook limits how much encoder improvements can help.

03

Relaxing codebook size partially recovers encoder sensitivity, confirming the bottleneck hypothesis.
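
The pattern in these three findings can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes each stage can be summarized by a single capacity in bits, takes the pipeline capacity as the minimum over stages, and uses hypothetical numbers (encoder capacities of 64 and 128 bits, 8 tokens per action) chosen only for illustration.

```python
import math


def codebook_capacity_bits(codebook_size: int, tokens_per_action: int) -> float:
    """Upper bound on the bits carried by a sequence of discrete tokens."""
    return tokens_per_action * math.log2(codebook_size)


def pipeline_capacity_bits(
    encoder_bits: float, codebook_size: int, tokens_per_action: int
) -> float:
    """Toy model: the end-to-end capacity is set by the tightest stage."""
    return min(encoder_bits, codebook_capacity_bits(codebook_size, tokens_per_action))


# Hypothetical encoder capacities, not measurements from the paper.
SMALL_ENCODER_BITS = 64.0
LARGE_ENCODER_BITS = 128.0


def encoder_upgrade_gain(codebook_size: int, tokens_per_action: int) -> float:
    """Extra bits that reach the action when the encoder is upgraded."""
    before = pipeline_capacity_bits(SMALL_ENCODER_BITS, codebook_size, tokens_per_action)
    after = pipeline_capacity_bits(LARGE_ENCODER_BITS, codebook_size, tokens_per_action)
    return after - before


if __name__ == "__main__":
    for codebook_size in (256, 4096, 65536):
        gain = encoder_upgrade_gain(codebook_size, tokens_per_action=8)
        print(f"codebook size {codebook_size}: encoder upgrade gain = {gain} bits")
```

In this toy setting the smallest codebook passes none of the encoder upgrade through, the intermediate one passes part of it, and the largest passes all of it, which mirrors the qualitative story of a bottleneck that relaxes as codebook size grows.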

Abstract

Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it, regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three…
