Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

Bowen Zheng; Weijian Luo; Guang Yang; Colin Zhang; Tianyang Hu

arXiv:2605.06207·cs.CV·May 8, 2026

TL;DR

This paper introduces Variable Codebook Size Quantization (VCQ), a method that grows the codebook size along the sequence to overcome the Entropy Cliff in autoregressive visual generation, reducing gFID from 27.98 to 14.80 on ImageNet 256x256.

Contribution

The paper formalizes the entropy cliff phenomenon and proposes VCQ, which grows codebook size along sequences, leading to better image generation quality without extra training techniques.

Findings

01 VCQ reduces gFID from 27.98 to 14.80 on ImageNet 256x256.

02 Scaled-up VCQ achieves gFID 1.71 with 684M parameters.

03 Coarse-to-fine semantic hierarchy emerges naturally with VCQ.

Abstract

Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K = 16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_{2} N / \log_{2} K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook…
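The Entropy Cliff expression can be checked with a few lines of arithmetic. The sketch below assumes that $N$ is the number of training images and uses the ImageNet-1k training set size (1,281,167) for it; the abstract does not state this value, and the function name `entropy_cliff_position` is hypothetical. One way to read the formula: a prefix of $t$ tokens can take at most $K^{t}$ distinct values, so once $K^{t} \geq N$ every training sample can have a unique prefix and the remaining tokens carry no further uncertainty.

```python
import math


def entropy_cliff_position(num_samples: int, codebook_size: int) -> int:
    """Return t* = ceil(log2(N) / log2(K)).

    num_samples   -- N, assumed here to be the training-set size
    codebook_size -- K, the number of entries in the shared codebook
    """
    if num_samples < 1 or codebook_size < 2:
        raise ValueError("need num_samples >= 1 and codebook_size >= 2")
    return math.ceil(math.log2(num_samples) / math.log2(codebook_size))


# Assumption: N is the ImageNet-1k training set size, not a value from the abstract.
IMAGENET_TRAIN_IMAGES = 1_281_167

t_star = entropy_cliff_position(IMAGENET_TRAIN_IMAGES, 16384)
print(t_star)  # 2, since log2(1,281,167) is about 20.29 and log2(16384) = 14
```

Under that assumption the result agrees with the abstract's statement that the cliff is reached within 2 of the 256 positions.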
