When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization
Vivek Ramanujan, Kushal Tirumala, Armen Aghajanyan, Luke Zettlemoyer, Ali Farhadi

TL;DR
This paper explores the trade-off between how well a visual tokenizer reconstructs images and how easy its latents are for a generative model to learn, and introduces a new regularization method that improves both the efficiency and the performance of image generation models.
Contribution
It introduces Causally Regularized Tokenization (CRT), a novel regularization technique that embeds inductive biases to improve generative performance and efficiency in visual tokenization.
Findings
Smaller models benefit from more compressed latents despite worse reconstruction.
CRT improves generation performance and compute efficiency by 2-3×.
The optimized pipeline matches LlamaGen-3B performance with fewer tokens and parameters.
Abstract
Current image generation methods are based on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. This reveals a fundamental trade-off: should we compress more aggressively to make the latent distribution easier for the stage 2 model to learn, even if reconstruction gets worse? We study this problem in the context of discrete, auto-regressive image generation. Through the lens of scaling laws, we show that smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, demonstrating that generation modeling capacity plays a role in this trade-off. Diving deeper, we rigorously study the connection between compute scaling and the stage 1 rate-distortion trade-off. Next, we introduce Causally…
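To make "compress more aggressively" concrete for discrete tokenizers, here is a minimal sketch of one common way to measure the stage 1 rate: each image becomes a grid of codebook indices, so the rate is the number of tokens times log2 of the codebook size. The function name `tokenizer_rate` and the grid and codebook sizes below are hypothetical illustrations, not configurations taken from the paper.

```python
import math


def tokenizer_rate(grid_h, grid_w, codebook_size, image_h, image_w):
    """Rate of a discrete tokenizer (hypothetical helper, for illustration only).

    Returns (num_tokens, bits_per_image, bits_per_pixel). num_tokens is also
    the sequence length a stage 2 autoregressive model has to predict.
    """
    num_tokens = grid_h * grid_w
    bits_per_token = math.log2(codebook_size)  # each token indexes one codebook entry
    bits_per_image = num_tokens * bits_per_token
    bits_per_pixel = bits_per_image / (image_h * image_w)
    return num_tokens, bits_per_image, bits_per_pixel


if __name__ == "__main__":
    # Hypothetical settings: same 256x256 image and codebook, two grid sizes.
    for grid in (16, 8):
        tokens, bits, bpp = tokenizer_rate(grid, grid, 16384, 256, 256)
        print(f"{grid}x{grid} grid: {tokens} tokens, {bits:.0f} bits, {bpp:.4f} bpp")
```

Shrinking the grid from 16×16 to 8×8 cuts the rate by 4× and shortens the stage 2 sequence by the same factor. Whether the resulting loss in reconstruction is worth it is the trade-off the paper studies.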
Taxonomy
Topics: Digital Media Forensic Detection · Advanced Steganography and Watermarking Techniques · Visual Attention and Saliency Detection
Methods: Diffusion
