When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization

Vivek Ramanujan; Kushal Tirumala; Armen Aghajanyan; Luke Zettlemoyer; Ali Farhadi

arXiv:2412.16326·cs.CV·December 12, 2025

TL;DR

This paper studies the trade-off in visual tokenization between how well a tokenizer reconstructs images and how easily a generative model can learn its latents, and introduces a regularization method that improves generation performance and compute efficiency.

Contribution

It introduces Causally Regularized Tokenization (CRT), a novel regularization technique that embeds inductive biases to improve generative performance and efficiency in visual tokenization.

Findings

01. Smaller models benefit from more compressed latents despite worse reconstruction.

02. CRT improves generation performance and compute efficiency by 2-3×.

03. The optimized pipeline matches LlamaGen-3B performance with fewer tokens and parameters.

Abstract

Current image generation methods are based on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. This reveals a fundamental trade-off: should we compress more aggressively to make the latent distribution easier for the stage 2 model to learn, even if doing so makes reconstruction worse? We study this problem in the context of discrete, auto-regressive image generation. Through the lens of scaling laws, we show that smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, demonstrating that generation modeling capacity plays a role in this trade-off. Diving deeper, we rigorously study the connection between compute scaling and the stage 1 rate-distortion trade-off. Next, we introduce Causally…
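The stage 1 rate-distortion trade-off mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's tokenizer: it assumes a 1-D signal on [0, 1] and a fixed, evenly spaced codebook in place of a learned one, and all function names are hypothetical. It only shows that a smaller codebook yields fewer bits per token (more compression) at the cost of higher reconstruction error.

```python
import math
import random


def uniform_codebook(k):
    """Return k evenly spaced code values on [0, 1] (toy stand-in for a learned codebook)."""
    return [(i + 0.5) / k for i in range(k)]


def quantize(values, codebook):
    """Map each value to the index of its nearest code; the index is the discrete token."""
    return [
        min(range(len(codebook)), key=lambda j: abs(v - codebook[j]))
        for v in values
    ]


def reconstruct(tokens, codebook):
    """Decode tokens back to values by codebook lookup."""
    return [codebook[t] for t in tokens]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rate_distortion(values, k):
    """Return (bits per token, reconstruction MSE) for a codebook of size k."""
    codebook = uniform_codebook(k)
    tokens = quantize(values, codebook)
    rate_bits = math.log2(k)  # upper bound on bits needed per token
    distortion = mse(values, reconstruct(tokens, codebook))
    return rate_bits, distortion


rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [rng.random() for _ in range(1000)]

results = {k: rate_distortion(signal, k) for k in (4, 16, 64)}
for k, (rate, dist) in results.items():
    print(f"codebook size {k:3d}: {rate:.0f} bits/token, MSE {dist:.6f}")
```

In this toy setting, the smallest codebook gives the shortest code per token and the worst reconstruction. The question the abstract raises is whether a stage 2 model of limited capacity is nonetheless better served by that more compressed representation.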

Taxonomy

Topics: Digital Media Forensic Detection · Advanced Steganography and Watermarking Techniques · Visual Attention and Saliency Detection

Methods: Diffusion