VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation
Huawei Lin, Tong Geng, Zhaozhuo Xu, Weijie Zhao

TL;DR
VTBench is a comprehensive benchmark for evaluating visual tokenizers in autoregressive image generation, revealing that continuous VAEs outperform discrete VTs in preserving image details and structure.
Contribution
The paper introduces VTBench, a benchmark that evaluates visual tokenizers in isolation from end-to-end generation quality, across three tasks: Image Reconstruction, Detail Preservation, and Text Preservation. Its results show continuous VAEs outperforming discrete VTs.
Findings
Continuous VAEs outperform discrete VTs in image reconstruction.
Discrete VTs often lose fine details and text in images.
GPT-4o shows potential as an autoregressive image generator.
Abstract
Autoregressive (AR) models have recently shown strong performance in image generation, where a critical component is the visual tokenizer (VT) that maps continuous pixel inputs to discrete token sequences. The quality of the VT largely defines the upper bound of AR model performance. However, current discrete VTs fall significantly behind continuous variational autoencoders (VAEs), leading to degraded image reconstructions and poor preservation of details and text. Existing benchmarks focus on end-to-end generation quality, without isolating VT performance. To address this gap, we introduce VTBench, a comprehensive benchmark that systematically evaluates VTs across three core tasks: Image Reconstruction, Detail Preservation, and Text Preservation, and covers a diverse range of evaluation scenarios. We systematically assess state-of-the-art VTs using a set of metrics to evaluate the…
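To make the tokenize-then-reconstruct idea concrete, here is a minimal sketch of a toy discrete tokenizer: each continuous patch vector is replaced by the index of its nearest codebook entry, decoded back by lookup, and scored against the original. Everything in it is an assumption for illustration and not VTBench's code: the function names, the random codebooks, the patch sizes, and the choice of PSNR as the score (the abstract excerpt above is truncated before it names its metrics). Real visual tokenizers also use a learned encoder and decoder around the quantizer, which this sketch omits.

```python
import math
import random


def quantize(patches, codebook):
    """Map each continuous patch vector to the index of its nearest codebook entry."""
    tokens = []
    for patch in patches:
        # Squared Euclidean distance from this patch to every codebook vector.
        dists = [sum((p - c) ** 2 for p, c in zip(patch, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Reconstruct patches by looking each token up in the codebook."""
    return [list(codebook[t]) for t in tokens]


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    flat_o = [v for patch in original for v in patch]
    flat_r = [v for patch in reconstructed for v in patch]
    mse = sum((a - b) ** 2 for a, b in zip(flat_o, flat_r)) / len(flat_o)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical data: 64 "patches" of 4 values each, in [0, 1).
rng = random.Random(0)
patches = [[rng.random() for _ in range(4)] for _ in range(64)]

# Two hypothetical codebooks; the large one contains the small one.
small_codebook = [[rng.random() for _ in range(4)] for _ in range(8)]
large_codebook = small_codebook + [[rng.random() for _ in range(4)] for _ in range(248)]

tokens_small = quantize(patches, small_codebook)
tokens_large = quantize(patches, large_codebook)

psnr_small = psnr(patches, dequantize(tokens_small, small_codebook))
psnr_large = psnr(patches, dequantize(tokens_large, large_codebook))

print(f"PSNR with   8 codes: {psnr_small:.2f} dB")
print(f"PSNR with 256 codes: {psnr_large:.2f} dB")
```

Because the larger codebook is a superset of the smaller one, its reconstruction error can never be higher. Quantization error of this kind is one reason a discrete tokenizer can lose fine detail that a continuous latent keeps, and it is the loss that a tokenizer-only benchmark is positioned to measure.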
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Computer Graphics and Visualization Techniques
Methods: Focus · Sparse Evolutionary Training
