VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation

Huawei Lin; Tong Geng; Zhaozhuo Xu; Weijie Zhao

arXiv:2505.13439 · cs.CV · May 20, 2025

TL;DR

VTBench is a comprehensive benchmark for evaluating visual tokenizers in autoregressive image generation, revealing that continuous VAEs outperform discrete VTs in preserving image details and structure.

Contribution

The paper introduces VTBench, a new benchmark for systematically assessing visual tokenizers across multiple tasks, highlighting the superiority of continuous VAEs over discrete VTs.

Findings

01. Continuous VAEs outperform discrete VTs in image reconstruction.

02. Discrete VTs often lose fine details and text in images.

03. GPT-4o shows potential as an autoregressive image generator.

Abstract

Autoregressive (AR) models have recently shown strong performance in image generation, where a critical component is the visual tokenizer (VT) that maps continuous pixel inputs to discrete token sequences. The quality of the VT largely defines the upper bound of AR model performance. However, current discrete VTs fall significantly behind continuous variational autoencoders (VAEs), leading to degraded image reconstructions and poor preservation of details and text. Existing benchmarks focus on end-to-end generation quality, without isolating VT performance. To address this gap, we introduce VTBench, a comprehensive benchmark that systematically evaluates VTs across three core tasks: Image Reconstruction, Detail Preservation, and Text Preservation, and covers a diverse range of evaluation scenarios. We systematically assess state-of-the-art VTs using a set of metrics to evaluate the…
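The Image Reconstruction task described above amounts to a round trip: encode an image to tokens, decode it back, and score how close the result is to the input. The sketch below illustrates only that idea. The abstract is truncated before it names its metrics, so PSNR is used here purely as a familiar example and is an assumption, not the paper's stated choice. The functions `toy_tokenize` and `toy_detokenize` are hypothetical stand-ins (a uniform scalar quantizer), not any tokenizer evaluated in VTBench.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)


def toy_tokenize(pixels, levels):
    """Hypothetical discrete tokenizer: map each 0..255 pixel to one of `levels` bins."""
    step = 256 / levels
    return [min(int(p / step), levels - 1) for p in pixels]


def toy_detokenize(tokens, levels):
    """Hypothetical decoder: map each token back to the centre of its bin."""
    step = 256 / levels
    return [(t + 0.5) * step for t in tokens]


def roundtrip_psnr(pixels, levels):
    """Encode, decode, and score the reconstruction against the input."""
    tokens = toy_tokenize(pixels, levels)
    return psnr(pixels, toy_detokenize(tokens, levels))


# Synthetic flattened grayscale "image" (assumption: real use would load pixel data).
image = [(37 * i) % 256 for i in range(1024)]

coarse = roundtrip_psnr(image, levels=4)   # small vocabulary, lossy
fine = roundtrip_psnr(image, levels=64)    # larger vocabulary, less lossy
print(f"4 levels: {coarse:.2f} dB, 64 levels: {fine:.2f} dB")
```

Because the score is computed on the tokenizer round trip alone, no generator is involved, which is the kind of isolation the abstract says existing end-to-end benchmarks lack. In this toy setting a larger vocabulary gives a higher PSNR; it says nothing about how real discrete VTs compare with continuous VAEs.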

Code & Models

Repositories

huawei-lin/VTBench · PyTorch · Official

Datasets

huaweilin/VTBench · dataset · 113 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Computer Graphics and Visualization Techniques

Methods: Focus · Sparse Evolutionary Training