TokBench: Evaluating Your Visual Tokenizer before Visual Generation

Junfeng Wu; Dongliang Luo; Weizhi Zhao; Zhihao Xie; Yuanhao Wang; Junyi Li; Xudong Xie; Yuliang Liu; Xiang Bai

arXiv:2505.18142 · cs.CV · May 27, 2025

TL;DR

This paper introduces TokBench, a lightweight benchmark for evaluating the reconstruction quality of visual tokenizers and VAEs on text and face images, revealing their limitations in preserving fine details.

Contribution

The paper presents a novel, efficient benchmark for assessing visual tokenizer performance on challenging content, highlighting their shortcomings in fine-grained feature preservation.

Findings

01 Modern visual tokenizers struggle with fine-grained features at small scales.

02 Traditional metrics do not accurately reflect reconstruction quality for faces and text.

03 The benchmark is lightweight, requiring only 2 GB of memory and 4 minutes to run.

Abstract

In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging visual contents: text and face. Visual tokenizers and VAEs have significantly advanced visual generation and multimodal modeling by providing more efficient compressed or quantized image representations. However, while helping production models reduce computational burdens, the information loss from image compression fundamentally limits the upper bound of visual generation quality. To evaluate this upper bound, we focus on assessing reconstructed text and facial features since they typically: 1) exist at smaller scales, 2) contain dense and rich textures, 3) are prone to collapse, and 4) are highly sensitive to human vision. We first collect and curate a diverse set of clear text and face images…

Taxonomy

Topics: Interactive and Immersive Displays · Data Visualization and Analytics · Multimedia Communication and Technology

Methods: Focus · Sparse Evolutionary Training