End-to-End Vision Tokenizer Tuning

Wenxuan Wang; Fan Zhang; Yufeng Cui; Haiwen Diao; Zhuoyan Luo; Huchuan Lu; Jing Liu; Xinlong Wang

arXiv:2505.10562·cs.CV·May 16, 2025

Open Access

TL;DR

This paper introduces ETT, an end-to-end tuning method that optimizes the vision tokenizer jointly with downstream autoregressive tasks, yielding 2-6% gains on multimodal understanding and generation.

Contribution

ETT is a novel approach that enables joint optimization of vision tokenizers with downstream tasks, addressing the misalignment caused by decoupled training.

Findings

01. Achieves 2-6% performance improvements on multimodal understanding and generation tasks.

02. Maintains the tokenizer's original reconstruction capability.

03. Integrates easily into existing training pipelines.

Abstract

Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: The loss of the vision tokenization can be the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating them. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen…
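
The contrast between decoupled and joint optimization can be illustrated with a toy sketch. This is not the authors' implementation: the function name `train`, the four-entry codebook, the linear task head, the synthetic data, and the loss weight `alpha` are all assumptions made for illustration. The sketch only shows the general idea that, in joint tuning, the task loss and a reconstruction term both send gradients into the tokenizer's codebook, whereas a frozen tokenizer receives none.

```python
import numpy as np

def train(joint, steps=100, lr=0.01, alpha=0.5, seed=0):
    """Toy loop: a codebook 'tokenizer' feeding a linear task head.

    joint=False keeps the codebook frozen (decoupled training);
    joint=True also updates the selected codebook entry.
    All sizes and losses are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    codebook = rng.normal(size=(4, 2))     # 4 codes of dimension 2
    head = np.zeros(2)                     # linear downstream head
    xs = rng.normal(size=(32, 2))          # toy input features
    ys = xs @ np.array([1.0, -2.0])        # toy task targets
    initial = codebook.copy()

    for _ in range(steps):
        for x, y in zip(xs, ys):
            # Tokenize: pick the nearest codebook entry (discrete index).
            k = int(np.argmin(((codebook - x) ** 2).sum(axis=1)))
            z = codebook[k].copy()         # embedding passed downstream
            err = head @ z - y             # task residual
            grad_head = 2.0 * err * z
            # Task gradient plus weighted reconstruction gradient.
            grad_code = 2.0 * err * head + alpha * 2.0 * (z - x)
            head -= lr * grad_head
            if joint:
                codebook[k] -= lr * grad_code

    # Evaluate with the final codebook and head.
    dists = ((xs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    z_all = codebook[np.argmin(dists, axis=1)]
    task_loss = float(np.mean((z_all @ head - ys) ** 2))
    recon_loss = float(np.mean(((z_all - xs) ** 2).sum(axis=-1)))
    shift = float(np.abs(codebook - initial).max())
    return task_loss, recon_loss, shift

frozen = train(joint=False)
tuned = train(joint=True)
print("frozen (task, recon, codebook shift):", frozen)
print("joint  (task, recon, codebook shift):", tuned)
```

In the frozen run the codebook shift is exactly zero, so only the head adapts to the task. In the joint run the codebook moves as well, with `alpha` controlling how strongly the reconstruction term holds it near the inputs.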

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Optical Imaging Technologies · Advanced Vision and Imaging