ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs

Yin Xie; Kaicheng Yang; Peirou Liang; Xiang An; Yongle Zhao; Yumeng Wang; Ziyong Feng; Roy Miles; Ismail Elezi; Jiankang Deng

arXiv:2410.14332 · cs.CV · August 14, 2025

TL;DR

ViCToR is a pretraining framework for large multimodal models (LMMs) that improves visual understanding by reconstructing visual tokens, reporting state-of-the-art results on MMStar, SEED$^I$, and RealWorldQA.

Contribution

The paper proposes ViCToR, a new pretraining method that improves visual token representation in LMMs using token reconstruction and semantic supervision.

Findings

01 Achieves state-of-the-art results on MMStar, SEED$^I$, and RealWorldQA benchmarks.

02 Improves LLaVA-NeXT-8B performance by over 10% on key benchmarks.

03 Demonstrates the effectiveness of visual token reconstruction in LMM pretraining.

Abstract

Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce a visual comprehension stage, which we call ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. ViCToR employs a learnable visual token pool and utilizes the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. Furthermore, by integrating a visual token reconstruction loss with dense semantic supervision, ViCToR can learn tokens which retain high visual detail, thereby enhancing the large language model's (LLM's) understanding of visual information. After pretraining on 3 million publicly accessible images and captions, ViCToR…
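The matching step described in the abstract can be illustrated with a small sketch. The abstract only states that the Hungarian algorithm selects semantically relevant tokens from a learnable pool; everything else below is an assumption made for illustration: the cosine-distance cost, the plain-list token representation, and the hypothetical names `cosine_cost`, `hungarian`, and `match_tokens_to_pool`. It is not the authors' implementation.

```python
import math


def cosine_cost(a, b):
    """Cosine distance (1 - cosine similarity); assumes nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def hungarian(cost):
    """Minimum-cost assignment of each row to a distinct column.

    Classical O(n^2 * m) Hungarian algorithm with row/column potentials.
    Requires len(rows) <= len(columns). Returns a list where entry i is
    the column index assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many columns as rows")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column
                break
        while j0 != 0:       # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_tokens_to_pool(tokens, pool):
    """Pick one distinct pool token per visual token (hypothetical helper)."""
    cost = [[cosine_cost(t, q) for q in pool] for t in tokens]
    return hungarian(cost)


# Toy example: two visual tokens, a pool of three candidate tokens.
tokens = [[1.0, 0.0], [0.0, 1.0]]
pool = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
matches = match_tokens_to_pool(tokens, pool)  # matches[i] indexes into pool
```

Because the assignment is one-to-one, no pool token is reused across visual tokens, which is the property that distinguishes Hungarian matching from an independent nearest-neighbour lookup per token. In practice a library routine such as SciPy's `linear_sum_assignment` would replace the hand-written solver.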

Code & Models

Repositories

deepglint/croc (PyTorch, official)

Videos

ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs (Underline)

Taxonomy

Topics: Speech and dialogue systems · Semantic Web and Ontologies · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need