ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs
Yin Xie, Kaicheng Yang, Peirou Liang, Xiang An, Yongle Zhao, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng

TL;DR
ViCToR is a pretraining framework for large multimodal models that improves visual understanding by reconstructing visual tokens, reaching state-of-the-art results on MMStar, SEED$^I$, and RealWorldQA.
Contribution
The paper proposes ViCToR, a new pretraining method that improves visual token representation in LMMs using token reconstruction and semantic supervision.
Findings
Achieves state-of-the-art results on MMStar, SEED$^I$, and RealWorldQA benchmarks.
Improves LLaVA-NeXT-8B performance by over 10% on key benchmarks.
Demonstrates the effectiveness of visual token reconstruction in LMM pretraining.
Abstract
Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce a visual comprehension stage, which we call ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. ViCToR employs a learnable visual token pool and utilizes the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. Furthermore, by integrating a visual token reconstruction loss with dense semantic supervision, ViCToR can learn tokens which retain high visual detail, thereby enhancing the large language model's (LLM's) understanding of visual information. After pretraining on 3 million publicly accessible images and captions, ViCToR…
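The token-selection step described in the abstract (Hungarian matching between visual tokens and a learnable token pool, followed by token replacement) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity cost, the `replace_ratio` parameter, and the function names `match_tokens_to_pool` and `replace_tokens` are assumptions made for illustration, since the abstract does not specify them. The pool is a plain array here, whereas in the paper it is learnable.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_tokens_to_pool(visual_tokens, token_pool):
    """One-to-one matching of visual tokens to pool tokens (Hungarian algorithm).

    Assumption: the matching cost is negative cosine similarity; the abstract
    does not state which cost the paper uses.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = token_pool / np.linalg.norm(token_pool, axis=1, keepdims=True)
    cost = -(v @ p.T)  # shape: (num_tokens, pool_size)
    # Minimising the negated similarity maximises total similarity.
    row_idx, col_idx = linear_sum_assignment(cost)
    return row_idx, col_idx


def replace_tokens(visual_tokens, token_pool, replace_ratio, rng):
    """Swap a random fraction of visual tokens for their matched pool tokens.

    `replace_ratio` is a hypothetical knob, not a value taken from the paper.
    Returns the modified token array and the indices that were replaced.
    """
    row_idx, col_idx = match_tokens_to_pool(visual_tokens, token_pool)
    num_replace = int(round(replace_ratio * len(row_idx)))
    chosen = rng.choice(len(row_idx), size=num_replace, replace=False)
    out = visual_tokens.copy()
    out[row_idx[chosen]] = token_pool[col_idx[chosen]]
    return out, row_idx[chosen]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = rng.normal(size=(16, 8))    # 16 pool tokens of dimension 8
    tokens = rng.normal(size=(10, 8))  # 10 visual tokens of dimension 8
    mixed, replaced = replace_tokens(tokens, pool, replace_ratio=0.3, rng=rng)
    print(mixed.shape, sorted(replaced.tolist()))
```

Because the assignment is one-to-one, no pool token is used for two different visual tokens. In a setup like the one the abstract describes, the replaced positions would then be the targets of a reconstruction loss; that part is omitted here because the abstract gives no details on it.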
Taxonomy
Topics: Speech and dialogue systems · Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
