From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs

Mingxiao Li; Fang Qu; Zhanpeng Chen; Na Su; Zhizhou Zhong; Ziyang Chen; Nan Du; Xiaolong Li

arXiv:2502.09093·cs.CV·February 14, 2025

TL;DR

This paper introduces VDEP, a novel pretraining method for multimodal large language models that enhances image-text alignment by reconstructing detailed visual features through dynamic embeddings, leading to improved performance across benchmarks.

Contribution

The work presents a hybrid autoregressive pretraining paradigm that effectively integrates visual information into MLLMs without architectural modifications, emphasizing detailed visual feature reconstruction.

Findings

01. VDEP outperforms existing methods on 13 benchmarks.

02. It improves multimodal alignment and visual feature reconstruction.

03. The approach seamlessly integrates into standard models.

Abstract

While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, which limits their performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Using dynamic embeddings from the MLP that follows the visual encoder, this approach supervises image hidden states and integrates image tokens into autoregressive training. Existing MLLMs primarily focus on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key improvement of this work is the reinterpretation of multimodal alignment as a process of recovering information from input data, with particular emphasis on reconstructing detailed visual features. The proposed method seamlessly integrates into standard models without architectural changes. Experiments on 13…
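The hybrid objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the loss form, so the mean-squared-error reconstruction term, the one-step shift between hidden states and target embeddings, and the `image_weight` parameter are all assumptions made for illustration, as are the function names.

```python
import numpy as np


def text_cross_entropy(logits, targets):
    """Mean cross-entropy over text positions.

    logits:  (T, V) array of unnormalized scores per text position.
    targets: (T,) array of integer token ids to predict.
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def image_reconstruction_loss(image_hidden, image_embeds):
    """ASSUMED form: MSE between the hidden state at image position i
    and the projector (MLP) embedding of image token i + 1.

    image_hidden: (N, D) language-model hidden states at image positions.
    image_embeds: (N, D) dynamic embeddings from the MLP after the
                  visual encoder, used here as regression targets.
    """
    predicted = image_hidden[:-1]   # positions 0 .. N-2 predict the next token
    target = image_embeds[1:]       # embeddings of tokens 1 .. N-1
    return np.mean((predicted - target) ** 2)


def hybrid_loss(text_logits, text_targets, image_hidden, image_embeds,
                image_weight=1.0):
    """Text next-token loss plus a weighted image reconstruction term.

    image_weight is a hypothetical balancing coefficient.
    """
    return (text_cross_entropy(text_logits, text_targets)
            + image_weight * image_reconstruction_loss(image_hidden,
                                                       image_embeds))
```

In this sketch the image term only adds a loss over existing hidden states, which is consistent with the claim that no architectural changes are needed; the actual supervision signal used in VDEP may differ.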

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Translation Studies and Practices