AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Ahmed Masry; Juan A. Rodriguez; Tianyu Zhang; Suyuchen Wang; Chao Wang; Aarash Feizi; Akshay Kalkunte Suresh; Abhay Puri; Xiangru Jian; Pierre-André Noël; Sathwik Tejaswi Madhusudhan; Marco Pedersoli; Bang Liu; Nicolas Chapados; Yoshua Bengio; Enamul Hoque; Christopher Pal; Issam H. Laradji; David Vazquez; Perouz Taslakian; Spandana Gella; Sai Rajeswar

arXiv:2502.01341 · cs.CL · November 4, 2025

TL;DR

AlignVLM introduces a novel method for aligning visual features with language embeddings by mapping them to a weighted average of LLM text embeddings, improving multimodal document understanding especially in low-resource scenarios.

Contribution

The paper presents AlignVLM, a new alignment technique that leverages linguistic priors in LLMs to better map visual features, outperforming existing methods in document understanding tasks.

Findings

01. Achieves state-of-the-art performance on document understanding tasks.

02. More effective in low-resource settings compared to prior methods.

03. Demonstrates robustness to noise and efficiency in alignment.

Abstract

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document…
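The core idea in the abstract, mapping a visual feature to a weighted average of LLM text embeddings, can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the abstract does not say how the weights are computed, so the linear scoring step, the softmax normalization, the function name align_connector, the parameter proj_weights, and all toy sizes below are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_connector(visual_feature, proj_weights, text_embeddings):
    """Map one visual feature to a weighted average of text embeddings.

    visual_feature:  list of d_v floats (stand-in for a vision encoder output)
    proj_weights:    V rows of d_v floats (hypothetical learned projection)
    text_embeddings: V rows of d_t floats (stand-in for the LLM embedding table)
    """
    # One score per vocabulary entry (ASSUMPTION: linear scoring).
    scores = [sum(w * x for w, x in zip(row, visual_feature)) for row in proj_weights]
    # Non-negative weights that sum to 1 (ASSUMPTION: softmax normalization).
    weights = softmax(scores)
    # Weighted average of the text embeddings, one coordinate at a time.
    dim = len(text_embeddings[0])
    aligned = [
        sum(w * emb[j] for w, emb in zip(weights, text_embeddings))
        for j in range(dim)
    ]
    return aligned, weights


# Toy sizes and random values, chosen only for illustration.
rng = random.Random(0)
V, d_v, d_t = 5, 4, 3
text_embeddings = [[rng.uniform(-1, 1) for _ in range(d_t)] for _ in range(V)]
proj_weights = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(V)]
visual_feature = [rng.uniform(-1, 1) for _ in range(d_v)]

aligned, weights = align_connector(visual_feature, proj_weights, text_embeddings)
```

Under these assumptions the weights form a convex combination, so the output always lies inside the convex hull of the text embeddings. That is one way to read the abstract's claim that visual features land in regions of the space the LLM can interpret, whereas an unconstrained MLP connector can place them anywhere.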

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Speech and dialogue systems