Deep Pre-Alignment for VLMs
Tianyu Yu, Kechen Fang, Zihao Wan, Kaidong Zhang, Yicheng Zhang, Jun Song, Bo Zheng, Yuan Yao

TL;DR
Deep Pre-Alignment (DPA) improves vision-language models by replacing the standard ViT encoder with a small VLM that acts as a perceiver. This aligns visual features more closely with the text space of the target LLM, improves benchmark performance, and reduces language forgetting.
Contribution
DPA introduces an architecture that deeply aligns visual features with the text space of the target LLM, outperforming baselines and reducing the loss of language capability.
Findings
DPA outperforms baselines by 1.9 to 3.0 points on multimodal benchmarks.
DPA reduces language capability forgetting by 32.9%.
Gains are consistent across different LLM families.
Abstract
Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth (Zhang et al., 2024; Artzy and Schwartz, 2024) on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with…
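The structural difference the abstract describes can be sketched in a few lines of Python. This is only a toy illustration, not the authors' implementation: all dimensions, function names, and the random weights are hypothetical, each component is reduced to a linear map, and the sketch assumes the small VLM perceiver consists of its own vision encoder followed by a few language-model layers, which the abstract does not spell out.

```python
import random

rng = random.Random(0)

# Hypothetical toy sizes; none of these come from the paper.
N_PATCHES, PATCH_DIM, VIT_DIM, SMALL_LM_DIM, LLM_DIM = 4, 12, 8, 6, 10

def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(tokens, weight):
    """Apply an (in_dim x out_dim) linear map to every token."""
    return [[sum(t * w for t, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]

def baseline_visual_tokens(patches, vit_w, proj_w):
    """Standard design: ViT features go straight through a light projector."""
    vit_features = linear(patches, vit_w)   # ViT encoder stand-in
    return linear(vit_features, proj_w)     # projector into the LLM input space

def dpa_visual_tokens(patches, vit_w, inner_proj_w, small_lm_layers, proj_w):
    """DPA-style design (assumed): a small VLM processes the image first."""
    hidden = linear(linear(patches, vit_w), inner_proj_w)  # perceiver's own encoder
    for layer_w in small_lm_layers:
        # Residual update as a stand-in for one small-LM layer.
        update = linear(hidden, layer_w)
        hidden = [[h + u for h, u in zip(h_row, u_row)]
                  for h_row, u_row in zip(hidden, update)]
    return linear(hidden, proj_w)           # projector into the target LLM space

patches = [[rng.uniform(-1, 1) for _ in range(PATCH_DIM)] for _ in range(N_PATCHES)]
vit_w = rand_matrix(PATCH_DIM, VIT_DIM)

base = baseline_visual_tokens(patches, vit_w, rand_matrix(VIT_DIM, LLM_DIM))
dpa = dpa_visual_tokens(
    patches,
    vit_w,
    rand_matrix(VIT_DIM, SMALL_LM_DIM),
    [rand_matrix(SMALL_LM_DIM, SMALL_LM_DIM) for _ in range(3)],
    rand_matrix(SMALL_LM_DIM, LLM_DIM),
)
print(len(base), len(base[0]), len(dpa), len(dpa[0]))
```

Both paths hand the target LLM the same number of visual tokens with the same width. The only difference in this sketch is that the DPA-style path has already passed the features through language-model layers before projection, which is the step the abstract credits with moving them closer to the text space.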
