Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability

Dong Shu; Haiyan Zhao; Jingyu Hu; Weiru Liu; Ali Payani; Lu Cheng; Mengnan Du

arXiv:2501.01346 · cs.CV · September 24, 2025

TL;DR

This survey examines alignment and misalignment in Large Vision-Language Models (LVLMs), analyzing the causes of misalignment, reviewing mitigation strategies, and emphasizing the importance of explainability for future improvements.

Contribution

The survey provides a comprehensive review of alignment phenomena in LVLMs, categorizes the causes of misalignment into data-level, model-level, and inference-level challenges, and discusses mitigation strategies and future research directions.

Findings

01. Misalignment occurs at object, attribute, and relational levels.

02. Mitigation strategies include parameter-frozen and parameter-tuning methods.

03. Explainability is crucial for understanding and improving LVLM alignment.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and textual representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling