Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability
Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, Mengnan Du

TL;DR
This survey examines alignment and misalignment in Large Vision-Language Models (LVLMs), analyzing their causes and mitigation strategies and emphasizing the importance of explainability for future improvements.
Contribution
It provides a comprehensive review of alignment phenomena, categorizes misalignment causes, and discusses mitigation strategies and future research directions in LVLMs.
Findings
Misalignment occurs at object, attribute, and relational levels.
Mitigation strategies include parameter-frozen and parameter-tuning methods.
Explainability is crucial for understanding and improving LVLM alignment.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and textual representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
