FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability

Estelle Aflalo; Gabriela Ben Melech Stan; Tiep Le; Man Luo; Shachar Rosenman; Sayak Paul; Shao-Yen Tseng; Vasudev Lal

arXiv:2412.14672 · cs.CV · March 20, 2025

TL;DR

FiVL introduces a comprehensive framework with new datasets, training tasks, and evaluation benchmarks to improve visual grounding and explainability in vision-language models, addressing hallucinations and reliance on linguistic priors.

Contribution

The paper presents a novel dataset construction method, a training task, and benchmarks aimed at enhancing visual grounding and interpretability in LVLMs.

Findings

01. Enhanced performance in visual grounding tasks.

02. Effective identification of attention heads for explainability.

03. Improved evaluation of image necessity in model responses.

Abstract

Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring that these models use visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise from the lack of effective visual grounding in current LVLMs. Furthermore, current vision-language benchmarks do not specifically measure the degree to which the answer requires the visual input. This limitation makes it challenging to confirm that the image is truly necessary, particularly in tasks like visual question answering. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. We…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling