FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability
Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal

TL;DR
FiVL introduces a comprehensive framework with new datasets, training tasks, and evaluation benchmarks to improve visual grounding and explainability in vision-language models, addressing hallucinations and reliance on linguistic priors.
Contribution
The paper presents a novel dataset construction method, a training task, and evaluation benchmarks, all aimed at enhancing visual grounding and interpretability in LVLMs.
Findings
Enhanced performance in visual grounding tasks.
Effective identification of attention heads for explainability.
Improved evaluation of image necessity in model responses.
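One generic way to make "image necessity" concrete is to compare a model's accuracy on the original images with its accuracy when each image is replaced by a perturbed version (for example, blanked or masked). The sketch below illustrates that idea only. It is not taken from the paper, and the names `image_necessity_gap`, `predict`, and the sample layout are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical sample layout: (image, perturbed_image, question, reference_answer).
Sample = Tuple[Any, Any, str, str]


def _matches(prediction: str, reference: str) -> bool:
    """Case-insensitive exact match after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def image_necessity_gap(
    predict: Callable[[Any, str], str],
    samples: Sequence[Sample],
) -> float:
    """Return accuracy on original images minus accuracy on perturbed images.

    A gap near 0 suggests the answers do not depend on the image content;
    a larger gap suggests the image is needed to answer correctly.
    """
    if not samples:
        raise ValueError("samples must be non-empty")

    correct_original = 0
    correct_perturbed = 0
    for image, perturbed_image, question, reference in samples:
        if _matches(predict(image, question), reference):
            correct_original += 1
        if _matches(predict(perturbed_image, question), reference):
            correct_perturbed += 1

    return (correct_original - correct_perturbed) / len(samples)
```

Here `predict` stands in for any callable that wraps a vision-language model. Exact string match is a simplifying assumption, and a real benchmark would likely use a task-specific scoring rule.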
Abstract
Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. Furthermore, current vision-language benchmarks do not specifically measure the degree to which the answer requires the visual input. This limitation makes it challenging to confirm that the image is truly necessary, particularly in tasks like visual question answering. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and also evaluate their effectiveness in achieving it. We…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
