EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models

Andrés Villa; Juan León Alcázar; Motasem Alfarra; Vladimir Araujo; Alvaro Soto; Bernard Ghanem

arXiv:2501.02699·cs.CV·January 7, 2025

Open Access

TL;DR

EAGLE enhances the visual component of multimodal models through a simple reformulation of contrastive pre-training, significantly reducing hallucinations and improving grounding without additional instruction training.

Contribution

The paper introduces EAGLE, a post-pretraining method that improves visual grounding and reduces hallucinations in multimodal models by reformulating contrastive pre-training.
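For context, the contrastive pre-training that EAGLE reformulates is usually a CLIP-style symmetric objective over paired image and text embeddings. The sketch below shows only that standard baseline in NumPy; it is not EAGLE's reformulation, which this page does not specify. The function names and the temperature value of 0.07 are illustrative assumptions.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: standard CLIP-style loss, NOT the EAGLE objective.
    # image_emb, text_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Cosine similarity between every image and every text, scaled by temperature.
    logits = image_emb @ text_emb.T / temperature

    # Matched pairs sit on the diagonal; treat each row/column as a classification task.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

# Toy check: correctly paired embeddings should score a lower loss than shuffled ones.
aligned = symmetric_contrastive_loss(np.eye(4), np.eye(4))
misaligned = symmetric_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In this baseline the loss is low when each image is most similar to its own caption, so the toy `aligned` value should be smaller than `misaligned`.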

Findings

1. EAGLE reduces hallucinations across multiple benchmarks.
2. The method improves visual grounding without extra instruction training.
3. EAGLE is agnostic to the language model and fusion module.

Abstract

Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM or fusion module and works as a…

Taxonomy

Topics: Data Visualization and Analytics

Methods: Focus