Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Aakriti Agrawal; Gouthaman KV; Rohith Aralikatti; Gauri Jagatap; Jiaxin Yuan; Vijay Kamarshi; Andrea Fanelli; Furong Huang

arXiv:2511.05017 · cs.CV · November 10, 2025

Open Access

TL;DR

This paper addresses hallucinations in large vision-language models (LVLMs) by refining textual embeddings with average-pooled visual features, which improves visual grounding and reduces hallucinations on established benchmarks.

Contribution

It identifies a bias toward the language modality in prevailing LVLM architectures and introduces a simple method that mitigates it by incorporating visual information into textual embeddings.

Findings

01. Refining textual embeddings improves visual grounding.

02. The method significantly reduces hallucinations.

03. Average pooling is an effective fusion technique.

Abstract

In this work, we identify an inherent bias in prevailing LVLM architectures toward the language modality, largely resulting from the common practice of simply appending visual embeddings to the input text sequence. To address this, we propose a simple yet effective method that refines textual embeddings by integrating average-pooled visual features. Our approach demonstrably improves visual grounding and significantly reduces hallucinations on established benchmarks. While average pooling offers a straightforward, robust, and efficient means of incorporating visual information, we believe that more sophisticated fusion methods could further enhance visual grounding and cross-modal alignment. Given that the primary focus of this work is to highlight the modality imbalance and its impact on hallucinations -- and to show that refining textual embeddings with visual information mitigates…
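
The refinement described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the pooled visual vector is combined with the text embeddings, so the additive fusion, the `alpha` weight, the function name `refine_text_embeddings`, and the tensor shapes below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def refine_text_embeddings(text_emb, vis_emb, alpha=1.0):
    """Add an average-pooled visual vector to every text token embedding.

    text_emb: array of shape (num_text_tokens, dim)
    vis_emb:  array of shape (num_visual_tokens, dim)
    alpha:    hypothetical mixing weight (an assumption, not from the paper)
    """
    # Average pooling over the visual tokens gives one vector of shape (dim,).
    pooled = vis_emb.mean(axis=0)
    # Broadcasting adds the same pooled vector to each text token.
    return text_emb + alpha * pooled


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 8))   # 5 text tokens, embedding dim 8
vis_emb = rng.normal(size=(16, 8))   # 16 visual tokens, embedding dim 8

refined = refine_text_embeddings(text_emb, vis_emb)

# The common practice the abstract describes: visual embeddings placed
# alongside the text sequence. The order used here is an assumption.
inputs = np.concatenate([vis_emb, refined], axis=0)
print(inputs.shape)
```

In this sketch every text token receives the same global visual summary, so the text side of the input carries image information before the language model processes it. A learned projection or attention-based fusion could replace the plain addition; the abstract itself notes that more sophisticated fusion methods may help further.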

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis