Mitigating Image Captioning Hallucinations in Vision-Language Models

Fei Zhao; Chengcui Zhang; Runlin Zhang; Tianyang Wang; Xi Li

arXiv:2505.03420 · cs.MM · June 10, 2025

Open Access

TL;DR

This paper introduces a reinforcement-learning-based test-time adaptation method that reduces hallucinations in vision-language models during inference, without retraining and without auxiliary models.

Contribution

The paper presents a reinforcement learning framework that updates only the layer normalization parameters of the language model at test time, mitigating hallucinations without retraining.
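
As a rough illustration of updating layer normalization parameters at test time, the sketch below applies a REINFORCE-style step to only the gain and bias of a layer normalization while a toy linear output head stays frozen. Everything in it is an assumption made for illustration: the toy model, the names `layer_norm`, `token_probs`, and `adapt_step`, the example numbers, and the `reward_fn` callback, which merely stands in for the paper's CLIP-based hallucination evaluator. It is not the authors' implementation.

```python
import math
import random

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize x to zero mean / unit variance, then apply learnable gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    x_hat = [(v - mean) / math.sqrt(var + eps) for v in x]
    out = [g * h + b for g, h, b in zip(gain, x_hat, bias)]
    return out, x_hat

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_probs(x, gain, bias, head):
    """Token distribution of the toy model: softmax(head @ layer_norm(x))."""
    normed, x_hat = layer_norm(x, gain, bias)
    logits = [sum(w * n for w, n in zip(row, normed)) for row in head]
    return softmax(logits), x_hat

def adapt_step(x, gain, bias, head, reward_fn, lr=0.05, rng=random):
    """One REINFORCE step: sample a token, score it, update gain and bias in place."""
    probs, x_hat = token_probs(x, gain, bias, head)
    token = rng.choices(range(len(probs)), weights=probs)[0]
    reward = reward_fn(token)  # hypothetical stand-in for a hallucination score
    # Gradient of log p(token) with respect to the logits: one_hot(token) - probs.
    dlogits = [(1.0 if k == token else 0.0) - p for k, p in enumerate(probs)]
    for j in range(len(gain)):
        # Back-propagate through the frozen head to normalized feature j.
        dnormed_j = sum(dlogits[k] * head[k][j] for k in range(len(head)))
        gain[j] += lr * reward * dnormed_j * x_hat[j]
        bias[j] += lr * reward * dnormed_j
    return token, reward

# Toy usage: token 0 plays the role of the "non-hallucinated" output.
head = [[0.5, -0.2, 0.1, 0.3], [-0.3, 0.4, 0.2, -0.1], [0.1, 0.1, -0.4, 0.2]]
x = [1.0, 2.0, 3.0, 4.0]
gain, bias = [1.0] * 4, [0.0] * 4
before, _ = token_probs(x, gain, bias, head)
rng = random.Random(0)
for _ in range(200):
    adapt_step(x, gain, bias, head, lambda t: 1.0 if t == 0 else 0.0, rng=rng)
after, _ = token_probs(x, gain, bias, head)
```

In this toy setting the probability of the rewarded token rises after adaptation while `head` is never modified, which mirrors the general idea of steering outputs through normalization parameters alone. A real vision-language model would need automatic differentiation and a learned reward signal in place of these hand-written gradients.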

Findings

01. Reduces hallucination rates by 15.4% on LLaVA and 17.3% on InstructBLIP.

02. Outperforms state-of-the-art methods with a 68.3% improvement in hallucination mitigation.

03. Updates only about 0.003% of the model's parameters during inference.
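
To give a feel for why the 0.003% figure is so small, the sketch below counts normalization parameters for a hypothetical decoder. The configuration (hidden size 4096, 32 blocks with two gain-only norms each plus one final norm, 7 billion total parameters) and the helper name `norm_param_fraction` are assumptions chosen for illustration and are not taken from the paper, so the result should only be read as an order-of-magnitude check.

```python
def norm_param_fraction(hidden_size, num_norms, total_params, has_bias=False):
    """Fraction of all parameters that live in normalization layers."""
    per_norm = hidden_size * (2 if has_bias else 1)  # gain, plus bias if present
    return num_norms * per_norm / total_params

# Hypothetical configuration, NOT taken from the paper:
# 32 decoder blocks with 2 norms each, plus 1 final norm, gain-only.
fraction = norm_param_fraction(
    hidden_size=4096,
    num_norms=32 * 2 + 1,
    total_params=7_000_000_000,
)
print(f"{fraction:.6%}")
```

Under these assumed numbers the fraction comes out at a few thousandths of a percent, the same order of magnitude as the reported figure. The exact value depends on the real architecture and on whether the vision encoder is included in the total.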

Abstract

Hallucinations in vision-language models (VLMs) hinder reliability and real-world applicability, usually stemming from distribution shifts between pretraining data and test samples. Existing solutions, such as retraining or fine-tuning on additional data, demand significant computational resources and labor-intensive data collection, while ensemble-based methods incur additional costs by introducing auxiliary VLMs. To address these challenges, we propose a novel test-time adaptation framework using reinforcement learning to mitigate hallucinations during inference without retraining or any auxiliary VLMs. By updating only the learnable parameters in the layer normalization of the language model (approximately 0.003% of the model parameters), our method reduces distribution shifts between test samples and pretraining samples. A CLIP-based hallucination evaluation model is proposed to…

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis

Methods: Layer Normalization