Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

Bidur Khanal; Sandesh Pokhrel; Sanjay Bhandari; Ramesh Rana; Nikesh Shrestha; Ram Bahadur Gurung; Cristian Linte; Angus Watson; Yash Raj Shrestha; Binod Bhattarai

arXiv:2505.07001·cs.CV·June 24, 2025

TL;DR

This paper introduces Gut-VLM, a multimodal GI dataset with annotated hallucinations, and proposes a hallucination-aware finetuning method that improves medical image report accuracy in vision-language models.

Contribution

The paper creates a high-quality, annotated GI dataset with hallucination labels and develops a novel hallucination-aware finetuning approach for VLMs.

Findings

1. Hallucination-aware finetuning outperforms traditional report generation finetuning.
2. Gut-VLM dataset enables effective benchmarking of VLMs in medical imaging.
3. State-of-the-art VLMs show reduced hallucinations after proposed finetuning.

Abstract

Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination--the tendency to generate descriptions that are inconsistent with the visual content--remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts…
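To make the idea of a report with annotated hallucinations concrete, here is a minimal sketch of how one image-report record might be represented. It assumes the expert stage yields a per-sentence hallucination flag and an optional corrected sentence, which the truncated abstract does not spell out. The class names, field names, image identifier, and example sentences are all hypothetical and are not taken from the Gut-VLM release.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnotatedSentence:
    """One sentence of a model-generated report (hypothetical schema)."""
    text: str                          # sentence as generated by the model
    hallucinated: bool                 # assumed expert label: inconsistent with the image?
    correction: Optional[str] = None   # assumed expert rewrite, if one was provided


@dataclass
class ReportRecord:
    """A generated report for one image, with sentence-level labels."""
    image_id: str
    sentences: List[AnnotatedSentence] = field(default_factory=list)

    def hallucination_rate(self) -> float:
        """Fraction of sentences flagged as hallucinated (0.0 for an empty report)."""
        if not self.sentences:
            return 0.0
        flagged = sum(1 for s in self.sentences if s.hallucinated)
        return flagged / len(self.sentences)

    def corrected_report(self) -> str:
        """Join sentences, substituting the correction where one exists."""
        parts = [
            s.correction if (s.hallucinated and s.correction) else s.text
            for s in self.sentences
        ]
        return " ".join(parts)


# Made-up example record; not real dataset content.
record = ReportRecord(
    image_id="example_0001",
    sentences=[
        AnnotatedSentence("A sessile polyp is visible on the colon wall.", False),
        AnnotatedSentence("Active bleeding is present.", True,
                          "No active bleeding is seen."),
    ],
)

print(record.hallucination_rate())
print(record.corrected_report())
```

A sentence-level layout like this is only one possible design. It makes it easy to derive both a hallucination-detection target (the flags) and a corrected report (the substituted text) from the same record.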

Code & Models

Repositories

bhattarailab/hallucination-aware-vlm (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Image Retrieval and Classification Techniques