TL;DR
MedHallTune is a large-scale benchmark designed to evaluate and reduce hallucinations in medical vision-language models, improving their reliability for healthcare applications.
Contribution
This work introduces MedHallTune, the first comprehensive benchmark for assessing and mitigating hallucinations in medical VLMs, comprising over 100,000 images and 1,000,000 instruction pairs.
Findings
Fine-tuning on MedHallTune reduces hallucinations in VLMs.
Fine-tuning also improves the models' zero-shot performance on medical VQA tasks.
These gains make medical vision-language models more trustworthy.
Abstract
The increasing use of vision-language models (VLMs) in healthcare applications presents significant challenges related to hallucinations, in which models generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision-making, potentially harming diagnosis and treatment. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that…
