HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation
Haoyu Wang, Zitong Li

TL;DR
HalluCXR introduces a benchmark for evaluating hallucinations in medical vision-language models, revealing high hallucination rates and proposing detection and mitigation strategies for safer clinical use.
Contribution
The paper presents a comprehensive benchmark, an annotation taxonomy, and ensemble mitigation methods to address hallucinations in medical VLMs for chest radiograph interpretation.
Findings
61.9–82.3% of outputs contain hallucinations
Normal radiographs attract the most severe hallucinations
Ensemble methods reduce hallucinations by up to 84.8%
Abstract
Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9–82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go…
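The evaluation count in the abstract is consistent with a full factorial design, in which every model answers every query type on every radiograph. The sketch below checks that arithmetic and shows the standard F1 formula used to report detection quality. The factorial design is an assumption inferred from the numbers, the function names are illustrative, and the confusion counts passed to `f1_score` are hypothetical, not values from the paper.

```python
# Figures taken from the abstract.
NUM_MODELS = 6
NUM_RADIOGRAPHS = 856
NUM_QUERY_TYPES = 3


def total_evaluations(models: int, radiographs: int, query_types: int) -> int:
    """Count evaluations, assuming each model sees every (radiograph, query) pair once."""
    return models * radiographs * query_types


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    print(total_evaluations(NUM_MODELS, NUM_RADIOGRAPHS, NUM_QUERY_TYPES))  # 15408
    # Hypothetical confusion counts, for illustration only.
    print(round(f1_score(tp=8, fp=1, fn=1), 3))
```

The product 6 × 856 × 3 matches the 15,408 evaluations reported, which supports (but does not confirm) the reading that no model, image, or query combination was skipped.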
