TL;DR
This paper introduces VIHD, a hallucination detection method for medical visual question answering that uses targeted visual token masking to calibrate semantic entropy, improving detection accuracy.
Contribution
VIHD leverages visual dependency probing and targeted visual intervention decoding to enhance hallucination detection in medical VQA models, outperforming existing methods.
Findings
VIHD outperforms state-of-the-art hallucination detection methods on three medical VQA benchmarks.
Visual dependency plays a crucial role in effective hallucination detection.
Calibrated semantic entropy (CSE) provides a reliable signal for hallucination identification.
Abstract
While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination…
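To make the term "semantic entropy" concrete, the sketch below computes the standard discrete version: sample several answers to the same question, group them into meaning clusters, and take the entropy of the cluster frequencies. This is not VIHD's implementation. The function names are hypothetical, the exact-match clustering is a stand-in for a real semantic equivalence check (such as bidirectional entailment), and the example answers are made up. How VIHD uses visual token masking to calibrate this quantity is not specified in the excerpt above, so the sketch only reports the entropy for two sets of samples and does not combine them.

```python
import math
from collections import Counter


def cluster_by_exact_match(answers):
    """Assign a cluster id to each answer.

    ASSUMPTION: normalized exact match stands in for a real semantic
    equivalence check; it is only here to keep the sketch self-contained.
    """
    ids = {}
    return [ids.setdefault(a.strip().lower(), len(ids)) for a in answers]


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Made-up sampled answers for one question, under two hypothetical conditions.
original_samples = ["Pneumonia", "pneumonia", "pneumonia ", "Pneumonia"]
masked_samples = ["pneumonia", "effusion", "normal", "effusion"]

se_original = semantic_entropy(cluster_by_exact_match(original_samples))
se_masked = semantic_entropy(cluster_by_exact_match(masked_samples))
print(f"original: {se_original:.3f} nats, masked: {se_masked:.3f} nats")
```

Low entropy means the sampled answers agree in meaning; high entropy means they scatter across several meanings. Comparing the two conditions gives a rough sense of how strongly the answers depend on the image, which is the kind of signal the abstract describes.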
