TL;DR
This paper introduces Med-HallMark, a comprehensive benchmark and evaluation framework for detecting and assessing hallucinations in medical vision-language models, aiming to improve their reliability in healthcare.
Contribution
It presents the first dedicated medical hallucination detection benchmark, a hierarchical scoring metric, and a specialized LVLM for precise hallucination detection.
Findings
MediHall Score offers a nuanced assessment of hallucination impact.
MediHallDetector outperforms existing models in hallucination detection.
The benchmark facilitates standardized evaluation of medical LVLMs.
Abstract
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new…
