Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization

Suhas BN; Han-Chin Shing; Lei Xu; Mitch Strong; Jon Burnsky; Jessica Ofor; Jordan R. Mason; Susan Chen; Sundararajan Srinivasan; Chaitanya Shivade; Jack Moriarty; Joseph Paul Cohen

arXiv:2506.00448·cs.CL·June 3, 2025

TL;DR

This paper evaluates hallucination detection in medical text summarization, shows that general-domain detectors struggle with clinical data, and introduces fact-based, explainable methods that generalize well to real-world clinical hallucinations.

Contribution

The study constructs specialized datasets and develops fact-based hallucination detection methods that improve explainability and generalization in clinical summarization.

Findings

01. General-domain detectors underperform on clinical hallucinations.

02. Fact-controlled datasets reveal limitations of existing detectors.

03. LLM-based detectors trained on fact-controlled data generalize to real clinical hallucinations.

Abstract

Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain, and construct two datasets for this purpose: a fact-controlled Leave-N-out dataset, generated by systematically removing facts from source dialogues to induce hallucinated content in summaries, and a natural hallucination dataset, arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that…
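The Leave-N-out idea in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline, which the excerpt does not detail. It assumes a dialogue has already been split into atomic fact strings, that the summary was written from the full dialogue, and that support can be checked by exact string match. The function names `leave_n_out` and `label_summary` and the example facts are hypothetical.

```python
from itertools import combinations


def leave_n_out(facts, n):
    """Yield (kept, removed) for every way of removing n facts from the source."""
    for removed_idx in combinations(range(len(facts)), n):
        removed = [facts[i] for i in removed_idx]
        kept = [f for i, f in enumerate(facts) if i not in removed_idx]
        yield kept, removed


def label_summary(summary_facts, kept_facts):
    """Label each summary fact against the reduced source.

    Assumption: exact string match stands in for a real entailment check.
    """
    kept = set(kept_facts)
    return [
        (fact, "supported" if fact in kept else "hallucinated")
        for fact in summary_facts
    ]


# Hypothetical facts extracted from a patient-clinician dialogue.
dialogue_facts = [
    "patient reports chest pain",
    "pain started two days ago",
    "patient takes lisinopril",
]

# Assumption: the summary was written from the full dialogue,
# so it states every fact.
summary_facts = list(dialogue_facts)

# Removing n=1 fact from the source leaves exactly one summary fact
# unsupported, so the location of the hallucination is known.
examples = [
    {"source": kept, "removed": removed, "labels": label_summary(summary_facts, kept)}
    for kept, removed in leave_n_out(dialogue_facts, n=1)
]

for example in examples:
    print(example["removed"], example["labels"])
```

Because the removed facts are known by construction, every label in such a dataset comes with an explanation of why the content is unsupported, which is what makes the setup fact-controlled.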

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Advanced Text Analysis Techniques · Machine Learning in Healthcare