Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation
Bhavik Vachhani, Kush Shrisvastava, Pranshu Nema, Sai Chiranthan

TL;DR
This paper critiques current LLM evaluation methods for clinical documentation, showing they overestimate hallucinations by ignoring clinical reasoning, and proposes a more context-aware assessment approach.
Contribution
It introduces a clinically grounded evaluation framework that reduces false-positive hallucination detections by aligning metrics with medical reasoning and ontology-based retrieval.
Findings
Lexical evaluation reports a 35% hallucination rate, which drops to 9% under inference-aware evaluation.
Many flagged hallucinations are legitimate clinical transformations, such as synonym mapping and diagnostic inference.
Clinically informed evaluation better distinguishes true errors from valid reasoning.
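To illustrate why literal matching can inflate hallucination counts, here is a toy sketch. It is not the paper's method, which this summary describes only as inference-aware and ontology-based. The synonym table `SYNONYMS`, the function names, and the sample transcript are all hypothetical, and a real system would presumably consult a medical ontology rather than a hand-written dictionary.

```python
# Hypothetical mapping from clinical terms to colloquial phrasings.
# This is an assumption for illustration, not a resource from the paper.
SYNONYMS = {
    "hypertension": {"high blood pressure"},
    "dyspnea": {"short of breath", "shortness of breath"},
}


def lexically_supported(term: str, transcript: str) -> bool:
    """Literal check: the term must appear verbatim in the transcript."""
    return term.lower() in transcript.lower()


def synonym_supported(term: str, transcript: str) -> bool:
    """Looser check: the term or any listed synonym appears in the transcript."""
    text = transcript.lower()
    candidates = {term.lower()} | SYNONYMS.get(term.lower(), set())
    return any(candidate in text for candidate in candidates)


def flagged(terms, transcript, check):
    """Return the note terms that the given check cannot ground in the transcript."""
    return [term for term in terms if not check(term, transcript)]


# Made-up example data.
transcript = "I've had high blood pressure for years and I get short of breath on stairs."
note_terms = ["hypertension", "dyspnea", "diabetes"]

lexical_flags = flagged(note_terms, transcript, lexically_supported)
synonym_flags = flagged(note_terms, transcript, synonym_supported)

print(lexical_flags)   # all three terms are flagged
print(synonym_flags)   # only the unsupported term remains
```

In this example the literal check flags every note term, while the synonym-aware check flags only "diabetes", which the transcript never mentions in any form. Synonym mapping is just one of the transformations listed above; abstraction and diagnostic inference would need more than a lookup table.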
Abstract
Evaluating large language models (LLMs) for clinical documentation tasks such as SOAP note generation remains challenging. Unlike standard summarization, these tasks require clinical abstraction, normalization of colloquial language, and medically grounded inference. However, prevailing evaluation methods, including automated metrics and LLM-as-judge frameworks, rely on lexical faithfulness, often labeling any information not explicitly present in the transcript as hallucination. We show that such approaches systematically misclassify clinically valid outputs as errors, inflating hallucination rates and distorting model assessment. Our analysis reveals that many flagged hallucinations correspond to legitimate clinical transformations, including synonym mapping, abstraction of examination findings, diagnostic inference, and guideline-consistent care planning. By aligning evaluation…
