Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation

Bhavik Vachhani; Kush Shrisvastava; Pranshu Nema; Sai Chiranthan

arXiv:2604.14829 · cs.AI · April 17, 2026

TL;DR

This paper critiques current LLM evaluation methods for clinical documentation, showing they overestimate hallucinations by ignoring clinical reasoning, and proposes a more context-aware assessment approach.

Contribution

It introduces a clinically grounded evaluation framework that reduces false-positive hallucination flags by aligning metrics with medical reasoning and ontology-based retrieval.

Findings

01

Lexical evaluation reports a 35% hallucination rate, which drops to 9% under inference-aware evaluation.

02

Many flagged hallucinations are legitimate clinical transformations like synonym mapping and inference.

03

Clinically informed evaluation better distinguishes true errors from valid reasoning.
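
The gap between lexical and clinically informed evaluation in the findings above can be illustrated with a toy sketch. This is not the paper's framework: the `SYNONYMS` table, the function names, and the sample transcript are all hypothetical, and a real system would presumably draw on a clinical ontology and handle inference, not just synonyms. The sketch only shows why a strict string match flags valid normalizations as hallucinations.

```python
# Hypothetical synonym table for illustration only; it stands in for
# whatever ontology-based lookup a real evaluator would use.
SYNONYMS = {
    "hypertension": {"high blood pressure"},
    "dyspnea": {"shortness of breath", "short of breath"},
}


def lexical_flags(note_terms, transcript):
    """Flag every note term that does not appear verbatim in the transcript."""
    text = transcript.lower()
    return [term for term in note_terms if term.lower() not in text]


def synonym_aware_flags(note_terms, transcript):
    """Flag a note term only if neither it nor any listed synonym appears."""
    text = transcript.lower()
    flagged = []
    for term in note_terms:
        variants = {term.lower()} | SYNONYMS.get(term.lower(), set())
        if not any(variant in text for variant in variants):
            flagged.append(term)
    return flagged


# Made-up example data, not taken from the paper.
transcript = (
    "I've had high blood pressure for years and lately "
    "I get short of breath on stairs."
)
note_terms = ["hypertension", "dyspnea", "chest pain"]

print(lexical_flags(note_terms, transcript))
print(synonym_aware_flags(note_terms, transcript))
```

In this made-up example the lexical check flags all three terms, while the synonym-aware check flags only "chest pain", which has no support in the transcript at all.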

Abstract

Evaluating large language models (LLMs) for clinical documentation tasks such as SOAP note generation remains challenging. Unlike standard summarization, these tasks require clinical abstraction, normalization of colloquial language, and medically grounded inference. However, prevailing evaluation methods, including automated metrics and LLM-as-judge frameworks, rely on lexical faithfulness, often labeling any information not explicitly present in the transcript as hallucination. We show that such approaches systematically misclassify clinically valid outputs as errors, inflating hallucination rates and distorting model assessment. Our analysis reveals that many flagged hallucinations correspond to legitimate clinical transformations, including synonym mapping, abstraction of examination findings, diagnostic inference, and guideline-consistent care planning. By aligning evaluation…
