Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation

Kristine Ann M. Carandang; Jasper Meynard P. Araña; Ethan Robert A. Casin; Christopher P. Monterola; Daniel Stanley Y. Tan; Jesus Felix B. Valenzuela; Christian M. Alis

arXiv:2505.17095 · cs.CL · August 26, 2025


TL;DR

This study evaluates the reliability of 12 large language models in clinical note generation by measuring, across repeated runs of the same prompt, whether the generated notes are string-identical, whether they keep the same meaning, and how close they are to expert-written notes.

Contribution

It provides a comprehensive comparison of open-weight and proprietary LLMs' reliability in clinical note generation, highlighting the most stable and accurate models.

Findings

01. LLMs are generally semantically consistent across responses

02. Most models produce notes close to expert annotations

03. Meta's Llama 70B is the most reliable model

Abstract

Due to the legal and ethical responsibilities of healthcare providers (HCPs) for accurate documentation and protection of patient data privacy, the natural variability in the responses of large language models (LLMs) presents challenges for incorporating clinical note generation (CNG) systems, driven by LLMs, into real-world clinical processes. The complexity is further amplified by the detailed nature of texts in CNG. To enhance the confidence of HCPs in tools powered by LLMs, this study evaluates the reliability of 12 open-weight and proprietary LLMs from Anthropic, Meta, Mistral, and OpenAI in CNG in terms of their ability to generate notes that are string equivalent (consistency rate), have the same meaning (semantic consistency) and are correct (semantic similarity), across several iterations using the same prompt. The results show that (1) LLMs from all model families are stable,…
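The three measures named in the abstract can be illustrated with a minimal sketch. The definitions below are assumptions made for illustration, not the paper's exact formulas: consistency rate is taken as the fraction of response pairs that are identical strings, semantic consistency as the mean pairwise similarity among responses, and semantic similarity as the mean similarity of each response to a reference note. The function names are hypothetical, and the bag-of-words cosine in `bow_cosine` is only a stand-in for whatever semantic scorer the study actually uses.

```python
import math
from collections import Counter
from itertools import combinations


def consistency_rate(responses):
    """Fraction of response pairs that are exactly the same string (assumed definition)."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def bow_cosine(a, b):
    """Bag-of-words cosine similarity; a crude stand-in for an embedding-based scorer."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def semantic_consistency(responses, sim=bow_cosine):
    """Mean pairwise similarity among responses to the same prompt (assumed definition)."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def semantic_similarity(responses, reference, sim=bow_cosine):
    """Mean similarity of each response to a reference note (assumed definition)."""
    if not responses:
        return 0.0
    return sum(sim(r, reference) for r in responses) / len(responses)


# Hypothetical data: three generations from the same prompt, plus a reference note.
notes = [
    "Patient reports mild headache.",
    "Patient reports mild headache.",
    "Patient has a mild headache.",
]
reference = "Patient reports a mild headache."

print("consistency rate:", consistency_rate(notes))
print("semantic consistency:", semantic_consistency(notes))
print("semantic similarity:", semantic_similarity(notes, reference))
```

Passing the scorer as the `sim` argument keeps the three measures separate from the choice of similarity function, so an embedding-based scorer could be swapped in without changing the rest of the sketch.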
