# Accurate discharge summary generation using fine tuned large language models with self evaluation

**Authors:** Wenbin Li, Hui Feng, Chao Hu, Minpeng Xu, Longlong Cheng

PMC · DOI: 10.1038/s41598-026-35552-z · 2026-01-17

## TL;DR

This paper presents a framework that combines DoRA fine-tuning of large language models with an iterative self-evaluation step to generate medical discharge summaries more accurately and with less clinician effort.

## Contribution

A novel framework combining DoRA fine-tuning and a self-evaluation mechanism for improved discharge summary generation.
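
For orientation, the weight update behind DoRA can be written as below. This follows the original DoRA formulation rather than anything stated in this summary, so the symbols ($W_0$, $B$, $A$, $m$, $r$) are assumptions about notation, and the rank and target layers used in this paper are not given here.

$$
W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}
$$

Here $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pre-trained weight, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank factors with $r \ll \min(d, k)$, $m \in \mathbb{R}^{1 \times k}$ is a trainable magnitude vector initialized to $\lVert W_0 \rVert_c$, and $\lVert \cdot \rVert_c$ is the column-wise vector norm. LoRA, by contrast, learns only the additive update $W_0 + BA$ without separating magnitude from direction.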

## Key findings

- The self-evaluation mechanism improved BERTScore by 6.9% and ROUGE-L by 69.6% relative to few-shot prompting, and by 4.1% and 0.4% relative to Chain of Thought (CoT) prompting.
- DoRA outperformed LoRA and QLoRA, with higher BERTScore and lower Perplexity across all evaluated models.
- Qualitative metrics showed consistent gains in accuracy and completeness, and the authors report that the framework reduces the time needed to create summaries.
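
As a reminder of what the ROUGE-L figures above measure, here is a minimal sketch of a sentence-level ROUGE-L F1 score based on the longest common subsequence (LCS) of tokens. The whitespace tokenization, lowercasing, and the function names `lcs_length` and `rouge_l_f1` are assumptions made for illustration; the paper's actual scoring implementation and settings are not stated in this summary.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 (assumed: lowercase, whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "patient discharged on aspirin",
    "patient was discharged home on aspirin",
)
print(round(score, 3))
```

Because recall is computed against the reference, a summary that omits reference content scores lower, which is why an omission-correcting step can plausibly move this metric.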

## Abstract

Discharge summaries are critical for patient care continuity, clinical decision-making, and legal documentation, yet their creation is labor-intensive. Clinicians must manually integrate diverse data from multiple sources under time constraints, often leading to delays, inconsistencies, and potential omissions. This study introduces a novel framework to automate discharge summary generation using advanced natural language processing (NLP) techniques, aiming to reduce clinician workload while ensuring accurate, complete, and standardized documentation. We combine the Weight-Decomposed Low-Rank Adaptation (DoRA) fine-tuning method with a novel self-evaluation mechanism to enhance large language models (LLMs) for medical text generation. DoRA efficiently adapts pre-trained LLMs to the specialized medical domain, demonstrating superior performance over traditional methods such as LoRA and QLoRA, with an enhancement in BERTScore and a reduction in Perplexity across all evaluated models. The self-evaluation mechanism, inspired by cognitive psychology, iteratively re-feeds generated summaries together with segmented clinical data into the model, allowing it to systematically detect and correct omissions in each data segment, thereby ensuring the outputs accurately and comprehensively represent the original input. This approach was rigorously compared against few-shot prompting and Chain of Thought (CoT) methods. Extensive experiments show that self-evaluation improves BERTScore by 6.9% and 4.1% and increases ROUGE-L by 69.6% and 0.4% relative to few-shot and CoT baselines, respectively, while qualitative metrics also demonstrate consistent gains in accuracy and completeness. Our results demonstrate substantial enhancements in the quality and consistency of generated discharge summaries while reducing the time required for their creation.
This research underscores the potential of AI-driven tools in healthcare documentation. The findings indicate promising prospects for automating medical documentation that adheres to high standards of accuracy and relevance.
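
The self-evaluation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording in `REVIEW_PROMPT`, the `max_rounds` cap, the stop-when-unchanged rule, and the function names are all hypothetical, and `toy_model` is a deterministic stand-in for a fine-tuned LLM so the example runs without any model.

```python
from typing import Callable, Sequence

# Hypothetical prompt template; the paper's actual prompts are not given here.
REVIEW_PROMPT = (
    "Clinical data segment:\n{segment}\n\n"
    "Current discharge summary:\n{summary}\n\n"
    "If the summary omits or misstates anything from this segment, "
    "return a corrected summary. Otherwise return the summary unchanged."
)


def self_evaluate(
    generate: Callable[[str], str],
    draft: str,
    segments: Sequence[str],
    max_rounds: int = 3,
) -> str:
    """Re-feed the summary with each data segment until nothing changes."""
    summary = draft
    for _ in range(max_rounds):
        changed = False
        for segment in segments:
            prompt = REVIEW_PROMPT.format(segment=segment, summary=summary)
            revised = generate(prompt).strip()
            if revised and revised != summary:
                summary = revised
                changed = True
        if not changed:  # assumed stopping rule: a full pass with no edits
            break
    return summary


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: appends a segment if the summary lacks it."""
    segment = prompt.split("Clinical data segment:\n", 1)[1]
    segment = segment.split("\n\nCurrent discharge summary:\n", 1)[0]
    summary = prompt.split("Current discharge summary:\n", 1)[1]
    summary = summary.split("\n\nIf the summary", 1)[0]
    return summary if segment in summary else summary + " " + segment


final = self_evaluate(
    toy_model,
    draft="Dx: pneumonia.",
    segments=["Dx: pneumonia.", "Rx: amoxicillin."],
)
print(final)
```

Checking one segment at a time keeps each review prompt short and gives every part of the record its own pass, which matches the abstract's description of detecting omissions "in each data segment".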

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12891688/full.md

---
Source: https://tomesphere.com/paper/PMC12891688