Clinical Note Bloat Reduction for Efficient LLM Use
Jordan L. Cahoon, Chloe Stanwyck, Asad Aali, Rachel Madding, Emma Sun, Yixing Jiang, Renumathy Dhanasekaran, Emily Alsentzer

TL;DR
This paper presents TRACE, a scalable method to reduce clinical note bloat by leveraging EHR metadata and deduplication, significantly decreasing computational costs while maintaining model performance.
Contribution
The paper introduces TRACE, a novel preprocessing pipeline that effectively reduces note bloat in clinical texts using EHR metadata and deduplication techniques.
Findings
TRACE removed 47.3% of chart text
Preserved performance for information extraction and outcome prediction
Estimated $9.5 million annual cost savings at a medical center
Abstract
Health systems are rapidly deploying large language models (LLMs) that use clinical notes for clinical decision support applications. However, modern documentation practices rely heavily on templates, copy-paste shortcuts, and auto-populated fields, producing extensive duplicated text ("note bloat") that dilutes clinically meaningful signal and substantially increases the computational cost of LLM use. We introduce TRACE, a scalable preprocessing pipeline that removes note bloat by leveraging EHR attribution metadata to identify templated and copied content and applying frequency-based deduplication when metadata are unavailable. We evaluated TRACE across four real-world clinical cohorts spanning liver transplantation, obstetrics, and inpatient care (5.3 million notes) using blinded physician review and downstream modeling tasks. TRACE removed 47.3% of chart text while preserving…
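The abstract mentions frequency-based deduplication as the fallback when attribution metadata are unavailable, but the excerpt here does not say how it is implemented. The sketch below is one plausible reading, not the authors' method: it treats a line as likely boilerplate when it appears in more than a chosen fraction of a patient's notes, keeps its first occurrence, and drops later copies. The function names, the line-level granularity, the normalization, and the max_note_fraction threshold are all assumptions made for illustration.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", line).strip().lower()


def dedupe_chart(notes: list[str], max_note_fraction: float = 0.5) -> list[str]:
    """Hypothetical line-level frequency deduplication for one patient's chart.

    A line is treated as frequently copied when it appears in more than
    max_note_fraction of the notes. Its first occurrence is kept and later
    occurrences are dropped. Both choices are illustrative assumptions.
    """
    # Count the number of notes (not raw occurrences) containing each line.
    note_counts: Counter = Counter()
    for note in notes:
        lines_in_note = {normalize(l) for l in note.splitlines() if l.strip()}
        note_counts.update(lines_in_note)

    limit = max_note_fraction * len(notes)
    already_kept: set[str] = set()
    cleaned: list[str] = []
    for note in notes:
        kept_lines = []
        for line in note.splitlines():
            key = normalize(line)
            if not key:
                continue  # skip blank lines
            if note_counts[key] > limit and key in already_kept:
                continue  # repeat of a frequently copied line
            already_kept.add(key)
            kept_lines.append(line)
        cleaned.append("\n".join(kept_lines))
    return cleaned


if __name__ == "__main__":
    chart = [
        "Allergies: none\nPatient reports nausea.",
        "Allergies: none\nNausea resolved.",
        "Allergies:  NONE\nDischarged home.",
    ]
    for note in dedupe_chart(chart):
        print(repr(note))
```

In this toy chart the repeated allergy line survives only in the first note, while each note's unique content is left untouched. A real pipeline would likely need a more careful unit of comparison (sentences or spans rather than lines) and a threshold tuned against clinician review.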
