Synthetic Clinical Notes for Rare ICD Codes: A Data-Centric Framework for Long-Tail Medical Coding

Truong Vo; Weiyi Wu; Kaize Ding

arXiv:2511.14112 · cs.CL · November 19, 2025

Open Access

TL;DR

This paper introduces a data-centric approach that generates synthetic clinical notes to improve the prediction of rare ICD codes, addressing the long-tail distribution challenge in medical NLP.

Contribution

The paper presents a method for generating high-quality synthetic discharge summaries that augment the training data for rare ICD codes, improving model performance.

Findings

01. Synthetic data improves macro-F1 scores.

02. The method outperforms prior state-of-the-art models.

03. The 90,000 synthetic notes cover 7,902 ICD codes.
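
Macro-F1 averages per-code F1 with equal weight for every code, so a code the model never predicts contributes a zero no matter how rarely it occurs. The sketch below uses made-up labels (not data from the paper; `RARE_CODE` and the helper names are placeholders) to show how a model that ignores one rare code can keep a high micro-F1 while its macro-F1 drops sharply.

```python
from typing import List, Set


def per_code_f1(gold: List[Set[str]], pred: List[Set[str]], code: str) -> float:
    """F1 for a single code across all notes."""
    tp = sum(1 for g, p in zip(gold, pred) if code in g and code in p)
    fp = sum(1 for g, p in zip(gold, pred) if code not in g and code in p)
    fn = sum(1 for g, p in zip(gold, pred) if code in g and code not in p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], codes: List[str]) -> float:
    """Unweighted mean of per-code F1: every code counts equally."""
    return sum(per_code_f1(gold, pred, c) for c in codes) / len(codes)


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """F1 over pooled (note, code) decisions: frequent codes dominate."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels: one frequent code on every note, one rare code on a single note.
codes = ["COMMON_CODE", "RARE_CODE"]
gold = [{"COMMON_CODE"} for _ in range(9)] + [{"COMMON_CODE", "RARE_CODE"}]
# A model that always predicts the frequent code and never the rare one.
pred = [{"COMMON_CODE"} for _ in range(10)]

macro = macro_f1(gold, pred, codes)  # 0.5
micro = micro_f1(gold, pred)         # 20/21, about 0.952
```

With thousands of rare codes in the label space, each missed code pulls the macro average down in the same way, which is why adding training examples for the tail is aimed at this metric.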

Abstract

Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show…
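
The abstract describes building multi-label code sets anchored on a rare code from real co-occurrence patterns, then turning them into structured prompts. The sketch below illustrates that idea only in outline: the function names, the top-k selection rule, the prompt wording, and the placeholder codes are all assumptions made for illustration, not the authors' procedure, and it omits the synonyms, taxonomy, and similar-note retrieval the abstract also mentions.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, List, Sequence


def build_cooccurrence(code_sets: Sequence[Sequence[str]]) -> Dict[str, Counter]:
    """Count how often each pair of codes appears on the same real note."""
    cooc: Dict[str, Counter] = defaultdict(Counter)
    for codes in code_sets:
        for a, b in permutations(set(codes), 2):
            cooc[a][b] += 1
    return cooc


def anchored_code_set(anchor: str, cooc: Dict[str, Counter], k: int) -> List[str]:
    """Anchor on a rare code and add its k most frequent co-occurring codes.

    Assumption: a deterministic top-k rule (ties broken alphabetically);
    a weighted random draw would be an equally plausible choice.
    """
    partners = sorted(cooc.get(anchor, Counter()).items(),
                      key=lambda kv: (-kv[1], kv[0]))
    return [anchor] + [code for code, _ in partners[:k]]


def build_prompt(code_set: Sequence[str], descriptions: Dict[str, str]) -> str:
    """Assemble a structured prompt listing each code with its description."""
    lines = [f"- {c}: {descriptions.get(c, 'description unavailable')}"
             for c in code_set]
    header = "Write a realistic discharge summary consistent with these ICD codes:"
    return header + "\n" + "\n".join(lines)


# Placeholder code sets standing in for real notes; "R" plays the rare code.
real_code_sets = [["A", "B"], ["A", "B", "R"], ["A", "C"], ["R", "A"]]
descriptions = {
    "R": "rare condition (placeholder)",
    "A": "common condition A (placeholder)",
    "B": "common condition B (placeholder)",
}

cooc = build_cooccurrence(real_code_sets)
code_set = anchored_code_set("R", cooc, k=2)  # ["R", "A", "B"]
prompt = build_prompt(code_set, descriptions)
```

The resulting `prompt` string would be passed to a text generator, and the generated note would be labeled with `code_set` before being added to the training data.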

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Medical Coding and Health Information