Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data
John Cook, Michael Wyatt, Peng Wei, Iris Chin, Santosh Gupta, Van Zyl Van Vuuren, Richie Siburian, Amanda Spicer, Kristen Viviano, Alda Cami, Raunaq Malhotra, Zhewei Yao, Jeff Rasley, Gaurav Kaushik

TL;DR
This paper demonstrates that fine-tuning a large language model with synthetic, privacy-preserving clinical data significantly improves its accuracy in automated medical coding, achieving high exact-match F1 scores for ICD-10-CM and CPT codes.
Contribution
The study shows that using synthetic training data enables a general-purpose LLM to excel at medical coding tasks without risking patient privacy, surpassing zero-shot performance.
Findings
Exact-match F1 score exceeded 0.70 after fine-tuning.
Performance remained high on complex, reasoning-intensive categories.
Synthetic data effectively trains models for real-world medical coding applications.
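The exact-match F1 metric cited in the findings can be illustrated with a minimal sketch. It assumes micro-averaging over per-note sets of codes, where a predicted code counts as correct only if its string is identical to a gold code. The summary does not state the paper's exact averaging scheme, so that choice, the function name `exact_match_f1`, and the sample codes are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Sequence, Set, Tuple


def exact_match_f1(
    gold: Sequence[Set[str]], pred: Sequence[Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-note code sets.

    Assumption: a code is correct only on an exact string match, so a
    near-miss such as E11.65 for E11.9 is both a false positive and a
    false negative.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same notes")

    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, pred):
        tp += len(gold_codes & pred_codes)   # codes predicted and correct
        fp += len(pred_codes - gold_codes)   # codes predicted but not in gold
        fn += len(gold_codes - pred_codes)   # gold codes the model missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: two notes with ICD-10-CM and CPT codes.
gold = [{"E11.9", "I10"}, {"99213"}]
pred = [{"E11.65", "I10"}, {"99213", "J45.909"}]
precision, recall, f1 = exact_match_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example there are 2 true positives, 2 false positives, and 1 false negative, so F1 = 4/7, about 0.571. A macro-averaged or per-note variant would weight rare codes differently, which matters for long-tail code distributions.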
Abstract
Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to assist with or automate specific medical coding tasks, but foundation models are not explicitly trained for medical coding, and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding…
Taxonomy
Topics: Machine Learning in Healthcare · Medical Coding and Health Information · Artificial Intelligence in Healthcare and Education
