A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients
Joakim Edin, Sedrah Butt Balaganeshan, Annike Kjølby Kristensen, Lars Maaløe, Ioannis Louloudis, Søren Brunak

TL;DR
This study developed a large-scale language model trained on extensive clinical data to automate medical coding, achieving high accuracy and revealing potential under-coding issues in healthcare records.
Contribution
The paper introduces a novel language model trained on 5.8 million health records for ICD-10 coding, demonstrating improved automation and insights into under-coding in clinical documentation.
Findings
The model achieved a micro F1 of 71.8% on ICD-10 coding, evaluated on 270,000 held-out patients.
High top-10 recall of 95.5%, indicating effective candidate suggestions.
Identified systematic under-coding of secondary diagnoses, with 76-86% validation upon review.
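For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how micro F1 and top-k recall are commonly computed for multi-label code assignment. It is not the authors' evaluation code: the function names, the toy ICD-10 codes, and the choice to pool recall over all gold codes are assumptions made for illustration.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro F1: pool true/false positives and false negatives over all records."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def recall_at_k(gold: List[Set[str]], ranked: List[List[str]], k: int = 10) -> float:
    """Share of gold codes found among the k highest-ranked suggestions.

    Assumption: recall is pooled over all gold codes (micro-averaged);
    the paper may define top-10 recall differently.
    """
    hits = sum(len(g & set(r[:k])) for g, r in zip(gold, ranked))
    total = sum(len(g) for g in gold)
    return hits / total if total else 0.0


# Toy data (hypothetical): two records with gold codes, predicted code sets,
# and ranked candidate lists.
gold = [{"E11", "I10"}, {"J45"}]
pred = [{"E11"}, {"J45", "I10"}]
ranked = [["E11", "E66", "I10"], ["I10", "J45"]]

f1 = micro_f1(gold, pred)
r_at_2 = recall_at_k(gold, ranked, k=2)
r_at_3 = recall_at_k(gold, ranked, k=3)
```

A large gap between the two numbers, as reported here, usually means that the correct code is often among the top suggestions even when the model's final thresholded prediction misses it.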
Abstract
Medical coding translates clinical documentation into standardized codes for billing, research, and public health, but manual coding is time-consuming and error-prone. Existing automation efforts rely on small datasets that poorly represent real-world patient heterogeneity. We trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006–2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients, the model achieved a micro F1 of 71.8% and a top-10 recall of 95.5%. Performance varied by specialty (F1: 53–91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing predominantly as secondary diagnoses had markedly lower F1 scores. For three such codes (suicide-related behaviors, weight disorders, and…
Taxonomy
Topics: Medical Coding and Health Information · Machine Learning in Healthcare · Chronic Disease Management Strategies
