Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs
Akram Mustafa, Usman Naseem, and Mostafa Rahimi Azghadi

TL;DR
This paper assesses how well large language models assign ICD-10 codes to hospital discharge summaries, showing the limits of current models and the potential of reasoning-enhanced models to improve clinical coding accuracy.
Contribution
It evaluates 11 LLMs on a clinical coding task, showing that all of them fall well short of reliable performance and that models with structured reasoning tend to do better.
Findings
None of the models exceeded an F1 score of 57%.
Reasoning-based models generally outperformed non-reasoning models.
Performance decreased with increasing code specificity.
Abstract
This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable…
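The drop in F1 as code specificity increases can be illustrated with a small sketch. It scores the same predictions at several levels of the ICD-10 hierarchy by truncating each code to a prefix. The prefix lengths (leading letter, three-character category, full code), the micro-averaged F1, and the example codes are all assumptions made for illustration; the abstract does not state which levels or averaging scheme the study used, and the function names below are hypothetical.

```python
def truncate(code: str, level: int) -> str:
    """Drop the dot and keep the first `level` characters of an ICD-10 code."""
    return code.replace(".", "")[:level]


def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-document sets of codes."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_by_level(gold, pred, levels=(1, 3, 7)):
    """Score predictions at each prefix length (assumed hierarchy levels)."""
    scores = {}
    for level in levels:
        g = [{truncate(c, level) for c in doc} for doc in gold]
        p = [{truncate(c, level) for c in doc} for doc in pred]
        scores[level] = micro_f1(g, p)
    return scores


# Made-up gold and predicted codes for two summaries (not data from the paper).
gold = [{"I25.10", "E78.5"}, {"I10"}]
pred = [{"I25.2", "E78.5"}, {"I11.0"}]

for level, score in f1_by_level(gold, pred).items():
    print(f"prefix length {level}: micro-F1 = {score:.3f}")
```

In this toy example the predictions match perfectly on the leading letter, partly on the three-character category, and least on the full code, which mirrors the direction of the trend the study reports without reproducing any of its numbers.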
