Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

Akram Mustafa, Usman Naseem, and Mostafa Rahimi Azghadi

arXiv:2507.03001·cs.CL·July 8, 2025


TL;DR

This paper assesses how well large language models assign ICD-10 codes to hospital discharge summaries. It finds that current models remain limited, and that reasoning-enhanced models show potential to improve clinical coding accuracy.

Contribution

The paper evaluates 11 LLMs on ICD-10 coding of discharge summaries, quantifying the gap between current performance and reliable coding and showing the benefit of reasoning capabilities.

Findings

01

No model exceeded an F1 score of 57%.

02

Reasoning-based models generally outperformed non-reasoning models.

03

Performance decreased as code specificity increased.

Abstract

This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable…
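
The drop in F1 with increasing code specificity can be illustrated with a minimal sketch. It assumes that a hierarchy level corresponds to a prefix of the ICD-10 code (for example, the first 3 characters for the category) and that scores are micro-averaged; neither detail is confirmed by the abstract. The helper names `truncate` and `micro_f1` and the toy gold and predicted codes are hypothetical, not data or results from the paper.

```python
from typing import List, Set


def truncate(code: str, level: int) -> str:
    """Keep the first `level` characters of an ICD-10 code, ignoring the dot.

    Assumption: a hierarchy level is modeled as a code prefix.
    """
    return code.replace(".", "")[:level]


def micro_f1(gold: List[Set[str]], pred: List[Set[str]], level: int) -> float:
    """Micro-averaged F1 over documents after truncating codes to `level`."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_l = {truncate(c, level) for c in g}
        p_l = {truncate(c, level) for c in p}
        tp += len(g_l & p_l)   # codes predicted and present in the gold set
        fp += len(p_l - g_l)   # codes predicted but absent from the gold set
        fn += len(g_l - p_l)   # gold codes the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (illustrative only): two summaries with gold and predicted codes.
gold = [{"I25.10", "E78.5"}, {"I10"}]
pred = [{"I25.2", "E78.5"}, {"I11.0"}]

# Score at 1, 3, and 4 characters of specificity.
scores = {level: micro_f1(gold, pred, level) for level in (1, 3, 4)}
for level, score in scores.items():
    print(f"prefix length {level}: F1 = {score:.3f}")
```

In this toy data the predictions match the gold codes at coarse levels but diverge at finer ones, so the score falls as the prefix grows, which mirrors the trend the abstract reports.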
