The NordDRG AI Benchmark for Large Language Models
Tapio Pitkäranta

TL;DR
This paper introduces NordDRG-AI-Benchmark, a comprehensive open test bed for evaluating large language models' ability to understand and emulate hospital DRG grouping logic, a capability that is crucial for healthcare funding transparency.
Contribution
It provides the first public, rule-complete benchmark for DRG reasoning, including detailed tables, governance workflows, and exact-match evaluation tasks for LLMs.
Findings
GPT-5 achieves perfect scores on logic tasks
GPT-5 partially emulates NordDRG grouper logic
Benchmark enables reproducible evaluation of LLMs in healthcare funding
Abstract
Large language models (LLMs) are being piloted for clinical coding and decision support, yet no open benchmark targets the hospital-funding layer where Diagnosis-Related Groups (DRGs) determine reimbursement. In most OECD systems, DRGs route a substantial share of multi-trillion-dollar health spending through governed grouper software, making transparency and auditability first-order concerns. We release NordDRG-AI-Benchmark, the first public, rule-complete test bed for DRG reasoning. The package includes (i) machine-readable NordDRG definition tables spanning approximately 20 sheets and (ii) expert manuals and change-log templates that capture governance workflows. It exposes two suites: a 13-task Logic benchmark (code lookup, cross-table inference, grouping features, multilingual terminology, and CC/MCC validity checks) and a 13-task Grouper benchmark that requires full DRG grouper emulation…
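The summary above describes the benchmark tasks as exact-match evaluations. As a rough illustration only, exact-match scoring could be sketched as follows. The function names, the normalization rule (collapse whitespace, ignore case), and the DRG-style labels are all assumptions made for this example; they are not taken from the benchmark package, whose actual scoring rules may differ.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Collapse whitespace and uppercase the answer.

    Assumption: the comparison ignores case and stray whitespace.
    The real benchmark may compare answers more strictly.
    """
    return " ".join(answer.split()).upper()


def exact_match_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical placeholder labels, not real NordDRG codes.
preds = ["drg-001", "DRG-002 ", "DRG-009"]
refs = ["DRG-001", "DRG-002", "DRG-003"]
score = exact_match_score(preds, refs)
```

Here two of the three hypothetical predictions match their references, so `score` is 2/3. Under exact-match scoring a partially correct answer earns no credit, which suits tasks such as code lookup where only one answer is valid.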
