Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs
Qing Ding, Eric Hua Qing Zhang, Felix Jozsa, Julia Ive

TL;DR
This paper presents a validated, guideline-based dataset for evaluating clinical language models. The dataset was generated from NICE guidelines with the help of GPT and enables systematic assessment of models' clinical reasoning and guideline adherence.
Contribution
The paper introduces a standardised dataset derived from NICE guidelines for benchmarking clinical LLMs, addressing the lack of benchmarks for guideline-based clinical reasoning.
Findings
The dataset effectively evaluates LLMs' clinical reasoning.
Benchmarking shows variability in model performance.
The framework supports systematic clinical utility assessment.
Abstract
Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines across multiple diagnoses. The dataset was created with the help of GPT and contains realistic patient scenarios, as well as clinical questions. We benchmark a range of recent popular LLMs to showcase the validity of our dataset. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling
