Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs

Qing Ding; Eric Hua Qing Zhang; Felix Jozsa; Julia Ive

arXiv:2511.01053 · cs.CL · November 4, 2025

Open Access

TL;DR

This paper presents a validated dataset for evaluating clinical language models, built from NICE guidelines with the help of GPT, that enables systematic assessment of models' clinical reasoning and guideline adherence.

Contribution

The paper introduces a standardised dataset derived from NICE guidelines for benchmarking clinical LLMs, addressing the lack of benchmarks for guideline-based clinical reasoning.

Findings

1. The dataset effectively evaluates LLMs' clinical reasoning.
2. Benchmarking shows variability in model performance.
3. The framework supports systematic clinical utility assessment.

Abstract

Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines across multiple diagnoses. The dataset was created with the help of GPT and contains realistic patient scenarios, as well as clinical questions. We benchmark a range of recent popular LLMs to showcase the validity of our dataset. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling