A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Andong Tan; Shuyu Dai; Jinglu Wang; Fengtao Zhou; Yan Lu; Xi Wang; Yingcong Chen; Can Yang; Shujie Liu; Hao Chen

arXiv:2603.25196 · cs.CL · March 27, 2026

Open Access

TL;DR

This paper introduces CPGBench, a benchmark for evaluating LLMs' ability to detect and adhere to clinical practice guidelines in multi-turn medical conversations, revealing significant gaps in current models' performance.

Contribution

The paper presents the first systematic benchmark, CPGBench, assessing LLMs' clinical guideline detection and adherence, with extensive data and human validation.

Findings

01

Detection accuracy ranges from 71.1% to 89.6%.

02

Adherence rates vary from 21.8% to 63.2%.

03

Large gaps exist between knowing and applying guidelines.

Abstract

Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework for benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents published in the last decade by 9 countries/regions and 2 international organizations, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations together with the corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation accordingly to…
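
The record layout described in the abstract can be sketched as follows. This is a minimal illustration only: the class names, field names, and the `rates` helper are hypothetical and are not taken from CPGBench, and the abstract does not state how detection or adherence is judged, so the two boolean flags stand in for whatever per-conversation judgment the benchmark actually uses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Recommendation:
    """One extracted recommendation (illustrative fields, not CPGBench's schema)."""
    text: str
    institute: str                        # publication institute
    published: str                        # publication date, e.g. "2021-06"
    country: str
    specialty: str
    strength: Optional[str] = None        # recommendation strength, if stated
    evidence_level: Optional[str] = None  # evidence level, if stated


@dataclass(frozen=True)
class Outcome:
    """Hypothetical judgment of one model on one generated conversation."""
    recommendation: Recommendation
    detected: bool  # assumed flag: the model identified the relevant guideline
    adhered: bool   # assumed flag: the model's reply followed the recommendation


def rates(outcomes: Iterable[Outcome]) -> Tuple[float, float]:
    """Return (detection rate, adherence rate) as percentages over all outcomes."""
    items = list(outcomes)
    if not items:
        raise ValueError("no outcomes to score")
    n = len(items)
    detection = 100.0 * sum(o.detected for o in items) / n
    adherence = 100.0 * sum(o.adhered for o in items) / n
    return detection, adherence
```

Under these assumptions, the difference between the two returned percentages for a given model is one simple way to express the gap between knowing a guideline and applying it.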

Taxonomy

Topics: Clinical practice guidelines implementation · Electronic Health Records Systems · Artificial Intelligence in Healthcare and Education