LiveClin: A Live Clinical Benchmark without Leakage

Xidong Wang; Shuqi Guo; Yue Shen; Junying Chen; Jian Wang; Jinjie Gu; Ping Zhang; Lei Liu; Benyou Wang

arXiv:2602.16747·cs.LG·February 20, 2026

TL;DR

LiveClin is a continuously updated, real-world clinical benchmark for evaluating medical language models, addressing data contamination and obsolescence issues to better reflect actual clinical practice.

Contribution

We introduce LiveClin, a novel live benchmark built from current peer-reviewed cases, enabling realistic, ongoing evaluation of medical LLMs without data leakage.

Findings

01 · Top model achieved 35.7% case accuracy

02 · Human experts outperformed most models

03 · LiveClin presents a challenging, realistic evaluation environment

Abstract

The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human…
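The abstract reports Case Accuracy, but this excerpt does not define it. A minimal sketch, assuming a case counts as correct only when every one of its questions is answered correctly (an all-or-nothing reading that the text above does not confirm), shows why such a metric sits well below per-question accuracy. The function names and the toy data are illustrative and not taken from the paper.

```python
from collections import defaultdict


def case_accuracy(results):
    """Fraction of cases whose questions were ALL answered correctly.

    `results` is a list of (case_id, is_correct) pairs, one per question.
    Assumption: all-or-nothing scoring per case.
    """
    per_case = defaultdict(list)
    for case_id, is_correct in results:
        per_case[case_id].append(bool(is_correct))
    if not per_case:
        raise ValueError("no results given")
    solved = sum(all(flags) for flags in per_case.values())
    return solved / len(per_case)


def question_accuracy(results):
    """Plain fraction of correctly answered questions, ignoring case grouping."""
    flags = [bool(is_correct) for _, is_correct in results]
    if not flags:
        raise ValueError("no results given")
    return sum(flags) / len(flags)


# Hypothetical toy data: two cases, five questions, one mistake in case-1.
toy = [
    ("case-1", True), ("case-1", True), ("case-1", False),
    ("case-2", True), ("case-2", True),
]

print(case_accuracy(toy))      # 0.5  (only case-2 is fully correct)
print(question_accuracy(toy))  # 0.8  (4 of 5 questions correct)
```

With 6,605 questions over 1,407 case reports, a case carries about 4.7 questions on average, so under this assumed definition a single miss anywhere along the clinical pathway fails the whole case.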

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* Pilot study convincingly shows 10-point performance drop on post-cutoff data.
* The three-tier taxonomy (ICD-10 chapters, disease clusters, individual codes) enables multi-resolution analysis while ensuring broad disease representation.
* The 239-physician verification pipeline with both annotation and inspection phases demonstrates exceptional attention to clinical validity.
* The multimodal integration is high quality, where images are naturally embedded in clinical workflow.
* Reveals n

Weaknesses

* The ablation study shows AI generates more "challenging" questions (lower trivial ratio), but doesn't validate whether this difficulty stems from genuine clinical complexity or artifacts of the generation process.
* The zero-shot, conversational evaluation may disadvantage models not optimized for this specific format. It's possible that the performance differences reflect not clinical reasoning ability but adaptation to the evaluation format. Adding few-shot experiments would help.
* Mai

Reviewer 02 · Rating 6 · Confidence 5

Strengths

Contamination and the rapid evolution of medical knowledge are significant concerns for evaluation in the medical domain. This approach presents a solution to both of these issues. The dataset is sufficiently large, and the multistep approach is a welcome addition compared to previous evaluations that test zero-shot knowledge on a complete vignette. The dataset is also multimodal, integrating imaging, labs, and other signals. It was also validated by a large number of clinicians, which stren

Weaknesses

While the method is solid, I am concerned by the reliance on case reports, as, by definition, case reports are published to communicate unusual or rare cases to the medical community. This reliance on unusual/rare cases induces a bias in the knowledge and reasoning capabilities of the models. I am also concerned about the lack of a physician baseline to compare the accuracy of models with what is expected of an attending physician. The reported metric for case accuracy scores appears too strict

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The overall logic is fairly clear, and the writing is relatively well-structured.
2. Built from the latest peer-reviewed clinical cases and updated biannually, it effectively mitigates data leakage and knowledge obsolescence issues that plague static benchmarks, ensuring long-term clinical relevance of evaluations.
3. It simulates the patient care process (from initial consultation to long-term management) and integrates diverse multimodal data (images, tables, etc.), reflecting real-world c

Weaknesses

Major Comments

1. The benchmark is constructed using case reports from the first half of 2025 in the PubMed Central (PMC) Open Access subset. Could there still be potential data leakage risks for some newly released models such as GPT-5?
2. Do multiple physicians verify the same piece of data? If yes, what is the inter-annotator agreement (e.g., Cohen’s Kappa coefficient) among different physicians?
3. The paper proposes "updating the benchmark biannually" but lacks details on the specific c
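For reference, the agreement statistic named in the second comment can be computed as below. This is a minimal sketch of Cohen's kappa for two raters, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The label values and the two-rater setup are hypothetical; the page does not say how physician judgments were recorded or how many physicians saw each item.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two physicians on four generated questions.
physician_a = ["ok", "ok", "fix", "ok"]
physician_b = ["ok", "fix", "fix", "ok"]

print(cohens_kappa(physician_a, physician_b))  # 0.5
```

In the toy data the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5. For more than two raters per item, a statistic such as Fleiss' kappa would be the usual substitute.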

Code & Models

Datasets

AQ-MedAI/LiveClin
dataset· 658 dl

Taxonomy

Topics: Machine Learning in Healthcare · Electronic Health Records Systems · Artificial Intelligence in Healthcare and Education