LiveClin: A Live Clinical Benchmark without Leakage
Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, Benyou Wang

TL;DR
LiveClin is a continuously updated, real-world clinical benchmark for evaluating medical language models, addressing data contamination and obsolescence issues to better reflect actual clinical practice.
Contribution
We introduce LiveClin, a novel live benchmark built from current peer-reviewed cases, enabling realistic, ongoing evaluation of medical LLMs without data leakage.
Findings
Top model achieved 35.7% case accuracy
Human experts outperformed most models
LiveClin presents a challenging, realistic evaluation environment
Abstract
The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human…
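The abstract reports Case Accuracy but the excerpt here does not define it. A minimal sketch, assuming a case counts as solved only when every question attached to it is answered correctly (this all-or-nothing rule, the function names, and the toy data are assumptions for illustration, not the authors' evaluation code):

```python
from collections import defaultdict


def case_accuracy(results):
    """Fraction of cases in which every question was answered correctly.

    `results` is an iterable of (case_id, is_correct) pairs, one per question.
    The all-questions-correct rule is an assumed definition.
    """
    per_case = defaultdict(list)
    for case_id, is_correct in results:
        per_case[case_id].append(bool(is_correct))
    if not per_case:
        return 0.0
    solved = sum(all(answers) for answers in per_case.values())
    return solved / len(per_case)


def question_accuracy(results):
    """Plain per-question accuracy over the same (case_id, is_correct) pairs."""
    flags = [bool(is_correct) for _, is_correct in results]
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical toy data: three cases, five questions in total.
example = [
    ("case-1", True), ("case-1", True),
    ("case-2", True), ("case-2", False),
    ("case-3", False),
]
print(case_accuracy(example), question_accuracy(example))
```

Under this assumed definition a single wrong answer fails the whole case, so case-level scores can sit well below per-question accuracy.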
Peer Reviews
Decision: ICLR 2026 Poster
* Pilot study convincingly shows a 10-point performance drop on post-cutoff data.
* The three-tier taxonomy (ICD-10 chapters, disease clusters, individual codes) enables multi-resolution analysis while ensuring broad disease representation.
* The 239-physician verification pipeline, with both annotation and inspection phases, demonstrates exceptional attention to clinical validity.
* The multimodal integration is high quality: images are naturally embedded in the clinical workflow.
* Reveals n
* The ablation study shows AI generates more "challenging" questions (lower trivial ratio), but doesn't validate whether this difficulty stems from genuine clinical complexity or artifacts of the generation process.
* The zero-shot, conversational evaluation may disadvantage models not optimized for this specific format. The performance differences may reflect adaptation to the evaluation format rather than clinical reasoning ability. Adding few-shot experiments would help.
* Mai
Contamination and the rapid evolution of medical knowledge are significant concerns for the evaluation in the medical domain. This approach presents a solution to both of these issues. The dataset is sufficiently large, and the multistep approach is a welcome addition compared to previous evaluations that test zero-shot knowledge on a complete vignette. The dataset is also multimodal, integrating imaging, labs, and other signals. It was also validated by a large number of clinicians, which stren
While the method is solid, I am concerned by the reliance on case reports, as, by definition, case reports are published to communicate unusual or rare cases to the medical community. This reliance on unusual/rare cases induces a bias in the knowledge and reasoning capabilities of the models. I am also concerned about the lack of a physician baseline to compare the accuracy of models with what is expected of an attending physician. The reported metric for case accuracy scores appears too strict
1. The overall logic is fairly clear, and the writing is relatively well-structured.
2. Built from the latest peer-reviewed clinical cases and updated biannually, it effectively mitigates data leakage and knowledge obsolescence issues that plague static benchmarks, ensuring long-term clinical relevance of evaluations.
3. It simulates the patient care process (from initial consultation to long-term management) and integrates diverse multimodal data (images, tables, etc.), reflecting real-world c
Major Comments
1. The benchmark is constructed using case reports from the first half of 2025 in the PubMed Central (PMC) Open Access subset. Could there still be potential data leakage risks for some newly released models such as GPT-5?
2. Do multiple physicians verify the same piece of data? If yes, what is the inter-annotator agreement (e.g., Cohen’s Kappa coefficient) among different physicians?
3. The paper proposes "updating the benchmark biannually" but lacks details on the specific c
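The inter-annotator agreement statistic this reviewer asks about, Cohen's kappa, compares the observed agreement between two raters with the agreement expected by chance. A minimal sketch of the standard two-rater formula; the function name and the example labels are hypothetical and not drawn from the paper's annotation data:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the chance agreement implied by each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two physicians on four generated questions.
rater_1 = ["accept", "accept", "reject", "accept"]
rater_2 = ["accept", "reject", "reject", "accept"]
print(cohens_kappa(rater_1, rater_2))
```

Kappa is defined for exactly two raters; agreement across a larger panel would need a different statistic such as Fleiss' kappa.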
Taxonomy
Topics: Machine Learning in Healthcare · Electronic Health Records Systems · Artificial Intelligence in Healthcare and Education
