LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Zhiling Yan; Dingjie Song; Zhe Fang; Yisheng Ji; Xiang Li; Quanzheng Li; Lichao Sun

arXiv:2602.10367·cs.AI·February 12, 2026

TL;DR

LiveMedBench is a novel, continuously updated medical benchmark that ensures contamination-free evaluation of LLMs using automated, rubric-based assessment aligned with expert standards, addressing limitations of existing static benchmarks.

Contribution

We introduce LiveMedBench, a contamination-free, real-time medical benchmark with an automated rubric evaluation framework, enhancing the reliability and relevance of LLM assessments in clinical contexts.

Findings

01 Best LLM achieves only 39.2% accuracy on benchmark cases.

02 84% of models show performance decline on post-cutoff cases.

03 Contextual application issues are the main source of errors.

Abstract

The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, where they fail to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that harvests real-world clinical cases weekly from online medical communities, ensuring strict temporal separation from…
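
The temporal separation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the `Case` type, its `published` field, the `split_by_cutoff` helper, and all dates below are hypothetical, chosen only to show how cases might be partitioned around a model's training cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    """Hypothetical record for one harvested clinical case."""
    case_id: str
    published: date  # assumed: the date the case appeared online


def split_by_cutoff(cases, training_cutoff):
    """Partition cases into (pre, post) relative to a training cutoff.

    Cases published on or before the cutoff go to `pre`;
    cases published strictly after it go to `post`.
    """
    pre = [c for c in cases if c.published <= training_cutoff]
    post = [c for c in cases if c.published > training_cutoff]
    return pre, post


# Illustrative data only; these are not cases from the benchmark.
cases = [
    Case("a", date(2024, 5, 1)),
    Case("b", date(2025, 3, 15)),
    Case("c", date(2025, 11, 2)),
]
pre, post = split_by_cutoff(cases, training_cutoff=date(2025, 1, 1))
```

Under this sketch, only the `post` cases would be guaranteed unseen by a model with that cutoff, which is the kind of comparison behind the pre-cutoff versus post-cutoff finding listed above.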

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling