LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun

TL;DR
LiveMedBench is a novel, continuously updated medical benchmark that ensures contamination-free evaluation of LLMs using automated, rubric-based assessment aligned with expert standards, addressing limitations of existing static benchmarks.
Contribution
We introduce LiveMedBench, a contamination-free, real-time medical benchmark with an automated rubric evaluation framework, enhancing the reliability and relevance of LLM assessments in clinical contexts.
Findings
Best LLM achieves only 39.2% accuracy on benchmark cases.
84% of models show performance decline on post-cutoff cases.
Contextual application issues are the main source of errors.
Abstract
The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static and suffer from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora and inflate performance estimates; and (2) temporal misalignment, where a fixed test set fails to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both of which are inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, rubric-based benchmark that harvests real-world clinical cases from online medical communities every week, ensuring strict temporal separation from…
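The temporal-separation idea in the abstract, together with the finding on post-cutoff cases, can be illustrated with a minimal sketch. It is not the authors' pipeline: the `ClinicalCase` type, its `published` field, the `split_by_cutoff` helper, and the example dates are all hypothetical, and the sketch assumes only that each case carries a publication date that can be compared with a model's knowledge cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClinicalCase:
    # Hypothetical record type; the real benchmark schema is not given in the source.
    case_id: str
    published: date  # assumed: the date the case appeared online


def split_by_cutoff(cases, model_cutoff):
    """Partition cases into (pre_cutoff, post_cutoff) for one model.

    Cases published after the model's knowledge cutoff cannot have been
    in its training data, so they form the contamination-free subset.
    """
    pre = [c for c in cases if c.published <= model_cutoff]
    post = [c for c in cases if c.published > model_cutoff]
    return pre, post


# Illustrative data only; these are not cases or dates from the paper.
cases = [
    ClinicalCase("a", date(2024, 3, 1)),
    ClinicalCase("b", date(2025, 1, 15)),
    ClinicalCase("c", date(2025, 6, 30)),
]
pre, post = split_by_cutoff(cases, model_cutoff=date(2024, 12, 31))
```

Comparing a model's score on `pre` with its score on `post` is one way to look for the kind of post-cutoff decline reported in the findings, under the assumption that both subsets are of comparable difficulty.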
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling
