QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Yao Wu; Kangping Yin; Liang Dong; Zhenxin Ma; Shuting Xu; Xuehai Wang; Yuxuan Jiang; Tingting Yu; Yunqing Hong; Jiayi Liu; Rianzhe Huang; Shuxin Zhao; Haiping Hu; Wen Shang; Jian Xu; Guanjun Jiang

arXiv:2603.13691·cs.CL·March 17, 2026

Open Access

TL;DR

QuarkMedBench is a comprehensive, real-world medical benchmark for evaluating large language models' ability to handle complex, unstructured medical queries with an automated, evidence-based scoring system.

Contribution

The paper introduces QuarkMedBench, a novel benchmark with an automated, multi-faceted scoring framework for assessing LLMs on real-world medical queries, addressing limitations of existing exam-based evaluations.

Findings

01. Achieves 91.8% concordance with clinical experts.

02. Reveals significant performance gaps among current models.

03. Provides a scalable, dynamic evaluation framework.

Abstract

While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to capture the unstructured, ambiguous, and long-tail complexities inherent in genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. We compiled a massive dataset spanning Clinical Care, Wellness Health, and Professional Inquiry, comprising 20,821 single-turn queries and 3,853 multi-turn sessions. To objectively evaluate open-ended answers, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical…
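
To make the idea of scoring an open-ended answer against fine-grained rubrics more concrete, here is a minimal sketch. It is not the paper's framework: the `Rubric` fields, the weights, and the keyword-matching check are all hypothetical stand-ins, and the multi-model consensus and evidence retrieval steps described in the abstract are not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """One scoring criterion for a query (hypothetical structure)."""
    criterion: str              # human-readable description of the key point
    keywords: tuple[str, ...]   # crude stand-in for a real judge model
    weight: float               # relative importance of this key point


def rubric_met(answer: str, rubric: Rubric) -> bool:
    """Return True if any keyword appears in the answer (case-insensitive)."""
    text = answer.lower()
    return any(keyword.lower() in text for keyword in rubric.keywords)


def score_answer(answer: str, rubrics: list[Rubric]) -> float:
    """Weighted fraction of rubrics satisfied, in the range [0, 1]."""
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in rubrics if rubric_met(answer, r))
    return earned / total


if __name__ == "__main__":
    # Illustrative rubrics only; not drawn from the benchmark.
    rubrics = [
        Rubric("Advises seeing a clinician", ("see a doctor", "clinician"), 2.0),
        Rubric("Mentions hydration", ("fluids", "hydration"), 1.0),
        Rubric("Mentions rest", ("rest",), 1.0),
    ]
    print(score_answer("Please see a doctor and get plenty of rest.", rubrics))
```

A real evaluator would replace `rubric_met` with a judge that can recognize paraphrases and contradictions; keyword matching is used here only to keep the sketch self-contained.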

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling