MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications

Qing He (1), Dongsheng Bi (1), Jianrong Lu (1, 2), Minghui Yang (1), Zixiao Chen (1), Jiacheng Lu (1), Jing Chen (1), Nannan Du (1), Xiao Cu (1), Sijing Wu (3), Peng Xiang (4), Yinyin Hu (3), Yi Guo (3), Chunpu Li (3), Shaoyang Li (1), Zhuo Dong (1), Ming Jiang (1), Shuai Guo (1), Liyun Feng (1), Jin Peng (1), Jian Wang (1), Jinjie Gu (1), Junwei Liu (1, 5)

(1) Ant Group, Hangzhou, China; (2) Zhejiang University, Hangzhou, China; (3) Health Information Center of Zhejiang Province, Hangzhou, China; (4) Department of AI and IT, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China; (5) School of Software and Microelectronics, Peking University, Beijing, China

arXiv:2601.06193 · cs.LG · January 13, 2026



Open Access

TL;DR

This paper introduces MLB, a comprehensive benchmark for evaluating large language models in clinical settings, emphasizing real-world utility through scenario-based assessments and expert-validated evaluation methods.

Contribution

It presents a new scenario-driven benchmark with diverse datasets and a specialized judge model, addressing gaps in existing static knowledge tests for clinical LLM evaluation.

Findings

01. Top model achieves 77.3% accuracy overall
02. Performance drops to 61.3% in patient-facing scenarios
03. Targeted training improves safety scores to 90.6%

Abstract

The proliferation of Large Language Models (LLMs) presents transformative potential for healthcare, yet practical deployment is hindered by the absence of frameworks that assess real-world clinical utility. Existing benchmarks test static knowledge and fail to capture the dynamic, application-oriented capabilities required in clinical practice. To bridge this gap, we introduce the Medical LLM Benchmark (MLB), a comprehensive benchmark that evaluates LLMs on both foundational knowledge and scenario-based reasoning. MLB is structured around five core dimensions: Medical Knowledge (MedKQA), Safety and Ethics (MedSE), Medical Record Understanding (MedRU), Smart Services (SmartServ), and Smart Healthcare (SmartCare). The benchmark integrates 22 datasets (17 newly curated) from diverse Chinese clinical sources, covering 64 clinical specialties. Its design features a rigorous curation pipeline…
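
The five dimensions named in the abstract can be written down as a small lookup table. The sketch below is illustrative only: the dimension abbreviations and names come from the abstract, while `MLB_DIMENSIONS`, `macro_average`, and the idea of averaging dataset scores first within a dimension and then across dimensions are assumptions made for this example. The abstract does not say how MLB aggregates its scores.

```python
from collections import defaultdict
from statistics import mean

# Abbreviation -> dimension name, exactly as listed in the abstract.
MLB_DIMENSIONS = {
    "MedKQA": "Medical Knowledge",
    "MedSE": "Safety and Ethics",
    "MedRU": "Medical Record Understanding",
    "SmartServ": "Smart Services",
    "SmartCare": "Smart Healthcare",
}


def macro_average(results):
    """Group per-dataset scores by dimension and average them.

    `results` is an iterable of (dimension_abbreviation, score) pairs,
    one pair per dataset. Returns a tuple of
    (mean score per dimension, unweighted mean over those dimensions).

    This aggregation rule is hypothetical; it is not taken from the paper.
    """
    by_dimension = defaultdict(list)
    for dimension, score in results:
        if dimension not in MLB_DIMENSIONS:
            raise KeyError(f"unknown dimension: {dimension}")
        by_dimension[dimension].append(score)

    per_dimension = {d: mean(scores) for d, scores in by_dimension.items()}
    return per_dimension, mean(per_dimension.values())
```

Averaging within each dimension before averaging across dimensions keeps a dimension with many datasets from dominating the overall figure; a benchmark could equally weight every dataset the same, so treat this as one possible convention.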


Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling