Large Language Model Benchmarks in Medical Tasks

Lawrence K.Q. Yan; Qian Niu; Ming Li; Yichao Zhang; Caitlyn Heqi Yin; Cheng Fei; Benji Peng; Ziqian Bi; Pohsun Feng; Keyu Chen; Tianyang Wang; Yunze Wang; Silin Chen; Ming Liu; Junyu Liu; Xinyuan Song; Riyang Bao; Zekun Jiang; Ziyuan Qin

arXiv:2410.21348 · cs.CL · November 13, 2025 · 3 cites

Open Access

TL;DR

This paper surveys benchmark datasets used to evaluate large language models on medical tasks across text, image, and multimodal data, and highlights their role in advancing clinical AI applications.

Contribution

It provides a comprehensive categorization and analysis of medical benchmark datasets, discussing their significance, challenges, and future opportunities for multimodal medical AI development.

Findings

01. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert.

02. Benchmarks have facilitated progress in report generation and clinical summarization.

03. The survey identifies challenges in medical benchmarks, such as language diversity and data synthesis.

Abstract

With the increasing application of large language models (LLMs) in the medical domain, evaluating these models' performance using benchmark datasets has become crucial. This paper presents a comprehensive survey of various benchmark datasets employed in medical LLM tasks. These datasets span multiple modalities including text, image, and multimodal benchmarks, focusing on different aspects of medical knowledge such as electronic health records (EHRs), doctor-patient dialogues, medical question-answering, and medical image captioning. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advancements in tasks like medical…

Taxonomy

Topics: Topic Modeling