MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Yuxin Zuo; Shang Qu; Yifei Li; Zhangren Chen; Xuekai Zhu; Ermo Hua; Kaiyan Zhang; Ning Ding; Bowen Zhou

arXiv:2501.18362 · cs.AI · June 9, 2025 · 3 cites

TL;DR

MedXpertQA is a comprehensive, expert-level medical reasoning benchmark with multimodal questions, designed to rigorously evaluate advanced medical knowledge and reasoning capabilities of AI models.

Contribution

The paper introduces MedXpertQA, a challenging new benchmark of multimodal and specialty-specific questions that addresses limitations of previous datasets and includes expert validation.

Findings

01

Eighteen models were evaluated on MedXpertQA and showed varying reasoning abilities.

02

MedXpertQA's multimodal and specialty-focused design enhances assessment of medical AI.

03

Benchmark sets a new standard for expert-level medical AI evaluation.

Abstract

We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert…

Videos

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding · slideslive

Taxonomy

Topics: Biomedical Text Mining and Ontologies