MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou

TL;DR
MedXpertQA is a comprehensive, expert-level medical reasoning benchmark with multimodal questions, designed to rigorously evaluate advanced medical knowledge and reasoning capabilities of AI models.
Contribution
The paper introduces MedXpertQA, a new challenging benchmark with multimodal and specialty-specific questions, addressing limitations of previous datasets and including expert validation.
Findings
Eighteen models were evaluated on MedXpertQA, showing varying reasoning abilities.
MedXpertQA's multimodal and specialty-focused design enhances assessment of medical AI.
The benchmark sets a new standard for expert-level medical AI evaluation.
Abstract
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert…
Code & Models
- 🤗 google/medgemma-1.5-4b-it (model) · 86k dl · ♡ 536
- 🤗 google/medgemma-4b-it (model) · 170k dl · ♡ 925
- 🤗 unsloth/medgemma-27b-it-GGUF (model) · 4.4k dl · ♡ 38
- 🤗 google/medgemma-4b-pt (model) · 1.1k dl · ♡ 148
- 🤗 google/medgemma-27b-text-it (model) · 37k dl · ♡ 412
- 🤗 google/medgemma-27b-it (model) · 107k dl · ♡ 330
- 🤗 pszemraj/medgemma-4b-it-heretic (model) · 46 dl · ♡ 5
- 🤗 pszemraj/medgemma-27b-text-heretic_med (model) · 11 dl · ♡ 5
- 🤗 unsloth/medgemma-1.5-4b-it-GGUF (model) · 6.7k dl · ♡ 33
- 🤗 unsloth/medgemma-27b-text-it (model) · 519 dl · ♡ 7
Taxonomy
Topics: Biomedical Text Mining and Ontologies
