MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

Junzhi Ning; Jiashi Lin; Yingying Fang; Wei Li; Jiyao Liu; Cheng Tang; Chenglong Ma; Wenhao Tang; Tianbin Li; Ziyan Huang; Guang Yang; Junjun He

arXiv:2604.10755·cs.CV·May 14, 2026


TL;DR

This paper introduces MMRareBench, a comprehensive benchmark for evaluating multimodal and multi-image clinical reasoning in rare diseases, revealing significant gaps in current model capabilities.

Contribution

It presents the first joint evaluation benchmark for multimodal and multi-image reasoning in rare diseases, including curated data and a systematic assessment of 23 models.

Findings

01. Medical models lag behind general-purpose models in multi-image tasks.

02. Treatment planning performance is universally low across models.

03. Fine-tuning improves diagnosis but reduces multi-image reasoning ability.

Abstract

Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case…
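
As a rough illustration of the benchmark structure the abstract describes, the sketch below models one question-answer item and the four tracks in Python. The class names, field names, and record layout are assumptions made for illustration and are not taken from the released dataset; only the four track names and the two totals (1,756 question-answer pairs, 7,958 images) come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum


class Track(Enum):
    """The four workflow-aligned tracks named in the abstract."""
    DIAGNOSIS = "diagnosis"
    TREATMENT_PLANNING = "treatment planning"
    CROSS_IMAGE_EVIDENCE_ALIGNMENT = "cross-image evidence alignment"
    EXAMINATION_SUGGESTION = "examination suggestion"


@dataclass
class BenchmarkItem:
    """Hypothetical record layout; the real dataset schema may differ."""
    track: Track
    question: str
    answer: str
    image_paths: list[str] = field(default_factory=list)


# Totals reported in the abstract.
NUM_QA_PAIRS = 1756
NUM_IMAGES = 7958


def images_per_qa_pair(num_images: int, num_qa_pairs: int) -> float:
    """Ratio of total images to total question-answer pairs."""
    return num_images / num_qa_pairs


ratio = images_per_qa_pair(NUM_IMAGES, NUM_QA_PAIRS)
print(f"{ratio:.2f} images per question-answer pair on average")
```

The ratio of the two reported totals is about 4.5 images per question-answer pair, which is consistent with the multi-image framing. It is only a ratio of totals: the abstract does not say how images are distributed across items or whether images are shared between questions.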


Code & Models

Datasets

junzhin/MMrarebench
dataset · 226 downloads
