MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation

Junjie Yang; Yuhao Yan; Gang Wu; Yuxuan Wang; Ruoyu Liang; Xinjie Jiang; Xiang Wan; Fenglei Fan; Yongquan Zhang; Feiwei Qin; Changmiao Wang

arXiv:2511.13135·cs.CV·November 19, 2025

Open Access

TL;DR

MedGEN-Bench is a comprehensive multimodal benchmark designed to evaluate open-ended medical image and text generation, emphasizing complex reasoning and clinical relevance across multiple modalities and tasks.

Contribution

The paper introduces a new benchmark with diverse, expert-validated data and a multi-faceted evaluation framework for assessing multimodal medical AI systems.

Findings

1. Existing models show limited performance on complex clinical tasks.

2. The benchmark reveals significant gaps in current multimodal medical AI capabilities.

3. Evaluation metrics highlight the importance of clinical relevance in generative tasks.

Abstract

As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Healthcare and Education