MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing
Minghao Liu, Zhitao He, Zhiyuan Fan, Qingyun Wang, Yi R. Fung

TL;DR
MedEBench is a benchmark for evaluating the reliability of text-guided medical image editing. It addresses the lack of standardized evaluation in this clinically relevant area by providing dedicated evaluation metrics and diagnostic analysis.
Contribution
It provides the first standardized evaluation framework and diagnostic analysis for assessing and improving the reliability of text-guided medical image editing models.
Findings
Seven state-of-the-art models show consistent failure patterns.
Attention alignment reveals common mislocalization issues.
Benchmark covers 70 editing tasks across 13 anatomical regions.
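One simple way to quantify the mislocalization mentioned above is to measure how much a model's attention overlaps the intended edit region. The sketch below is an illustration only, not the paper's procedure: the function name `attention_roi_iou`, the 0.5 threshold, and the nested-list image format are all assumptions made for this example.

```python
from typing import List


def attention_roi_iou(attention: List[List[float]],
                      roi: List[List[int]],
                      threshold: float = 0.5) -> float:
    """Intersection-over-union between thresholded attention and an ROI mask.

    Hypothetical helper: `attention` holds values in [0, 1], `roi` holds
    1 inside the intended edit region and 0 elsewhere.
    """
    intersection = 0
    union = 0
    for attn_row, mask_row in zip(attention, roi):
        for a, m in zip(attn_row, mask_row):
            attended = a >= threshold   # pixel the model focuses on
            inside = m == 1             # pixel that should be edited
            if attended and inside:
                intersection += 1
            if attended or inside:
                union += 1
    # No attended pixels and an empty ROI: define the overlap as 0.0
    return intersection / union if union else 0.0


# Toy 2x2 case: attention covers the top row, the ROI covers the left column
attention_map = [[0.9, 0.8],
                 [0.1, 0.0]]
roi_mask = [[1, 0],
            [1, 0]]
iou = attention_roi_iou(attention_map, roi_mask)
```

A low value would suggest the model attends to the wrong area, although the appropriate threshold depends on how the attention maps are normalized.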
Abstract
Text-guided image editing has seen significant progress in natural image domains, but its application in medical imaging remains limited and lacks standardized evaluation frameworks. Such editing could revolutionize clinical practices by enabling personalized surgical planning, enhancing medical education, and improving patient communication. To bridge this gap, we introduce MedEBench, a robust benchmark designed to diagnose reliability in text-guided medical image editing. MedEBench consists of 1,182 clinically curated image-prompt pairs covering 70 distinct editing tasks and 13 anatomical regions. It contributes in three key areas: (1) a clinically grounded evaluation framework that measures Editing Accuracy, Context Preservation, and Visual Quality, complemented by detailed descriptions of intended edits and corresponding Region-of-Interest (ROI) masks; (2) a comprehensive…
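To make the role of the ROI masks concrete, the sketch below scores how well an edit leaves the area outside the mask untouched. It is a minimal illustration under stated assumptions, not the benchmark's actual Context Preservation metric: the function name `context_preservation`, the use of mean absolute pixel difference, and the nested-list grayscale format are all hypothetical choices for this example.

```python
from typing import List

Image = List[List[float]]  # grayscale image, pixel intensities in [0, 1]
Mask = List[List[int]]     # ROI mask: 1 inside the intended edit region, 0 elsewhere


def context_preservation(original: Image, edited: Image, roi: Mask) -> float:
    """Return 1 minus the mean absolute pixel change outside the ROI.

    Hypothetical proxy score: 1.0 means the context is untouched,
    lower values mean the edit leaked outside the intended region.
    """
    total_change = 0.0
    outside_pixels = 0
    for orig_row, edit_row, mask_row in zip(original, edited, roi):
        for o, e, m in zip(orig_row, edit_row, mask_row):
            if m == 0:  # only pixels outside the edit region count
                total_change += abs(o - e)
                outside_pixels += 1
    if outside_pixels == 0:
        return 1.0  # ROI covers the whole image, so no context exists
    return 1.0 - total_change / outside_pixels


# Toy 2x2 case: the edit changes the ROI pixel and also leaks into one corner
original_img = [[0.0, 0.0],
                [0.0, 0.0]]
edited_img = [[1.0, 0.0],
              [0.0, 0.5]]
roi = [[1, 0],
       [0, 0]]
score = context_preservation(original_img, edited_img, roi)
```

The change inside the ROI is ignored by design, so the score penalizes only the leak into the bottom-right pixel.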
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging
