EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Shengyuan Liu; Boyun Zheng; Wenting Chen; Zhihao Peng; Zhenfei Yin; Jing Shao; Jiancong Hu; Yixuan Yuan

arXiv:2505.23601 · cs.CV · September 25, 2025

Open Access · 3 Datasets · 1 Video

TL;DR

EndoBench is a comprehensive benchmark designed to evaluate multi-modal large language models across diverse endoscopic scenarios and tasks, revealing current models' strengths and gaps compared to human clinicians.

Contribution

This work introduces EndoBench, the first extensive benchmark covering multiple endoscopic scenarios and tasks to assess MLLMs' clinical capabilities in realistic settings.

Findings

01. Proprietary MLLMs outperform open-source and medical models but lag behind clinicians.

02. Fine-tuning improves task accuracy significantly.

03. Model performance is affected by prompt format and task complexity.

Abstract

Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our…
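As a rough illustration of how a benchmark organized by scenario and task might be scored, the sketch below groups VQA records and reports accuracy per group. The record fields (`scenario`, `task`, `answer`, `prediction`), the case-insensitive exact-match rule, and the sample records are all assumptions made for illustration; they are not EndoBench's actual schema or evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VQARecord:
    # Hypothetical fields; the real EndoBench schema may differ.
    scenario: str
    task: str
    answer: str      # reference answer
    prediction: str  # model output


def accuracy_by_group(records):
    """Return {(scenario, task): accuracy} using case-insensitive exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r.scenario, r.task)
        total[key] += 1
        if r.prediction.strip().lower() == r.answer.strip().lower():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Made-up placeholder records, for illustration only.
sample = [
    VQARecord("scenario_a", "task_1", "B", "B"),
    VQARecord("scenario_a", "task_1", "A", "C"),
    VQARecord("scenario_b", "task_2", "D", "d "),
]
scores = accuracy_by_group(sample)
```

Grouping by `(scenario, task)` keeps per-task results separate, so a strong score in one scenario does not mask a weak one elsewhere.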

Videos

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis · SlidesLive

Taxonomy

Topics: Colorectal Cancer Screening and Detection · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training