GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards

Yan Zhu; Te Luo; Pei-Yao Fu; Zhen Zhang; Zi-Long Wang; Yi-Fan Qu; Zi-Han Geng; Jia-Qi Xu; Lu Yao; Li-Yun Ma; Wei Su; Wei-Feng Chen; Quan-Lin Li; Shuo Wang; Ping-Hong Zhou

arXiv:2601.08183 · cs.CV · January 15, 2026

Open Access

TL;DR

GI-Bench is a comprehensive benchmark evaluating multimodal large language models in gastrointestinal endoscopy, revealing their strengths in diagnostic reasoning but limitations in spatial grounding and factual accuracy compared to human experts.

Contribution

This work introduces GI-Bench, a new panoramic benchmark for assessing MLLMs in clinical endoscopy workflows, highlighting their performance gaps and potential for future improvement.

Findings

01

Gemini-3-Pro achieved state-of-the-art performance.

02

Models outperformed trainees in diagnostic reasoning.

03

Human endoscopists significantly outperformed models in lesion localization.

Abstract

Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. This study aimed to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and to determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641)…
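For readers unfamiliar with the two quantitative metrics named in the abstract, the following is a minimal sketch of how Macro-F1 and mIoU are conventionally computed. It is not the authors' evaluation code. The function names, the class labels, and the axis-aligned bounding-box format `(x_min, y_min, x_max, y_max)` are assumptions made for illustration; the abstract does not say whether localization was scored on boxes or masks.

```python
from typing import Sequence, Tuple

# Assumed box format (hypothetical): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predicted and reference boxes."""
    pairs = list(zip(preds, targets))
    return sum(box_iou(p, t) for p, t in pairs) / len(pairs) if pairs else 0.0
```

Because Macro-F1 averages per-class scores without weighting by class frequency, rare lesion categories count as much as common ones, which matters for a benchmark with 20 fine-grained categories.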

Taxonomy

Topics: Colorectal Cancer Screening and Detection · AI in cancer detection · Artificial Intelligence in Healthcare and Education