SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi; Tae Kyeong Jeong; Garam Kim; Jaemin Lee; Yeongyoon Koh; In Cheul Choi; Jae-Ho Chung; Jong Woong Park; Juyoun Park

arXiv:2511.21339·cs.CV·November 27, 2025

TL;DR

SurgMLLMBench introduces a comprehensive multimodal benchmark dataset for surgical scene understanding, combining pixel-level segmentation and VQA annotations across various surgical domains to facilitate development and evaluation of interactive surgical AI models.

Contribution

The paper presents SurgMLLMBench, a unified multimodal benchmark with new datasets and annotations, enabling consistent evaluation of surgical scene understanding models beyond traditional VQA tasks.

Findings

01. Models trained on SurgMLLMBench perform well across domains.

02. The benchmark supports richer visual-conversational interactions.

03. The dataset enhances reproducible evaluation in surgical AI research.

Abstract

Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling