SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

Gui Wang; YongSong Zhou; Kaijun Deng; Wooi Ping Cheah; Rong Qu; Jianfeng Ren; Linlin Shen

arXiv:2604.20319 · cs.CV · April 23, 2026

TL;DR

SurgCoT introduces a comprehensive benchmark to evaluate and improve multi-modal large language models' spatiotemporal reasoning in surgical videos, covering diverse procedures and reasoning tasks.

Contribution

This work presents SurgCoT, the first unified benchmark for chain-of-thought reasoning in surgical videos, with detailed annotations and evaluation of leading models.

Findings

01 Commercial models outperform open-source and medical-specialized models.

02 Significant gaps remain in surgical chain-of-thought reasoning.

03 SurgCoT effectively evaluates and enhances spatiotemporal reasoning capabilities.

Abstract

Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps…
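The Question-Option-Knowledge-Clue-Answer protocol described above can be pictured as a simple record type. The following is a minimal sketch only: the class name `CoTItem`, its field names, and the `accuracy_by_dimension` helper are assumptions made for illustration and are not taken from the SurgCoT repository or the paper. Only the five dimension names and the five annotation fields come from the abstract.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# The five reasoning dimensions named in the abstract.
DIMENSIONS = (
    "Causal Action Ordering",
    "Cue-Action Alignment",
    "Affordance Mapping",
    "Micro-Transition Localization",
    "Anomaly Onset Tracking",
)


@dataclass(frozen=True)
class CoTItem:
    """Hypothetical Question-Option-Knowledge-Clue-Answer record (illustrative layout)."""

    dimension: str            # one of DIMENSIONS
    question: str             # Question
    options: tuple[str, ...]  # Option: candidate answers
    knowledge: str            # Knowledge: background context
    clue: str                 # Clue: spatiotemporal evidence
    answer: str               # Answer: must be one of the options

    def __post_init__(self) -> None:
        # Reject records that fall outside the protocol as sketched here.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the options")


def accuracy_by_dimension(items: list[CoTItem], predictions: list[str]) -> dict[str, float]:
    """Fraction of exact-match answers per reasoning dimension (assumed metric)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.dimension] += 1
        correct[item.dimension] += int(pred == item.answer)
    return {dim: correct[dim] / total[dim] for dim in total}
```

Grouping scores by dimension, rather than reporting one pooled number, is one way to surface the kind of per-skill gaps the abstract mentions; the paper's actual scoring procedure may differ.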

Code & Models

Repositories

CVI-SZU/SurgCoT (GitHub)
