SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Shi Li; Vinkle Srivastav; Nicolas Chanel; Saurav Sharma; Nabani Banik; Lorenzo Arboit; Kun Yuan; Pietro Mascagni; Nicolas Padoy

arXiv:2603.29962·cs.CV·May 5, 2026

TL;DR

SurgTEMP is a multimodal LLM framework for temporal-aware surgical video question answering. It targets the temporal semantics that static-frame surgical VQA overlooks and covers intraoperative assessment tasks ranging from basic perception to high-level assessment.

Contribution

SurgTEMP introduces a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks), a specialized training scheme, and CholeVidQA-32K, a large annotated dataset for surgical VQA.

Findings

01. SurgTEMP outperforms existing models on surgical VQA tasks.

02. The CholeVidQA-32K dataset enables comprehensive evaluation across multiple surgical assessment tasks.

Abstract

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP)…
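
To make the idea of query-guided token selection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how tokens are scored or how the memory banks are built. The sketch assumes cosine similarity between each visual token and a text query embedding, keeps the top-scoring tokens per frame as a stand-in "spatial" bank, and keeps the top-scoring frames as a stand-in "temporal" bank. The function names, parameters, and toy embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(frames, query, k_spatial=2, k_temporal=2):
    """Hypothetical query-guided selection (an assumption, not the paper's method).

    frames: list of frames, each a list of token embedding vectors.
    query:  text query embedding.
    Returns (spatial_bank, temporal_bank):
      spatial_bank  - (frame_idx, token_idx) pairs for the top-k tokens in each frame
      temporal_bank - indices of the top-k frames, ranked by their best token score
    """
    spatial_bank = []
    frame_scores = []
    for f_idx, tokens in enumerate(frames):
        scores = [cosine(tok, query) for tok in tokens]
        ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
        # Keep the selected tokens in their original spatial order.
        spatial_bank.extend((f_idx, t) for t in sorted(ranked[:k_spatial]))
        frame_scores.append(max(scores) if scores else float("-inf"))
    ranked_frames = sorted(
        range(len(frames)), key=lambda i: frame_scores[i], reverse=True
    )
    # Keep the selected frames in chronological order.
    temporal_bank = sorted(ranked_frames[:k_temporal])
    return spatial_bank, temporal_bank


# Toy 2-D embeddings, made up for illustration only.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],    # frame 0
    [[0.0, 1.0], [-1.0, 0.0], [0.5, -0.5]],  # frame 1
    [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]],   # frame 2
]
query = [1.0, 0.0]

spatial, temporal = select_tokens(frames, query)
```

In this toy run, frames 0 and 2 contain the tokens most aligned with the query, so they form the temporal bank, while each frame contributes its two best-matching tokens to the spatial bank. A real system would operate on learned embeddings and would likely use a learned scoring function rather than raw cosine similarity.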

Code & Models

Repositories

https://camma-public.github.io/SurgTEMP
