TL;DR
SurgTEMP is a multimodal LLM framework for temporally aware surgical video question answering. It targets two gaps in prior surgical VQA work: the temporal semantics that static-frame analysis overlooks, and the diversity of intraoperative assessment tasks.
Contribution
It introduces a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks), a specialized training scheme, and a large annotated dataset for surgical VQA.
Findings
SurgTEMP outperforms existing models on surgical VQA tasks.
The CholeVidQA-32K dataset enables comprehensive evaluation across multiple surgical assessment tasks.
Abstract
Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP)…
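The abstract names a query-guided token selection module that fills spatial and temporal memory banks, but the excerpt here does not say how tokens are scored or selected. The sketch below is one plausible reading, not the authors' method: it assumes cosine similarity between a query embedding and visual token embeddings, keeps the top-k tokens per frame as a "spatial bank", and keeps the top-k frames (scored by their mean-pooled token vector) as a "temporal bank". All function names, parameters, and the scoring rule are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tokens(query, tokens, k):
    """Indices of the k tokens most similar to the query, returned in original order.

    ASSUMPTION: top-k by cosine similarity stands in for the paper's
    (unspecified here) query-guided selection rule.
    """
    ranked = sorted(range(len(tokens)),
                    key=lambda i: cosine(query, tokens[i]),
                    reverse=True)
    return sorted(ranked[:k])


def build_memory(query, frames, k_spatial, k_temporal):
    """Build two hypothetical memory banks from per-frame token vectors.

    frames: list of frames, each a list of token vectors.
    Returns (spatial, temporal):
      spatial  - for each frame, indices of its k_spatial best tokens
      temporal - indices of the k_temporal best frames, scored by the
                 mean-pooled token vector of each frame
    """
    spatial = [select_tokens(query, frame, k_spatial) for frame in frames]
    pooled = [[sum(col) / len(frame) for col in zip(*frame)] for frame in frames]
    temporal = select_tokens(query, pooled, k_temporal)
    return spatial, temporal


# Toy usage: 3 frames, 2 tokens each, 2-dimensional embeddings.
query = [1.0, 0.0]
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.2, 1.0]],
    [[1.0, 0.0], [1.0, 0.1]],
]
spatial, temporal = build_memory(query, frames, k_spatial=1, k_temporal=2)
```

In this toy setup the selection keeps one token per frame and the two frames whose pooled representation best matches the query. A real system would operate on learned embeddings from a vision encoder and text encoder rather than hand-written vectors.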
