SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Mauro Orazio Drago; Luca Carlini; Pelinsu Celebi Balyemez; Dennis Pierantozzi; Chiara Lena; Cesare Hassan; Danail Stoyanov; Elena De Momi; Sophia Bano; Mobarak I. Hoque

arXiv:2511.03325 · cs.CV · April 24, 2026

TL;DR

SurgViVQA is a temporally-grounded VideoQA model for surgical scenes. It combines a Masked Video–Text Encoder with a large language model to improve understanding of dynamic intraoperative events.

Contribution

The paper presents SurgViVQA, a temporally-aware surgical VideoQA model, together with REAL-Colon-VQA, a curated colonoscopic video dataset. Together they extend scene understanding from static images to dynamic surgical video.

Findings

01. Outperforms existing image-based VQA models in keyword accuracy.

02. Achieves improvements of +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA.

03. Remains robust to variations in question phrasing.
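
The findings above are reported in terms of keyword accuracy, which this summary does not define. The sketch below assumes one plausible reading: an answer counts as correct when the ground-truth keyword appears in the generated answer as a whole word or phrase, ignoring case and punctuation. The function names `normalize` and `keyword_accuracy` and the sample answers are hypothetical, and the paper's actual evaluation protocol may differ.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def keyword_accuracy(predictions: list[str], keywords: list[str]) -> float:
    """Fraction of predictions that contain their ground-truth keyword.

    ASSUMPTION: whole-word (or whole-phrase) containment after normalization.
    This is an illustrative definition, not the paper's official metric.
    """
    if len(predictions) != len(keywords):
        raise ValueError("predictions and keywords must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for pred, kw in zip(predictions, keywords):
        # Pad with spaces so "for" does not match inside "forward".
        if f" {normalize(kw)} " in f" {normalize(pred)} ":
            hits += 1
    return hits / len(predictions)


# Hypothetical generated answers and reference keywords.
example_predictions = [
    "The scope is moving forward.",
    "No polyp is visible.",
    "The tool is a snare.",
]
example_keywords = ["forward", "polyp", "forceps"]

# Two of the three answers contain their keyword.
example_score = keyword_accuracy(example_predictions, example_keywords)
print(f"keyword accuracy: {example_score:.3f}")
```

A containment-based score of this kind tolerates free-form phrasing around the keyword, which is one reason such metrics are used for answers decoded by a language model.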

Abstract

Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video–Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool–tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as…

Code & Models

Repositories

madratak/SurgViVQA (GitHub)

Models

kulsoom-abdullah/surgvivqa-qwen7b-audio (Hugging Face)
