SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Shreyas C. Dhake; Jiayuan Huang; Runlong He; Danyal Z. Khan; Evangelos B. Mazomenos; Sophia Bano; Hani J. Marcus; Danail Stoyanov; Matthew J. Clarkson; Mobarak I. Hoque

arXiv:2511.03178·cs.CV·November 6, 2025

Open Access

TL;DR

This paper introduces PitVQA-Anticipation, a dataset for anticipating future surgical events, and SurgAnt-ViVQA, a video-language model built on GRU-driven temporal cross-attention. Together they target proactive surgical assistance that goes beyond static understanding of the current scene.

Contribution

It presents the first VQA dataset for forward-looking surgical reasoning and proposes SurgAnt-ViVQA, a temporal cross-attention model that improves surgical event anticipation.

Findings

01 SurgAnt-ViVQA outperforms existing image and video baselines.

02 Temporal recurrence and gated fusion significantly improve performance.

03 The number of input frames trades off answer fluency against numeric time estimation.

Abstract

Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video language model that adapts a large language model…
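To make the dataset description above concrete, here is a minimal Python sketch of how one question-answer record and the four anticipation tasks might be represented. Only the four task names and the two headline figures (33.5 hours, 734,769 pairs) come from the abstract. The class names, field names, clip identifiers, and toy records are all hypothetical and are not the dataset's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """The four anticipation tasks named in the abstract."""
    FUTURE_PHASE = "future_phase"
    NEXT_STEP = "next_step"
    UPCOMING_INSTRUMENT = "upcoming_instrument"
    REMAINING_DURATION = "remaining_duration"


@dataclass(frozen=True)
class AnticipationQA:
    """One question-answer pair; every field name here is an assumption."""
    clip_id: str
    task: Task
    question: str
    answer: str


def count_by_task(pairs):
    """Return how many question-answer pairs fall under each task."""
    return Counter(pair.task for pair in pairs)


# Headline figures from the abstract, used for simple arithmetic.
TOTAL_PAIRS = 734_769
TOTAL_HOURS = 33.5
PAIRS_PER_HOUR = TOTAL_PAIRS / TOTAL_HOURS

# Made-up placeholder records, for illustration only.
examples = [
    AnticipationQA("clip_0001", Task.FUTURE_PHASE, "Which phase comes next?", "<phase label>"),
    AnticipationQA("clip_0001", Task.NEXT_STEP, "What is the next step?", "<step label>"),
    AnticipationQA("clip_0002", Task.NEXT_STEP, "What is the next step?", "<step label>"),
    AnticipationQA("clip_0002", Task.REMAINING_DURATION, "How long until the phase ends?", "<duration>"),
]
counts = count_by_task(examples)
```

Dividing the two headline figures gives roughly 21,900 question-answer pairs per hour of video, which indicates how densely the clips are annotated.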

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning