TL;DR
SurgOnAir is a streaming, hierarchy-aware vision-language model that narrates surgical video as frames arrive, capturing workflow transitions and evolving details without offline clip processing.
Contribution
The paper introduces SurgOnAir, a streaming, hierarchy-aware surgical narration model trained on a new dataset, enabling immediate, fine-grained, and hierarchical understanding of surgical procedures.
Findings
Enables instant, fine-grained surgical narration.
Captures and signals key workflow transitions.
Outperforms existing offline methods in real-time understanding.
Abstract
Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated…
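The streaming behavior described in the abstract (frames processed in order, no access to future frames, tokens emitted as input arrives) can be illustrated with a minimal Python sketch. This is not the SurgOnAir implementation: `stream_narration`, `toy_step`, the `Frame` type, and the intensity-threshold rule are all hypothetical stand-ins chosen only to show the causal control flow, with a trivial rule in place of a vision-language model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for an encoded video frame (not from the paper).
Frame = List[float]


def stream_narration(
    frames: Iterable[Frame],
    step: Callable[[List[Frame], List[str]], List[str]],
) -> Iterator[Tuple[int, str]]:
    """Yield (frame_index, token) pairs as frames arrive.

    `step` receives only the frames seen so far and the tokens already
    emitted, so a future frame cannot influence the current output.
    """
    seen: List[Frame] = []
    emitted: List[str] = []
    for index, frame in enumerate(frames):
        seen.append(frame)
        for token in step(seen, emitted):
            emitted.append(token)
            yield index, token


def toy_step(seen: List[Frame], emitted: List[str]) -> List[str]:
    """Toy rule, not a model: emit a token when mean intensity crosses 0.5."""
    current = sum(seen[-1]) / len(seen[-1])
    if len(seen) == 1:
        return ["start"]
    previous = sum(seen[-2]) / len(seen[-2])
    if (previous < 0.5) != (current < 0.5):
        return ["transition"]
    return []  # nothing new to say for this frame


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.2, 0.3], [0.8, 0.9], [0.7, 0.9]]
    for index, token in stream_narration(frames, toy_step):
        print(index, token)
```

Because the generator yields inside the frame loop, a caller sees each token before the next frame is read. Running it on a prefix of the stream gives the same tokens as the corresponding prefix of the full run, which is the property that separates streaming generation from clip-level offline captioning.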
