Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery
Jecia Z.Y. Mao, Francis X. Creighton, Russell H. Taylor, Manish Sahu

TL;DR
This paper presents a speech-guided, video-based surgical system for skull base procedures that offers real-time guidance without external tracking hardware, streamlining workflow while reaching accuracy comparable to commercial optical tracking systems.
Contribution
It introduces a novel speech-interactive framework that performs perception and guidance tasks directly on intraoperative video streams, eliminating the need for external trackers.
Findings
Achieved a mean tool-tip position error of 2.32 mm in experiments.
Completed segmentation and registration within approximately two minutes.
Demonstrated comparable accuracy to commercial optical tracking systems.
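As a reading aid for the tool-tip figure above: a mean position error of this kind is commonly computed as the average Euclidean distance between estimated and reference tip positions. The sketch below assumes that definition; the summary does not state the paper's exact evaluation protocol. The function name `mean_tip_error` and the sample coordinates are hypothetical placeholders, not data from the paper.

```python
import math

def mean_tip_error(estimated, reference):
    """Mean Euclidean distance between paired 3D points (same units as the input)."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("need two non-empty, equally sized point lists")
    total = sum(math.dist(e, r) for e, r in zip(estimated, reference))
    return total / len(estimated)

# Hypothetical placeholder coordinates in mm (NOT measurements from the paper).
estimated = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
reference = [(3.0, 4.0, 0.0), (10.0, 0.0, 1.0)]

# Per-point distances are 5.0 mm and 1.0 mm, so the mean is 3.0 mm.
print(mean_tip_error(estimated, reference))
```

Under this definition, a reported value such as 2.32 mm would be the average over all evaluated tip positions, so individual errors may be larger or smaller.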
Abstract
We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of…
