Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery

Jecia Z.Y. Mao; Francis X. Creighton; Russell H. Taylor; Manish Sahu

arXiv:2603.16024 · cs.CV · April 17, 2026

TL;DR

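This paper presents a speech-guided, video-based surgical system for skull base procedures. It lets surgeons request real-time guidance without external tracking hardware and without disengaging from the operative task, and it reports accuracy comparable to commercial optical tracking systems.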

Contribution

It introduces a novel speech-interactive framework that performs perception and guidance tasks directly on intraoperative video streams, eliminating the need for external trackers.

Findings

01. Achieved a mean tool-tip position error of 2.32 mm in experiments.

02. Completed segmentation and registration within approximately two minutes.

03. Demonstrated comparable accuracy to commercial optical tracking systems.
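
This summary does not define how the tool-tip position error was computed. A common reading is the mean Euclidean distance between each estimated tip position and a paired reference measurement. The sketch below illustrates that assumed metric only: the function name `mean_tip_error_mm` and the sample coordinates are hypothetical, not taken from the paper.

```python
import math


def mean_tip_error_mm(estimated, reference):
    """Mean Euclidean distance (mm) between paired 3D tool-tip positions.

    Assumed metric: the summary does not state the exact definition used.
    """
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("need two non-empty, equally sized point lists")
    distances = [math.dist(e, r) for e, r in zip(estimated, reference)]
    return sum(distances) / len(distances)


# Hypothetical sample positions in millimetres (not data from the paper).
estimated = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
reference = [(3.0, 4.0, 0.0), (10.0, 0.0, 1.0)]
error = mean_tip_error_mm(estimated, reference)
print(f"mean tool-tip error: {error:.2f} mm")
```

Under this reading, a reported value of 2.32 mm would mean the video-based tip estimate sits on average 2.32 mm from the reference position across the evaluated samples.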

Abstract

We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of…
