Voice-Interactive Surgical Agent for Multimodal Patient Data Control
Hyeryun Park, Byung Mo Gu, Jun Hee Lee, Byeong Hyeon Choi, Sekeun Kim, Hyun Koo Kim, Kyungsang Kim

TL;DR
This paper introduces VISA, a voice-interactive system for robotic surgery that uses large language models to interpret commands and manipulate multimodal patient data without disrupting surgical workflow.
Contribution
The paper presents a hierarchical multi-agent framework powered by LLMs for voice-controlled surgical data management, including a new dataset and evaluation metric.
Findings
VISA achieves high accuracy in command execution.
The system effectively handles transcription errors and ambiguous language.
VISA demonstrates robustness and scalability in surgical scenarios.
Abstract
In robotic surgery, surgeons fully engage their hands and visual attention in procedures, making it difficult to access and manipulate multimodal patient data without interrupting the workflow. To overcome this problem, we propose a Voice-Interactive Surgical Agent (VISA) built on a hierarchical multi-agent framework consisting of an orchestration agent and three task-specific agents driven by Large Language Models (LLMs). These LLM-based agents autonomously plan, refine, validate, and reason to interpret voice commands and execute tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models within surgical video. We construct a dataset of 240 user commands organized into hierarchical categories and introduce the Multi-level Orchestration Evaluation Metric (MOEM) that evaluates the performance and robustness at both the command and category…
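The hierarchy described in the abstract (one orchestration agent dispatching to three task-specific agents) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the agent names, the `Result` type, and the `KEYWORDS` table are hypothetical, and a simple keyword match stands in for the LLM-based planning, validation, and reasoning that VISA actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Result:
    agent: str    # which agent handled the command
    message: str  # human-readable description of the action taken


# Hypothetical task-specific agents, one per task family named in the abstract.
def clinical_info_agent(command: str) -> Result:
    return Result("clinical_info", f"retrieve clinical information: {command}")


def ct_scan_agent(command: str) -> Result:
    return Result("ct_scan", f"manipulate CT scan: {command}")


def model_3d_agent(command: str) -> Result:
    return Result("model_3d", f"navigate 3D anatomical model: {command}")


AGENTS: Dict[str, Callable[[str], Result]] = {
    "clinical_info": clinical_info_agent,
    "ct_scan": ct_scan_agent,
    "model_3d": model_3d_agent,
}

# Assumption: a keyword table replaces the LLM's interpretation step.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "clinical_info": ("lab", "history", "record"),
    "ct_scan": ("ct", "slice", "scan"),
    "model_3d": ("3d", "model", "rotate"),
}


def orchestrate(command: str) -> Result:
    """Route a transcribed voice command to the first matching agent."""
    tokens = command.lower().split()
    for name, words in KEYWORDS.items():
        if any(word in tokens for word in words):
            return AGENTS[name](command)
    # No agent matched: fall back instead of guessing.
    return Result("orchestrator", "command not understood; ask the user to repeat")


if __name__ == "__main__":
    for spoken in ("show the next ct slice", "rotate the 3d model left"):
        print(orchestrate(spoken))
```

The fallback branch reflects one possible design choice: when no agent matches, the orchestrator asks for clarification rather than executing a guessed action, which seems prudent in a surgical setting.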
Taxonomy
Topics: Surgical Simulation and Training · Soft Robotics and Applications · Multimodal Machine Learning Applications
