Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis
Kubilay Can Demir, Belen Lojo Rodriguez, Tobias Weise, Andreas Maier, Seung Hee Yang

TL;DR
This paper introduces a multimodal framework that combines speech and image data for surgical phase recognition in port-catheter placement operations, improving frame-wise accuracy and F1-score by approximately 10% over previous work.
Contribution
The study presents a novel multimodal approach using GMU and MS-TCN for surgical workflow analysis, demonstrating enhanced accuracy and effectiveness.
Findings
Achieved 92.65% frame-wise accuracy and 92.30% F1-score.
Approximately 10% improvement in both metrics over previous work.
Validated the benefit of multimodal data integration.
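The two reported metrics can be illustrated with a minimal sketch. It computes frame-wise accuracy and an F1-score per operation, then summarizes them as mean and standard deviation across operations. The macro averaging of F1, the use of the sample standard deviation, and the toy labels are all assumptions made for illustration; the summary above does not state how the paper computes them.

```python
from statistics import mean, stdev

def frame_accuracy(y_true, y_pred):
    """Fraction of frames whose predicted phase matches the annotated phase."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-phase F1 (assumed averaging, not confirmed by the source)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # every label in `labels` occurs in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return mean(scores)

# Hypothetical toy data: two short "operations" with three phases (0, 1, 2).
operations = [
    ([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]),
    ([0, 1, 1, 2], [0, 1, 1, 2]),
]

accs = [frame_accuracy(t, p) for t, p in operations]
f1s = [macro_f1(t, p) for t, p in operations]

# Report as mean and sample standard deviation across operations.
print(f"accuracy: {100 * mean(accs):.2f} +/- {100 * stdev(accs):.2f} %")
print(f"F1-score: {100 * mean(f1s):.2f} +/- {100 * stdev(f1s):.2f} %")
```

Averaging per operation rather than pooling all frames keeps long operations from dominating the score, which is one plausible reason to report a spread alongside the mean.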
Abstract
To develop intelligent speech assistants and integrate them seamlessly with intra-operative decision-support frameworks, accurate and efficient surgical phase recognition is a prerequisite. In this study, we propose a multimodal framework based on Gated Multimodal Units (GMU) and Multi-Stage Temporal Convolutional Networks (MS-TCN) to recognize surgical phases of port-catheter placement operations. Our method merges speech and image models and uses them separately in different surgical phases. Based on the evaluation of 28 operations, we report a frame-wise accuracy of 92.65 ± 3.52% and an F1-score of 92.30 ± 3.82%. Our results show approximately 10% improvement in both metrics over previous work and validate the effectiveness of integrating multimodal data for the surgical phase recognition task. We further investigate the contribution of individual data channels by comparing…
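As a rough illustration of the fusion idea, the following is a minimal sketch of a bimodal Gated Multimodal Unit in its standard published form: each modality is projected through a tanh layer, and a sigmoid gate computed from the concatenated inputs decides, per hidden dimension, how much each modality contributes. The feature sizes, random weights, and the names `gmu_fuse`, `x_speech`, and `x_image` are hypothetical; the paper's actual architecture, dimensions, and training setup are not described in this summary.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gmu_fuse(x_speech, x_image, W_s, W_i, W_z):
    """Bimodal GMU: h = z * h_s + (1 - z) * h_i, with a learned gate z."""
    h_s = [math.tanh(a) for a in matvec(W_s, x_speech)]   # speech projection
    h_i = [math.tanh(a) for a in matvec(W_i, x_image)]    # image projection
    # The gate sees both modalities (concatenated input vectors).
    z = [sigmoid(a) for a in matvec(W_z, x_speech + x_image)]
    fused = [zk * hs + (1.0 - zk) * hi for zk, hs, hi in zip(z, h_s, h_i)]
    return fused, z

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
d_speech, d_image, d_hidden = 4, 6, 3
W_s = random_matrix(d_hidden, d_speech, rng)
W_i = random_matrix(d_hidden, d_image, rng)
W_z = random_matrix(d_hidden, d_speech + d_image, rng)

x_speech = [rng.uniform(-1, 1) for _ in range(d_speech)]
x_image = [rng.uniform(-1, 1) for _ in range(d_image)]

fused, gate = gmu_fuse(x_speech, x_image, W_s, W_i, W_z)
print("gate :", [round(g, 3) for g in gate])
print("fused:", [round(f, 3) for f in fused])
```

In a pipeline like the one the abstract describes, a fused per-frame vector of this kind would be passed to a temporal model such as MS-TCN. Gate values near 1 weight the speech features and values near 0 weight the image features.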
Taxonomy
Topics: Surgical Simulation and Training
