Specialized Foundation Models for Intelligent Operating Rooms
Ege Özsoy, Chantal Pellegrini, David Bani-Harouni, Kun Yuan, Matthias Keicher, Nassir Navab

TL;DR
This paper introduces ORQA, a multimodal foundation model for surgical understanding that unifies visual, auditory, and structured data, significantly improving performance over generalist models in complex OR environments.
Contribution
The paper presents ORQA, a novel multimodal foundation model tailored for surgical settings, with a question-answering framework and smaller variants for diverse deployment needs.
Findings
ORQA outperforms general vision-language models in surgical scene perception.
Benchmark results show ORQA's superior accuracy across multiple surgical tasks.
Smaller ORQA models maintain strong performance with reduced computational requirements.
Abstract
Surgical procedures unfold in complex environments demanding coordination between surgical teams, tools, imaging, and, increasingly, intelligent robotic systems. Ensuring safety and efficiency in the ORs of the future requires intelligent systems, such as surgical robots, smart instruments, and digital copilots, capable of understanding the complex activities and hazards of surgery. Yet existing computational approaches lack the breadth and generalization needed for comprehensive OR understanding. We introduce ORQA, a multimodal foundation model unifying visual, auditory, and structured data for holistic surgical understanding. ORQA's question-answering framework empowers diverse tasks, serving as an intelligence core for a broad spectrum of surgical technologies. We benchmark ORQA against generalist vision-language models, including ChatGPT and Gemini, and show that while they struggle to…
Taxonomy
Topics: Surgical Simulation and Training · Cardiac, Anesthesia and Surgical Outcomes
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation
