Specialized Foundation Models for Intelligent Operating Rooms

Ege Özsoy; Chantal Pellegrini; David Bani-Harouni; Kun Yuan; Matthias Keicher; Nassir Navab

arXiv:2505.12890·cs.CV·July 8, 2025

TL;DR

This paper introduces ORQA, a multimodal foundation model for surgical understanding that unifies visual, auditory, and structured data, significantly improving performance over generalist models in complex OR environments.

Contribution

The paper presents ORQA, a novel multimodal foundation model tailored for surgical settings, with a question-answering framework and smaller variants for diverse deployment needs.

Findings

01. ORQA outperforms general vision-language models in surgical scene perception.
02. Benchmark results show ORQA's superior accuracy across multiple surgical tasks.
03. Smaller ORQA models maintain strong performance with reduced computational requirements.

Abstract

Surgical procedures unfold in complex environments demanding coordination between surgical teams, tools, imaging, and, increasingly, intelligent robotic systems. Ensuring safety and efficiency in the ORs of the future requires intelligent systems, such as surgical robots, smart instruments, and digital copilots, capable of understanding the complex activities and hazards of surgeries. Yet existing computational approaches lack the breadth and generalization needed for comprehensive OR understanding. We introduce ORQA, a multimodal foundation model unifying visual, auditory, and structured data for holistic surgical understanding. ORQA's question-answering framework empowers diverse tasks, serving as an intelligence core for a broad spectrum of surgical technologies. We benchmark ORQA against generalist vision-language models, including ChatGPT and Gemini, and show that while they struggle to…

Taxonomy

Topics: Surgical Simulation and Training · Cardiac, Anesthesia and Surgical Outcomes

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation