CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
Yuxin He, An Li, Cheng Xue

TL;DR
CauCLIP is a causality-inspired vision-language model that improves surgical video understanding by learning domain-invariant features, bridging the gap between synthetic and real data without access to target-domain data.
Contribution
The paper presents a causality-inspired framework that combines CLIP with a frequency-based augmentation strategy and a causal suppression loss to improve domain generalization in surgical phase recognition.
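For readers unfamiliar with how CLIP-style models are commonly used for classification, the sketch below scores one visual embedding against a set of text-prompt embeddings with cosine similarity and a temperature-scaled softmax. It is an illustration only, not the paper's implementation: the function name `classify_by_similarity`, the toy embeddings, and the temperature value are all assumptions, and the real image and text encoders are replaced by placeholder vectors.

```python
import numpy as np

def classify_by_similarity(visual_emb, text_embs, temperature=0.07):
    """Return class probabilities from cosine similarity (hypothetical helper).

    visual_emb: array of shape (D,), e.g. a pooled clip or frame embedding.
    text_embs:  array of shape (K, D), one embedding per class prompt.
    temperature: softmax temperature (assumed value, not taken from the paper).
    """
    # L2-normalize so the dot product equals cosine similarity.
    v = visual_emb / np.linalg.norm(visual_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (t @ v) / temperature
    # Subtract the max logit for a numerically stable softmax.
    logits = logits - logits.max()
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Toy placeholder embeddings standing in for encoder outputs.
visual = np.array([1.0, 0.0, 0.0])
prompts = np.array([
    [0.9, 0.1, 0.0],  # stand-in for the prompt of class 0
    [0.0, 1.0, 0.0],  # stand-in for the prompt of class 1
    [0.0, 0.0, 1.0],  # stand-in for the prompt of class 2
])
probs = classify_by_similarity(visual, prompts)
predicted_class = int(np.argmax(probs))
```

In this toy setup the visual vector is most aligned with the first prompt, so that class receives the highest probability. How CauCLIP builds its prompts and aggregates frames is not described on this page.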
Findings
Outperforms existing methods on the SurgVisDom benchmark
Effectively reduces domain bias in surgical videos
Enhances robustness in real-world surgical scenarios
Abstract
Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom…
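The abstract does not give the details of the frequency-based augmentation, so the following is only a minimal sketch of one well-known technique in that family: blending the Fourier amplitude spectrum of an image with that of a reference image while keeping the original phase. The intuition is that amplitude tends to carry low-level appearance (style, lighting) and phase tends to carry structure. The function name `amplitude_mix`, the `alpha` parameter, and the choice of amplitude blending itself are assumptions, not the authors' confirmed method.

```python
import numpy as np

def amplitude_mix(image, reference, alpha=0.5):
    """Blend Fourier amplitudes of two images, keeping the phase of `image`.

    image, reference: float arrays of shape (H, W, C) with matching shapes.
    alpha: blend weight in [0, 1]; 0 returns `image` unchanged (assumed knob).
    """
    # 2D FFT over the spatial axes, applied independently per channel.
    f_img = np.fft.fft2(image, axes=(0, 1))
    f_ref = np.fft.fft2(reference, axes=(0, 1))

    # Split the source spectrum into amplitude and phase.
    amp_img, phase_img = np.abs(f_img), np.angle(f_img)
    amp_ref = np.abs(f_ref)

    # Perturb only the amplitude; the phase of the source image is preserved.
    amp_mixed = (1.0 - alpha) * amp_img + alpha * amp_ref
    f_mixed = amp_mixed * np.exp(1j * phase_img)

    # Back to the spatial domain; the imaginary part is numerical noise.
    return np.real(np.fft.ifft2(f_mixed, axes=(0, 1)))

# Usage with random placeholder data standing in for video frames.
rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
other = rng.random((8, 8, 3))
augmented = amplitude_mix(frame, other, alpha=0.5)
```

A training pipeline would typically sample `alpha` and the reference image at random per example and clip the result to the valid pixel range; those choices are left out here because the page does not specify them.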
Taxonomy
Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
