CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

Yuxin He; An Li; Cheng Xue

arXiv:2602.06619 · cs.CV · February 9, 2026

Open Access

TL;DR

CauCLIP is a causality-inspired vision-language model for surgical video understanding. It learns domain-invariant features to narrow the gap between synthetic and real data without access to target-domain data.

Contribution

The paper presents a causality-inspired framework that combines CLIP with a frequency-based augmentation strategy and a causal suppression loss to improve domain generalization in surgical phase recognition.

Findings

01 Outperforms existing methods on the SurgVisDom benchmark

02 Effectively reduces domain bias in surgical videos

03 Enhances robustness in real-world surgical scenarios

Abstract

Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom…
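
The abstract does not say how the frequency-based augmentation is implemented. One common way to perturb domain-specific appearance while keeping semantic structure is Fourier amplitude mixing: the amplitude spectrum of a frame is blended with that of another image, and the phase spectrum, which carries most of the spatial layout, is left unchanged. The sketch below illustrates that general idea only; it is an assumption, not the authors' method, and the function name `amplitude_mix` and the blend weight `lam` are hypothetical.

```python
import numpy as np


def amplitude_mix(image, reference, lam=0.5):
    """Blend Fourier amplitudes of two images, keeping the phase of `image`.

    Hypothetical helper for illustration; not taken from the CauCLIP paper.
    `image` and `reference` are float arrays of identical shape (H, W, C).
    `lam` in [0, 1] sets how much of the reference amplitude is mixed in.
    """
    # 2D FFT over the spatial axes; each channel is transformed independently.
    f_img = np.fft.fft2(image, axes=(0, 1))
    f_ref = np.fft.fft2(reference, axes=(0, 1))

    # Amplitude is treated here as the domain-specific part (color, texture),
    # phase as the part that carries spatial structure.
    amp_img, phase_img = np.abs(f_img), np.angle(f_img)
    amp_ref = np.abs(f_ref)

    # Linear blend of the two amplitude spectra.
    amp_mixed = (1.0 - lam) * amp_img + lam * amp_ref

    # Recombine the blended amplitude with the original phase and invert.
    f_mixed = amp_mixed * np.exp(1j * phase_img)
    return np.real(np.fft.ifft2(f_mixed, axes=(0, 1)))
```

With `lam=0` the function returns the input frame unchanged (up to floating-point error); larger values shift its appearance further toward the reference image while the phase, and so the layout, stays that of the original frame.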

Taxonomy

Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning