CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition
Florian Stilz, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy

TL;DR
CliPPER is a video-language pretraining framework for long-form intraoperative surgical videos. It improves event recognition through new multimodal alignment objectives and achieves state-of-the-art results on surgical benchmarks.
Contribution
Introduces CliPPER, a pretraining approach with new objectives for fine-grained temporal video-text understanding in surgical videos, targeting the scarcity of labeled data and the precise temporal understanding that complex downstream tasks require.
Findings
Achieves state-of-the-art performance on surgical benchmarks
Effective zero-shot recognition of surgical phases and instruments
Improves multimodal alignment with new training objectives
Abstract
Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
