CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

Florian Stilz, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy

arXiv:2603.24539·cs.CV·March 26, 2026

Open Access

TL;DR

CliPPER is a video-language pretraining framework for long-form intraoperative surgical videos. It improves event recognition through new multimodal alignment objectives and achieves state-of-the-art results on surgical benchmarks.

Contribution

Introduces CliPPER, a new pretraining approach with novel objectives for fine-grained temporal video-text understanding in surgical videos, addressing data scarcity and complex task requirements.

Findings

01 Achieves state-of-the-art performance on surgical benchmarks

02 Effective zero-shot recognition of surgical phases and instruments

03 Improves multimodal alignment with new training objectives

Abstract

Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which…
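
The abstract names two pretraining objectives, but the excerpt above is cut off before it defines them. The sketch below is therefore only an illustration of the general ideas those names usually refer to, not the authors' formulation. It assumes a standard symmetric InfoNCE loss for video-text contrastive learning (without the contextual component of VTC_CTX, which is not described here) and a simple shuffle-and-predict setup for clip order prediction. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_rows(logits, targets):
    """Mean cross-entropy; logits[i] is a row of scores, targets[i] the correct column."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
    return total / len(logits)


def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (assumption: not the paper's VTC_CTX).

    video_emb[i] and text_emb[i] are assumed to form a matching pair;
    every other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # Cosine similarity matrix scaled by the temperature: sim[i][j] = v_i . t_j / tau
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    targets = list(range(len(v)))
    return 0.5 * (cross_entropy_rows(sim, targets) + cross_entropy_rows(sim_t, targets))


def make_clip_order_example(clips, rng):
    """Shuffle clips and return (shuffled_clips, permutation).

    Hypothetical setup: a model would see shuffled_clips and be trained to
    predict the permutation, where shuffled_clips[k] == clips[permutation[k]].
    """
    permutation = list(range(len(clips)))
    rng.shuffle(permutation)
    return [clips[i] for i in permutation], permutation


# Toy data: two video clips and two captions in a 2-D embedding space.
aligned_loss = video_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = video_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[0.0, 1.0], [1.0, 0.0]])
shuffled, order = make_clip_order_example(["clip_a", "clip_b", "clip_c"],
                                          random.Random(0))
```

With these toy inputs the loss is close to zero when each clip is paired with its own caption and large when the captions are swapped, which is the behavior a contrastive objective is meant to reward.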

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning