SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

Darshan Singh S; Zeeshan Khan; Makarand Tapaswi

arXiv:2401.07669·cs.CV·April 28, 2026·2 cites

TL;DR

This paper introduces SRL-CLIP, an efficient method for adapting CLIP to video understanding by using structured semantic role labels and a small dataset, achieving strong zero-shot performance.

Contribution

The authors propose a novel approach using semantic role labels and rule-based captions to adapt CLIP for holistic video understanding with minimal data and training.
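To make "rule-based captions" concrete, here is a minimal sketch of turning one structured semantic-role frame into a sentence with a fixed template. The role names (`agent`, `verb`, `patient`, `manner`, `location`), the word order, and the sample frame are all hypothetical illustrations; the paper's actual rules and label schema are not given on this page.

```python
def srl_to_caption(frame):
    """Flatten one semantic-role frame into a caption using a fixed word order.

    The role keys below are hypothetical, chosen only for illustration.
    """
    parts = []
    if frame.get("agent"):
        parts.append(frame["agent"])
    parts.append(frame["verb"])  # the action is the only required role here
    for role in ("patient", "manner", "location"):
        if frame.get(role):  # skip roles that are absent or empty
            parts.append(frame[role])
    return " ".join(parts)


# Made-up example frame, not taken from any dataset.
frame = {
    "agent": "a man in a red jacket",
    "verb": "throws",
    "patient": "a ball",
    "manner": "forcefully",
    "location": "in a park",
}
print(srl_to_caption(frame))
# prints: a man in a red jacket throws a ball forcefully in a park
```

A template like this packs every labeled role into the caption, which is one way the text side could carry a denser learning signal than a free-form narration.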

Findings

01. SRL-CLIP achieves comparable or better performance than larger models.

02. Contrastive finetuning on only 23k video-caption pairs is sufficient.

03. SRL-CLIP outperforms CLIP on multiple video benchmarks.

Abstract

Adapting CLIP for videos has gained popularity due to its semantic and rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g. HowTo100M, WebVid2.5M). However, such narrations or captions often lack comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, the visual learning is inefficient and adaptation requires millions of samples to post-pretrain. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based…
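For readers unfamiliar with the contrastive finetuning the abstract refers to, the following is a minimal NumPy sketch of the standard CLIP-style symmetric contrastive (InfoNCE) objective over a batch of matched video and caption embeddings. It is an assumption that SRL-CLIP uses this exact form; the function name, the temperature value of 0.07, and the random embeddings are illustrative and not taken from the paper.

```python
import numpy as np


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss; row i of each input is a matched pair."""
    # L2-normalise so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # shape (B, B); diagonal holds matched pairs

    def cross_entropy_diag(lg):
        # Numerically stable log-softmax per row, then score the diagonal entries.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


# Illustrative random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
aligned = symmetric_contrastive_loss(x, x)                  # captions match videos
shuffled = symmetric_contrastive_loss(x, x[[1, 2, 3, 0]])   # captions mismatched
print(aligned < shuffled)
```

The loss is low when each video is most similar to its own caption and high when pairs are mismatched, which is the signal that post-pretraining uses to adapt both encoders.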
