SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels
Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

TL;DR
This paper introduces SRL-CLIP, an efficient method for adapting CLIP to video understanding by using structured semantic role labels and a small dataset, achieving strong zero-shot performance.
Contribution
The authors propose a novel approach using semantic role labels and rule-based captions to adapt CLIP for holistic video understanding with minimal data and training.
Findings
SRL-CLIP achieves performance comparable to or better than that of larger models.
Contrastive finetuning on only 23k video-caption pairs is sufficient to adapt CLIP for video understanding.
SRL-CLIP outperforms CLIP on multiple video benchmarks.
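To make "contrastive finetuning" concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired video and caption embeddings. This summary does not state the paper's exact objective, so the function name `symmetric_contrastive_loss`, the temperature value, and the use of pre-computed embeddings are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style loss for a batch where row i of each array is a matched pair.

    video_emb, text_emb: arrays of shape (batch, dim). Hypothetical inputs;
    in practice they would come from the video and text encoders.
    temperature: illustrative value, not taken from the paper.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # shape (batch, batch)
    idx = np.arange(len(v))
    # Video-to-text: each row should put its mass on the diagonal entry.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: the same, taken over columns.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = symmetric_contrastive_loss(emb, emb)
# Rolling the captions by one position breaks every pairing.
shuffled_loss = symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

With correctly paired embeddings the loss is lower than with mismatched ones, and this gap is the signal that finetuning exploits.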
Abstract
Adapting CLIP for videos has gained popularity due to its semantic and rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g. HowTo100M, WebVid2.5M). However, such narrations or captions often lack comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, the visual learning is inefficient and adaptation requires millions of samples to post-pretrain. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based…
