OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning
Mushui Liu, Bozheng Li, Yunlong Yu

TL;DR
OmniCLIP adapts CLIP to video recognition by learning spatial, temporal, and dynamic spatial-temporal features through two new modules, improving accuracy in supervised, few-shot, and zero-shot settings.
Contribution
The paper introduces OmniCLIP, a framework that adapts CLIP for video recognition by learning spatial, temporal, and dynamic spatial-temporal (omni-scale) features. It does so with two new modules: parallel temporal adapters (PTA) and a self-prompt generator (SPG).
Findings
Achieves 74.30% top-1 accuracy on HMDB51 in the 16-shot setting
Surpasses the recent MotionPrompt approach on HMDB51, even when MotionPrompt uses the full training data
Effective in supervised, few-shot, and zero-shot video recognition tasks
Abstract
Recent Vision-Language Models (VLMs), e.g., CLIP, have made great progress in video recognition. Despite the improvement brought by the strong visual backbone in extracting spatial features, CLIP still falls short in capturing and integrating spatial-temporal features, which are essential for video recognition. In this paper, we propose OmniCLIP, a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales, which we refer to as omni-scale features. This is achieved through the design of spatial-temporal blocks that include parallel temporal adapters (PTA), enabling efficient temporal modeling. Additionally, we introduce a self-prompt generator (SPG) module to capture dynamic object spatial features. The synergy between PTA and SPG allows OmniCLIP to discern varying spatial…
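The abstract does not spell out the internals of the parallel temporal adapter, so the following is only a minimal sketch of the general idea of an adapter that runs in parallel with a frozen block and mixes information across frames. Every design detail here is an assumption made for illustration, not the authors' implementation: the bottleneck down/up projections, the neighbour-frame averaging used as the temporal operator, the ReLU, and all function and parameter names (`temporal_adapter`, `parallel_block`, `w_down`, `w_up`, `scale`) are hypothetical. Frames are represented as plain lists of per-frame feature vectors.

```python
def matmul(rows, weight):
    """Multiply a [T][D_in] list of rows by a [D_in][D_out] weight matrix."""
    return [[sum(r[i] * weight[i][j] for i in range(len(weight)))
             for j in range(len(weight[0]))] for r in rows]


def temporal_mix(rows):
    """Average each frame with its neighbours; edges reuse the boundary frame.

    Assumption: a stand-in temporal operator, not the one used in the paper.
    """
    t = len(rows)
    mixed = []
    for k in range(t):
        prev_row = rows[max(k - 1, 0)]
        next_row = rows[min(k + 1, t - 1)]
        mixed.append([(a + b + c) / 3.0
                      for a, b, c in zip(prev_row, rows[k], next_row)])
    return mixed


def temporal_adapter(frames, w_down, w_up):
    """Hypothetical bottleneck adapter: project down, mix over time, project up."""
    hidden = matmul(frames, w_down)                          # [T][r]
    hidden = temporal_mix(hidden)                            # exchange info across frames
    hidden = [[max(v, 0.0) for v in row] for row in hidden]  # ReLU
    return matmul(hidden, w_up)                              # back to [T][D]


def parallel_block(frames, frozen_block, w_down, w_up, scale=1.0):
    """Add the adapter branch to the output of a frozen (untrained) block."""
    main = frozen_block(frames)
    side = temporal_adapter(frames, w_down, w_up)
    return [[m + scale * s for m, s in zip(main_row, side_row)]
            for main_row, side_row in zip(main, side)]


# Toy usage: 3 frames, feature dimension 2, bottleneck width 1.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity_block = lambda x: [row[:] for row in x]  # stand-in for a frozen CLIP layer
w_down = [[1.0], [1.0]]
w_up_zero = [[0.0, 0.0]]  # zero init: the adapter branch contributes nothing

unchanged = parallel_block(frames, identity_block, w_down, w_up_zero)
adapted = parallel_block(frames, identity_block, w_down, [[1.0, 0.0]])
```

With a zero-initialised up-projection the block reproduces the frozen output exactly, which is a common way to start adapter training without disturbing the pretrained model; only the small `w_down` and `w_up` matrices would be trained in such a setup.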
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques
Methods: Contrastive Language-Image Pre-training
