OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning

Mushui Liu; Bozheng Li; Yunlong Yu

arXiv:2408.06158·cs.CV·August 13, 2024

TL;DR

OmniCLIP adapts CLIP to video recognition by learning spatial, temporal, and dynamic spatial-temporal (omni-scale) features, improving accuracy in supervised, few-shot, and zero-shot settings.

Contribution

The paper introduces OmniCLIP, a framework that adapts CLIP for video recognition by integrating spatial, temporal, and dynamic spatial-temporal features through spatial-temporal blocks with parallel temporal adapters (PTA) and a self-prompt generator (SPG).

Findings

01 Achieves 74.30% top-1 accuracy on HMDB51 in the 16-shot setting

02 Outperforms recent methods such as MotionPrompt when trained with full training data

03 Effective in supervised, few-shot, and zero-shot video recognition tasks

Abstract

Recent Vision-Language Models (VLMs), e.g., CLIP, have made great progress in video recognition. Despite the improvement brought by the strong visual backbone in extracting spatial features, CLIP still falls short in capturing and integrating spatial-temporal features, which are essential for video recognition. In this paper, we propose OmniCLIP, a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales, which we refer to as omni-scale features. This is achieved through the design of spatial-temporal blocks that include parallel temporal adapters (PTA), enabling efficient temporal modeling. Additionally, we introduce a self-prompt generator (SPG) module to capture dynamic object spatial features. The synergy between PTA and SPG allows OmniCLIP to discern varying spatial…
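The abstract does not spell out how a parallel temporal adapter is built, so the following is only a minimal sketch of the general idea of a parallel adapter: a small temporal branch runs alongside a frozen per-frame block and its output is added to the block's output. The function names (`temporal_mix`, `parallel_temporal_adapter`), the fixed 1D kernel, and the `scale` parameter are all assumptions made for illustration and are not taken from the paper or its repository.

```python
from typing import Callable, List

Vector = List[float]


def temporal_mix(frames: List[Vector], kernel: List[float]) -> List[Vector]:
    """Mix each frame feature with its temporal neighbours (zero padding at the ends).

    Hypothetical stand-in for a learned temporal operator; the kernel is fixed here.
    """
    num_frames, dim = len(frames), len(frames[0])
    half = len(kernel) // 2
    mixed_frames = []
    for i in range(num_frames):
        mixed = [0.0] * dim
        for k, weight in enumerate(kernel):
            j = i + k - half  # index of the neighbouring frame
            if 0 <= j < num_frames:
                for c in range(dim):
                    mixed[c] += weight * frames[j][c]
        mixed_frames.append(mixed)
    return mixed_frames


def parallel_temporal_adapter(
    frames: List[Vector],
    frozen_block: Callable[[Vector], Vector],
    kernel: List[float],
    scale: float = 1.0,
) -> List[Vector]:
    """Return frozen_block(x_t) + scale * temporal_mix(x)_t for every frame t.

    Assumption: the spatial branch sees each frame independently, and the
    temporal branch is added in parallel rather than inserted in sequence.
    """
    spatial = [frozen_block(frame) for frame in frames]
    temporal = temporal_mix(frames, kernel)
    return [
        [s + scale * m for s, m in zip(spatial_vec, temporal_vec)]
        for spatial_vec, temporal_vec in zip(spatial, temporal)
    ]


# Usage: three frames with 2-dimensional features and an identity "frozen" block.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = lambda v: list(v)
adapted = parallel_temporal_adapter(frames, identity, kernel=[0.25, 0.5, 0.25])
unchanged = parallel_temporal_adapter(frames, identity, kernel=[0.0, 0.0, 0.0])
```

With an all-zero kernel the temporal branch contributes nothing, so the output equals the frozen block's output. This mirrors a common reason for preferring parallel adapters: training can start from the pre-trained model's behavior.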

Code & Models

Repositories

xiaobul/omniclip (Official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques

Methods: Contrastive Language-Image Pre-training