Frozen CLIP Models are Efficient Video Learners
Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

TL;DR
This paper introduces an efficient method for video recognition that leverages frozen CLIP image features and a lightweight Transformer decoder to learn high-quality video representations without fine-tuning the backbone.
Contribution
The paper proposes a novel framework called Efficient Video Learning (EVL) that uses frozen CLIP features and a lightweight decoder for effective video recognition.
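The core idea (a frozen image encoder producing per-frame features, with only a small decoder trained on top) can be illustrated with a minimal sketch. This is not the authors' implementation: `frozen_frame_encoder` is a hypothetical stand-in for a CLIP image encoder, and `cross_attention_pool` is a single-head, single-layer simplification of a Transformer decoder in which one query vector attends over the frame features. The feature dimension, frame count, and query values are arbitrary assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def frozen_frame_encoder(frame_id, dim=8):
    """Hypothetical stand-in for a frozen image backbone.

    It is deterministic and has no trainable state, mirroring the fact
    that the backbone is never updated during video training.
    """
    rng = random.Random(frame_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def cross_attention_pool(query, frame_features):
    """Single-head cross-attention: one query attends over all frames.

    Returns the attention-weighted sum of frame features (a video-level
    representation) and the attention weights over frames.
    """
    dim = len(query)
    scores = [dot(query, f) / math.sqrt(dim) for f in frame_features]
    weights = softmax(scores)
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frame_features))
        for i in range(dim)
    ]
    return pooled, weights


# Four frames of a toy "video", identified by integer ids (assumption).
features = [frozen_frame_encoder(t) for t in range(4)]

# In a real model the query would be a learned parameter; here it is fixed.
query = [0.1] * 8

video_repr, attn = cross_attention_pool(query, features)
```

In this sketch only `query` would carry gradients during training, which is what makes the approach cheap relative to fine-tuning the backbone. The real decoder is deeper and uses multi-head attention plus temporal modules, none of which are modeled here.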
Findings
Achieves high accuracy on multiple video datasets.
Reduces training resources compared to traditional fine-tuning.
Demonstrates effective temporal feature extraction with local modules.
Abstract
Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization
