Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization
Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang

TL;DR
Open-VCLIP turns CLIP into a zero-shot video classifier that can recognize unseen actions. It extends CLIP with minimal modifications and trains it with a novel weight interpolation method, achieving state-of-the-art results.
Contribution
The paper introduces Open-VCLIP, a novel approach that adapts CLIP for zero-shot video recognition with minimal changes and a new weight interpolation technique for continual learning.
Findings
Achieves 87.9% zero-shot accuracy on the UCF dataset.
Outperforms state-of-the-art methods by up to 12.2%.
Demonstrates effective generalization to unseen video actions.
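To make "zero-shot" concrete: a CLIP-style classifier typically picks the class whose text embedding is most similar to the visual embedding, so new action names can be added at test time without retraining. The sketch below illustrates that general idea only. The function names, the toy 3-dimensional embeddings, and the label set are assumptions made for illustration and are not taken from the paper or its code.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(video_emb: List[float],
                       text_embs: Dict[str, List[float]]) -> str:
    """Return the label whose text embedding is closest to the video embedding."""
    return max(text_embs, key=lambda label: cosine(video_emb, text_embs[label]))


# Toy embeddings (hypothetical values); a real model would produce these
# from a video encoder and a text encoder.
text_embs = {
    "archery": [1.0, 0.0, 0.0],
    "surfing": [0.0, 1.0, 0.0],
}
video_emb = [0.2, 0.9, 0.1]

predicted = zero_shot_classify(video_emb, text_embs)
```

Because the label set is just a dictionary of text embeddings, an unseen action can be supported by adding one more entry.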
Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive zero-shot learning abilities for image understanding, yet limited effort has been made to investigate CLIP for zero-shot video recognition. We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier that can recognize unseen actions and events at test time. Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier, while striving for generalization. We formally show that training an Open-VCLIP is equivalent to continual learning with zero historical data. To address this problem, we propose Interpolated Weight Optimization, which utilizes the benefit of weight interpolation in both training and test time. We evaluate our method on three popular and challenging action…
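The abstract names weight interpolation but the excerpt does not spell out the procedure, so the following is only a generic sketch of linear interpolation between two sets of model weights, for example the original CLIP weights and weights fine-tuned on video. The function name `interpolate_weights`, the coefficient `lam`, and the toy parameter dictionaries are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, tuned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * base + lam * tuned for every parameter.

    lam = 0 keeps the base weights; lam = 1 keeps the tuned weights.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if base.keys() != tuned.keys():
        raise KeyError("both weight sets must have the same parameter names")
    mixed: Weights = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = [(1.0 - lam) * b + lam * t
                       for b, t in zip(base_values, tuned_values)]
    return mixed


# Toy weights (hypothetical values) standing in for real model parameters.
clip_weights = {"proj": [1.0, 2.0], "bias": [0.0]}
video_weights = {"proj": [3.0, 6.0], "bias": [1.0]}

mixed = interpolate_weights(clip_weights, video_weights, 0.5)
```

Intuitively, a coefficient closer to 0 stays nearer the original model and its general knowledge, while a coefficient closer to 1 favors the video-specialized weights.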
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Test · Contrastive Language-Image Pre-training
