Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang

TL;DR
This paper introduces Open-VCLIP++, a framework that adapts CLIP for zero-shot video recognition by capturing spatial-temporal relationships and leveraging large language models, achieving state-of-the-art results across multiple datasets.
Contribution
The paper proposes Open-VCLIP++, a novel adaptation of CLIP for zero-shot video classification, incorporating weight interpolation and fine-grained video descriptions for improved generalization.
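To make the zero-shot setting concrete, the sketch below shows CLIP-style classification of a video against text labels by cosine similarity. It is a simplified illustration, not the paper's architecture: it assumes per-frame embeddings are simply mean-pooled into one video embedding, and the function names (`l2_normalize`, `mean_pool`, `zero_shot_classify`) and toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding (an assumption,
    used here as the simplest possible temporal aggregation)."""
    n = len(frame_embeddings)
    return [sum(column) / n for column in zip(*frame_embeddings)]


def zero_shot_classify(frame_embeddings, text_embeddings):
    """Return the label whose text embedding is most similar to the video.

    text_embeddings maps a class name to its text embedding; classes never
    seen during fine-tuning can be added simply by adding entries.
    """
    video = l2_normalize(mean_pool(frame_embeddings))
    scores = {}
    for label, embedding in text_embeddings.items():
        text = l2_normalize(embedding)
        scores[label] = sum(v * t for v, t in zip(video, text))
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional embeddings standing in for real encoder outputs.
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
texts = {"archery": [1.0, 0.0, 0.0], "juggling": [0.0, 1.0, 0.0]}
label, scores = zero_shot_classify(frames, texts)
```

Because the label set is just a dictionary of text embeddings, new action names can be scored at test time without retraining, which is the property the zero-shot setting relies on.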
Findings
Achieves 88.1% zero-shot accuracy on the UCF dataset.
Outperforms existing methods by over 8% on multiple datasets.
Demonstrates effective transfer of CLIP to the video domain with less fine-tuning data.
Abstract
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models…
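The weight interpolation mentioned in the abstract can be illustrated with a generic linear blend between the original CLIP parameters and the fine-tuned ones. This is a minimal sketch under stated assumptions: the summary does not give the details of Interpolated Weight Optimization, so the function `interpolate_weights`, the coefficient name `lam`, and the toy parameter dictionaries are hypothetical, and plain Python lists stand in for tensors.

```python
def interpolate_weights(original, finetuned, lam):
    """Linearly blend two parameter sets: (1 - lam) * original + lam * finetuned.

    lam = 0.0 returns the original weights (maximum generality);
    lam = 1.0 returns the fine-tuned weights (maximum specialization).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if original.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [
            (1.0 - lam) * o + lam * f
            for o, f in zip(original[name], finetuned[name])
        ]
        for name in original
    }


# Toy parameters standing in for real model weights (hypothetical values).
clip_weights = {"proj": [1.0, 2.0]}
video_weights = {"proj": [3.0, 6.0]}

blended = interpolate_weights(clip_weights, video_weights, lam=0.5)
```

Intermediate values of `lam` trade off the generality of the original model against the specialization gained from video fine-tuning; how the coefficient is chosen during training and testing is specific to the paper and not reproduced here.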
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
