Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Zejia Weng; Xitong Yang; Ang Li; Zuxuan Wu; Yu-Gang Jiang

arXiv:2302.00624 · cs.CV · June 1, 2023 · 6 cites

TL;DR

Open-VCLIP turns CLIP into a zero-shot video classifier that can recognize unseen actions. It makes minimal modifications to CLIP and trains with a novel weight interpolation method, achieving state-of-the-art results.

Contribution

The paper introduces Open-VCLIP, a novel approach that adapts CLIP for zero-shot video recognition with minimal changes and a new weight interpolation technique for continual learning.

Findings

01

Achieves 87.9% zero-shot accuracy on the UCF dataset.

02

Outperforms prior state-of-the-art methods by up to 12.2% in zero-shot accuracy.

03

Demonstrates effective generalization to unseen video actions.
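
To make "zero-shot" concrete: CLIP-style models classify by comparing a visual embedding against text embeddings of candidate class names, so new actions can be added at test time just by naming them. The sketch below illustrates that idea only. The embeddings are hand-written toy vectors (real ones would come from the CLIP encoders), the helper names are hypothetical, and averaging frame embeddings is a simplifying assumption, not the spatial-temporal modeling the paper describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding (assumed simplification)."""
    n = len(frame_embeddings)
    return [sum(column) / n for column in zip(*frame_embeddings)]


def zero_shot_classify(frame_embeddings, text_embeddings):
    """Return the label whose text embedding is most similar to the video embedding."""
    video = mean_pool(frame_embeddings)
    return max(text_embeddings, key=lambda label: cosine(video, text_embeddings[label]))


# Toy data: two frames and two class names never seen during training.
frames = [[0.9, 0.1], [0.7, 0.3]]
labels = {"juggling": [1.0, 0.0], "surfing": [0.0, 1.0]}
prediction = zero_shot_classify(frames, labels)
```

Adding a new class in this setup only requires adding another entry to `labels`; no retraining is involved.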

Abstract

Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive zero-shot learning abilities for image understanding, yet limited effort has been made to investigate CLIP for zero-shot video recognition. We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier that can recognize unseen actions and events at test time. Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier, while striving for generalization. We formally show that training an Open-VCLIP is equivalent to continual learning with zero historical data. To address this problem, we propose Interpolated Weight Optimization, which utilizes the benefit of weight interpolation in both training and test time. We evaluate our method on three popular and challenging action…
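
The abstract does not give the details of Interpolated Weight Optimization, but the underlying operation, weight interpolation, can be sketched generically: blend the original CLIP parameters with the fine-tuned video parameters using a coefficient between 0 and 1. The code below is a minimal illustration under that assumption. The function name, the coefficient `lam`, and the toy parameter dictionaries are hypothetical and are not taken from the paper or its repository.

```python
def interpolate_weights(clip_weights, finetuned_weights, lam):
    """Return (1 - lam) * clip + lam * finetuned for every parameter.

    lam = 0.0 recovers the original CLIP weights; lam = 1.0 keeps only
    the fine-tuned weights. Parameters are plain lists of floats here;
    a real model would use tensors.
    """
    if clip_weights.keys() != finetuned_weights.keys():
        raise ValueError("parameter names must match")
    return {
        name: [
            (1.0 - lam) * c + lam * f
            for c, f in zip(clip_weights[name], finetuned_weights[name])
        ]
        for name in clip_weights
    }


# Toy parameters standing in for the two checkpoints (hypothetical values).
clip = {"proj": [1.0, 2.0], "bias": [0.0]}
tuned = {"proj": [3.0, 6.0], "bias": [1.0]}
merged = interpolate_weights(clip, tuned, lam=0.5)
```

Intermediate values of `lam` trade off the generality of the original weights against specialization to video, which is the tension the abstract frames as continual learning with zero historical data.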

Code & Models

Repositories

wengzejia1/open-vclip
PyTorch · Official

Videos

Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Test · Contrastive Language-Image Pre-training