Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

Zuxuan Wu; Zejia Weng; Wujian Peng; Xitong Yang; Ang Li; Larry S. Davis; Yu-Gang Jiang

arXiv:2310.05010·cs.CV·October 10, 2023

TL;DR

This paper introduces Open-VCLIP++, a framework that adapts CLIP for zero-shot video recognition by capturing spatial-temporal relationships and leveraging large language models, achieving state-of-the-art results across multiple datasets.

Contribution

The paper proposes Open-VCLIP++, a novel adaptation of CLIP for zero-shot video classification, incorporating weight interpolation and fine-grained video descriptions for improved generalization.

Findings

01 Achieves a zero-shot accuracy of 88.1% on the UCF dataset.

02 Outperforms existing methods by more than 8 percentage points on multiple datasets.

03 Demonstrates effective transfer of CLIP to the video domain with less fine-tuning data.

Abstract

Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models…
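
The weight interpolation that the abstract refers to can be illustrated with a small, generic sketch. This is not the authors' Interpolated Weight Optimization procedure, whose details are not given on this page; it only shows the basic idea of linearly blending a pretrained set of weights with a fine-tuned one. The function name `interpolate_weights`, the coefficient `lam`, and the toy parameter dictionaries are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(theta_pretrained: Weights,
                        theta_finetuned: Weights,
                        lam: float) -> Weights:
    """Return (1 - lam) * theta_pretrained + lam * theta_finetuned, per parameter.

    lam = 0 keeps the pretrained weights, lam = 1 keeps the fine-tuned weights.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if theta_pretrained.keys() != theta_finetuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")

    merged: Weights = {}
    for name, w_pre in theta_pretrained.items():
        w_ft = theta_finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise linear blend of the two parameter vectors.
        merged[name] = [(1.0 - lam) * a + lam * b for a, b in zip(w_pre, w_ft)]
    return merged


# Toy values only; real checkpoints would hold tensors, not short lists.
clip_weights = {"proj": [1.0, 2.0], "bias": [0.0]}
video_weights = {"proj": [3.0, 6.0], "bias": [1.0]}

merged = interpolate_weights(clip_weights, video_weights, lam=0.5)
print(merged)
```

Intermediate values of `lam` trade off the general knowledge in the pretrained weights against the specialization gained during fine-tuning, which is the intuition behind using interpolation to preserve zero-shot generalization.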

Code & Models

Repositories

wengzejia1/open-vclip (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training