VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu; Gargi Ghosh; Po-Yao Huang; Dmytro Okhonko; Armen Aghajanyan; Florian Metze; Luke Zettlemoyer; Christoph Feichtenhofer

arXiv:2109.14084 · cs.CV · October 4, 2021 · 6 cites

TL;DR

VideoCLIP is a contrastive pre-training method for zero-shot video-text understanding. It trains a transformer on temporally overlapping video-text pairs, contrasted against hard negatives from nearest neighbor retrieval, and reports state-of-the-art results across multiple tasks without using any labels on downstream tasks.

Contribution

The paper proposes VideoCLIP, a contrastive pre-training approach that enables zero-shot video-text understanding without any labels on downstream tasks, surpassing prior work and in some cases outperforming supervised approaches.

Findings

01

State-of-the-art performance on sequence-level text-video retrieval.

02

State-of-the-art results on VideoQA, token-level action localization, and action segmentation.

03

Outperforms some supervised approaches in zero-shot settings.

Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
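The contrastive objective described in the abstract can be illustrated with a small sketch. The code below is not taken from the paper or the MMPT repository; it is a minimal, pure-Python version of a symmetric InfoNCE-style loss, assuming one pooled embedding per video clip and per text segment, L2-normalized embeddings, and in-batch negatives. The function names (`symmetric_info_nce`, `l2_normalize`, `log_softmax_at`) and the `temperature` parameter are hypothetical choices made for illustration. The hard negatives obtained from nearest neighbor retrieval, which the abstract mentions, are not modeled here; other items in the batch stand in for them.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (an assumption of this sketch)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax_at(scores, index):
    """Log-probability of scores[index] under a softmax, computed stably."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_sum


def symmetric_info_nce(video_embs, text_embs, temperature=1.0):
    """Sum of video-to-text and text-to-video contrastive losses.

    Row i of video_embs and row i of text_embs form the positive pair;
    every other row in the batch serves as a negative.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(videos)

    # sim[i][j] = similarity between video i and text j, scaled by temperature.
    sim = [[dot(v, t) / temperature for t in texts] for v in videos]

    # Video-to-text: each video should pick out its own text among all texts.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n

    # Text-to-video: each text should pick out its own video among all videos.
    t2v = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n

    return v2t + t2v


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_info_nce(videos, aligned_texts, temperature=0.1))
    print(symmetric_info_nce(videos, shuffled_texts, temperature=0.1))
```

In this toy setup the loss is close to zero when each video embedding matches its paired text and grows large when the pairing is shuffled, which is the behavior a contrastive objective is meant to reward.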

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning