VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan; Tao Zhu; Zirui Wang; Yuan Cao; Mi Zhang; Soham Ghosh; Yonghui Wu; Jiahui Yu

arXiv:2212.04979·cs.CV·March 17, 2023·20 cites

Open Access

TL;DR

VideoCoCa leverages a pretrained image-text contrastive captioner to efficiently adapt to video-text tasks with minimal training, achieving state-of-the-art zero-shot performance and strong results in downstream tasks.

Contribution

The paper introduces VideoCoCa, a method that reuses a pretrained image-text contrastive model for video-text tasks with minimal adaptation, enabling zero-shot and lightweight fine-tuning capabilities.

Findings

01. State-of-the-art zero-shot video classification results.

02. Effective zero-shot text-to-video retrieval performance.

03. Strong results in video question-answering and captioning after lightweight fine-tuning.

Abstract

We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question-answering and video captioning.
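
The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attentional_pool`, `pool_video`), the tensor sizes, the random queries, and the use of single-head attention without learned projections are all assumptions made for illustration. It only shows that a query-based pooler accepts a token sequence of any length, so per-frame tokens from an image encoder can be concatenated across time and pooled without a separate cross-frame fusion module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(queries, tokens):
    """Single-head cross-attention (simplified, no learned projections).

    queries: (Q, D) pooling queries; tokens: (M, D) token embeddings.
    Returns (Q, D): one attention-weighted average of tokens per query.
    """
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # (Q, M)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ tokens                    # (Q, D)

def pool_video(frame_tokens, queries):
    """frame_tokens: (T, N, D) token embeddings for T frames of N tokens each."""
    t, n, d = frame_tokens.shape
    flat = frame_tokens.reshape(t * n, d)      # flatten frames into one sequence
    return attentional_pool(queries, flat)

# Hypothetical sizes: 8 frames, 16 tokens per frame, 32-dim embeddings.
rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(8, 16, 32))
contrastive_query = rng.normal(size=(1, 32))   # stand-in for a contrastive pooler query
generative_queries = rng.normal(size=(4, 32))  # stand-in for generative pooler queries

video_embedding = pool_video(frame_tokens, contrastive_query)   # shape (1, 32)
caption_inputs = pool_video(frame_tokens, generative_queries)   # shape (4, 32)
```

Because the output shape depends only on the number of queries, the same pooler can be called on a single image (`T = 1`) or on a video with many frames, which is the property the abstract relies on.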

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization