VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

TL;DR
VideoCoCa leverages a pretrained image-text contrastive captioner to efficiently adapt to video-text tasks with minimal training, achieving state-of-the-art zero-shot performance and strong results in downstream tasks.
Contribution
The paper introduces VideoCoCa, a method that reuses a pretrained image-text contrastive captioner (CoCa) for video-text tasks with minimal adaptation, enabling zero-shot transfer and lightweight fine-tuning.
Findings
State-of-the-art zero-shot video classification results.
Effective zero-shot text-to-video retrieval performance.
Strong results in video question-answering and captioning after lightweight fine-tuning.
Abstract
We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning.
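The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attentional_pool`, `pool_video`), the tensor shapes, the single attention head, and the omission of learned key/value projections are all assumptions made for illustration. It only shows that a pooler built from learned queries accepts a variable-length token sequence, so per-frame tokens can be concatenated along the sequence axis and pooled without a separate cross-frame fusion module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries):
    """Single-head cross-attention pooling (illustrative simplification).

    tokens:  (L, D) sequence of token embeddings, any length L
    queries: (Q, D) learned query vectors
    returns: (Q, D) pooled embeddings, one per query
    """
    d = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # (Q, L)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ tokens                    # (Q, D)

def pool_video(frame_tokens, queries):
    """Flatten per-frame tokens into one sequence, then pool.

    frame_tokens: (T, N, D) = T frames, N tokens per frame, D channels
    """
    t, n, d = frame_tokens.shape
    flat = frame_tokens.reshape(t * n, d)      # (T*N, D)
    return attentional_pool(flat, queries)

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(8, 16, 32))    # 8 frames, 16 tokens each
contrastive_query = rng.normal(size=(1, 32))   # one query -> one video embedding
generative_queries = rng.normal(size=(4, 32))  # several queries -> captioning memory

video_embedding = pool_video(frame_tokens, contrastive_query)
caption_memory = pool_video(frame_tokens, generative_queries)
```

Because the pooler's parameters do not depend on sequence length, the same queries can pool a single image's `(N, D)` tokens or a video's `(T*N, D)` tokens. That property is what makes reuse of an image-text model plausible in this sketch.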
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
