Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Zhenhailong Wang; Manling Li; Ruochen Xu; Luowei Zhou; Jie Lei; Xudong Lin; Shuohang Wang; Ziyi Yang; Chenguang Zhu; Derek Hoiem; Shih-Fu Chang; Mohit Bansal; Heng Ji

arXiv:2205.10747 · cs.CV · October 14, 2022 · 52 citations

TL;DR

This paper introduces VidIL, a few-shot video-language learner that uses image-language models to translate video content into text, so that tasks such as captioning, question answering, and future event prediction can be handled without pretraining or finetuning on any video dataset.

Contribution

The paper presents VidIL, a novel approach that uses image-language models for few-shot video-language tasks, eliminating the need for pretraining or finetuning on video datasets.

Findings

01. VidIL achieves strong few-shot performance on video-to-text tasks such as domain-specific captioning, question answering, and future event prediction.

02. VidIL significantly outperforms supervised models in future event prediction.

03. The approach demonstrates the versatility of language models in understanding videos.

Abstract

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without the necessity of pretraining or finetuning on any video datasets. We use the image-language models to translate the video content into frame captions, object, attribute,…
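The descriptor-to-text step described in the abstract can be illustrated with a minimal sketch. It assumes the per-frame outputs of an image-language model (captions, objects, attributes) are already available as strings and are serialized into a prompt alongside a few in-context examples. The names `FrameDescriptors`, `describe_video`, and `build_prompt`, the prompt layout, and the choice to keep frames in temporal order are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FrameDescriptors:
    """Text produced by an image-language model for one sampled frame (hypothetical container)."""
    caption: str
    objects: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)


def describe_video(frames: List[FrameDescriptors]) -> str:
    """Serialize per-frame descriptors into text, one line per frame, in the order given."""
    lines = []
    for i, f in enumerate(frames, start=1):
        parts = [f"Frame {i}: {f.caption}"]
        if f.objects:
            parts.append("Objects: " + ", ".join(f.objects))
        if f.attributes:
            parts.append("Attributes: " + ", ".join(f.attributes))
        lines.append(" | ".join(parts))
    return "\n".join(lines)


def build_prompt(
    instruction: str,
    examples: List[Tuple[List[FrameDescriptors], str]],
    query_frames: List[FrameDescriptors],
) -> str:
    """Assemble instruction, few-shot examples, and the query video into one prompt string."""
    blocks = [instruction]
    for frames, answer in examples:
        blocks.append(describe_video(frames) + "\nAnswer: " + answer)
    # The query block ends with an open "Answer:" for the language model to complete.
    blocks.append(describe_video(query_frames) + "\nAnswer:")
    return "\n\n".join(blocks)


# Toy data, invented purely for illustration.
example = (
    [FrameDescriptors("a person cracks an egg", ["egg", "bowl"], ["white"])],
    "A person is cooking.",
)
query = [
    FrameDescriptors("a dog runs on grass", ["dog"], ["brown"]),
    FrameDescriptors("a dog catches a ball"),
]
prompt = build_prompt("Describe the video.", [example], query)
print(prompt)
```

The resulting string would then be passed to a text-only language model; which model is used and how its output is decoded are outside the scope of this sketch.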

Code & Models

Repositories

mikewangwzhl/vidil
PyTorch · Official

Videos

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling