Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

TL;DR
This paper introduces VidIL, a few-shot video-language learner that uses image-language models to describe video content as text, handling captioning, question answering, and future event prediction without pretraining or finetuning on video datasets.
Contribution
The paper presents VidIL, a novel approach that uses image-language models for few-shot video-language tasks, eliminating the need for pretraining or finetuning on video datasets.
Findings
VidIL achieves strong few-shot performance on multiple video-to-text tasks, including domain-specific captioning, question answering, and future event prediction.
It significantly outperforms supervised models in future event prediction.
The approach demonstrates the versatility of language models in understanding videos.
Abstract
The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without the necessity of pretraining or finetuning on any video datasets. We use the image-language models to translate the video content into frame captions, object, attribute,…
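The idea in the abstract, translating video content into frame captions and object and attribute tokens that a language model can read alongside a few examples, can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `VideoDescription` class, the `build_prompt` helper, the prompt layout, and the sample captions are assumptions chosen for illustration, and the image-language model and the language model themselves are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoDescription:
    """Text-only stand-in for a video (hypothetical structure, not the paper's)."""
    frame_captions: List[str]  # one caption per sampled frame, in time order
    objects: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Flatten the descriptors into plain text, numbering captions by frame."""
        lines = []
        if self.objects:
            lines.append("Objects: " + ", ".join(self.objects))
        if self.attributes:
            lines.append("Attributes: " + ", ".join(self.attributes))
        numbered = " ".join(
            f"[{i + 1}] {caption}" for i, caption in enumerate(self.frame_captions)
        )
        lines.append("Frame captions: " + numbered)
        return "\n".join(lines)


def build_prompt(
    instruction: str,
    examples: List[Tuple[VideoDescription, str]],
    query: VideoDescription,
) -> str:
    """Join an instruction, a few solved examples, and an unsolved query."""
    blocks = [instruction]
    for video, target in examples:
        blocks.append(video.to_text() + "\nVideo caption: " + target)
    # The query ends with an open slot for the language model to complete.
    blocks.append(query.to_text() + "\nVideo caption:")
    return "\n\n".join(blocks)


# Made-up descriptors; in practice an image-language model would produce them.
example_video = VideoDescription(
    frame_captions=["a man holds a pan", "a man cracks an egg into the pan"],
    objects=["pan", "egg"],
    attributes=["metal"],
)
query_video = VideoDescription(
    frame_captions=["a dog runs on grass", "a dog jumps for a frisbee"],
    objects=["dog", "frisbee"],
    attributes=["green"],
)
prompt = build_prompt(
    "Generate a video caption based on the objects, attributes and frame captions.",
    [(example_video, "A man is frying an egg.")],
    query_video,
)
print(prompt)
```

The resulting string would then be passed to a language model of the reader's choice. Under this layout, switching tasks (for example from captioning to question answering) only requires a different instruction and different example targets.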
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
