Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

Avinash Madasu; Anahita Bhiwandiwalla; Vasudev Lal

arXiv:2310.04914·cs.CV·November 28, 2023·1 cites

TL;DR

This study evaluates how well existing image-text models can be adapted to various video understanding tasks in a zero-shot setting, highlighting their strengths and limitations without additional video-specific pretraining.

Contribution

The paper provides a comprehensive analysis of the zero-shot generalization of image-text models on diverse video tasks, demonstrating their potential and limitations.

Findings

1. Strong performance on video action recognition, retrieval, and multiple choice tasks.
2. Moderate performance on video captioning.
3. Poor performance on video question answering.

Abstract

Foundational multimodal models pretrained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is not always feasible due to the difficulty of collecting large-scale clean and aligned data and the exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: can image-text models be adapted to video tasks, and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by presenting a detailed study of the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR),…
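To make the zero-shot setting concrete, here is a minimal sketch of how an image-text model could be applied to video action recognition without any video-specific training. It assumes each sampled frame has already been embedded by the image encoder and each class name by the text encoder. Mean pooling over frames is an assumption chosen for illustration, since the truncated abstract above does not say how the paper aggregates frames. The function names and the toy embeddings are hypothetical.

```python
import math


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video-level embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(frame_embeddings, label_embeddings):
    """Pick the label whose text embedding is closest to the pooled video embedding."""
    video = mean_pool(frame_embeddings)
    scores = {label: cosine(video, emb) for label, emb in label_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy 3-dimensional vectors standing in for real encoder outputs (assumed values).
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
labels = {
    "playing guitar": [1.0, 0.1, 0.0],
    "swimming": [0.0, 1.0, 0.2],
}

prediction, scores = zero_shot_classify(frames, labels)
print(prediction)
```

In practice the frame and label vectors would come from a pretrained image-text model, and no parameter is updated at any point, which is what makes the evaluation zero-shot.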

Code & Models

Repositories

intellabs/multimodal_cognitive_ai (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: Focus