Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal

TL;DR
This study evaluates how well existing image-text models can be adapted to various video understanding tasks in a zero-shot setting, highlighting their strengths and limitations without additional video-specific pretraining.
Contribution
The paper provides a comprehensive analysis of the zero-shot generalization of image-text models on diverse video tasks, demonstrating their potential and limitations.
Findings
Strong performance on video action recognition, retrieval, and multiple-choice tasks.
Moderate performance on video captioning.
Poor performance on video question answering.
Abstract
Foundational multimodal models pretrained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is not always feasible due to the difficulty of collecting large-scale clean and aligned data and the exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: Can image-text models be adapted to video tasks, and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by presenting a detailed study of the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR),…
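To make the zero-shot setting concrete, here is a minimal sketch of one common way an image-text model can be applied to video action recognition without any video training: embed sampled frames, average them into a single video embedding, and pick the class whose text embedding is most similar. This summary does not state how the paper aggregates frames, so the mean pooling, the function names, and the toy embeddings below are all assumptions for illustration, not the authors' method. In practice the embeddings would come from the image and text encoders of a pretrained image-text model.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_video_classify(frame_embeddings, text_embeddings):
    """Hypothetical helper: classify a video from per-frame image embeddings.

    frame_embeddings: array of shape (num_frames, dim)
    text_embeddings:  array of shape (num_classes, dim), one per class prompt
    Returns (predicted_class_index, cosine_scores).
    """
    frames = l2_normalize(np.asarray(frame_embeddings, dtype=np.float64))
    # Assumption: mean-pool the frame embeddings into one video embedding.
    video = l2_normalize(frames.mean(axis=0))
    texts = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))
    # Cosine similarity between the video and each class prompt.
    scores = texts @ video
    return int(np.argmax(scores)), scores


# Toy stand-ins for encoder outputs (3-dimensional for readability).
class_prompts = np.array([[1.0, 0.0, 0.0],   # e.g. "a video of running"
                          [0.0, 1.0, 0.0]])  # e.g. "a video of swimming"
frames = np.array([[0.1, 0.9, 0.0],
                   [0.2, 0.8, 0.1]])
predicted, scores = zero_shot_video_classify(frames, class_prompts)
```

Mean pooling ignores frame order, which is one plausible reason a frame-level approach could do better on recognition and retrieval than on tasks that need temporal reasoning; that reading is speculative and is not a claim made in this summary.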
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
