Videoprompter: an ensemble of foundational models for zero-shot video understanding
Adeel Yousaf, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

TL;DR
This paper introduces Videoprompter, a framework combining vision-language models with generative video-to-text and text-to-text models, enhancing zero-shot video understanding by generating descriptive cues and context-aware prompts.
Contribution
The paper proposes an approach that combines a video-to-text description of the query video with hierarchical prompts to improve zero-shot video classification and retrieval.
Findings
Consistent performance improvements across multiple benchmarks.
Effective enhancement of zero-shot action recognition.
Improved video-to-text and text-to-video retrieval results.
Abstract
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query visual features are not considered. In this paper, we propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models. We introduce two key modifications to the standard zero-shot setting. First, we propose language-guided visual feature enhancement and employ a video-to-text model to convert the query video to its descriptive form. The resulting descriptions contain vital visual cues of the query video, such as what objects are…
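The similarity-based classification described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the embeddings are toy vectors standing in for encoder outputs, and the `fuse` function with its `alpha` weight is a hypothetical way to combine a video embedding with the embedding of its generated description, since the abstract does not state how the two are merged.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def fuse(video_emb, description_emb, alpha=0.5):
    """Hypothetical fusion (an assumption, not from the paper):
    convex combination of the normalized video embedding and the
    normalized embedding of the video's text description."""
    v = l2_normalize(video_emb)
    d = l2_normalize(description_emb)
    return l2_normalize([alpha * x + (1 - alpha) * y for x, y in zip(v, d)])

def classify(query_emb, class_embs):
    """Zero-shot classification: pick the class label whose text
    embedding is most similar to the query embedding."""
    scores = {label: cosine(query_emb, emb) for label, emb in class_embs.items()}
    return max(scores, key=scores.get), scores

# Toy 3-d vectors standing in for real encoder outputs (illustrative only).
class_embs = {
    "playing guitar": [1.0, 0.0, 0.2],
    "swimming": [0.0, 1.0, 0.1],
}
video_emb = [0.6, 0.5, 0.1]        # visual features of the query video
description_emb = [0.9, 0.1, 0.3]  # text embedding of the generated description

query = fuse(video_emb, description_emb)
label, scores = classify(query, class_embs)
print(label, scores)
```

In this toy setup the description embedding pulls the ambiguous video embedding toward the "playing guitar" class, which is the intuition behind enriching the query side as well as the class-label side.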
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
