Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models
Wei Han, Hui Chen, Min-Yen Kan, Soujanya Poria

TL;DR
This paper introduces two frame sampling strategies, MDF and MIF, that select key frames relevant to the question, improving both the efficiency and the accuracy of image-text models on video question-answering.
Contribution
It proposes domain-aware and question-guided frame sampling methods that reduce computational costs while maintaining high accuracy in video question-answering tasks.
Findings
MDF and MIF strategies outperform random sampling in key frame selection.
The methods significantly improve accuracy across multiple datasets and models.
Source code is publicly available for reproducibility.
Abstract
Video question-answering is a fundamental task in the field of video understanding. Although current vision-language models (VLMs) equipped with Video Transformers have enabled temporal modeling and yielded superior results, they do so at the cost of huge computational power and are thus too expensive to deploy in real-time application scenarios. An economical workaround samples only a small portion of frames to represent the main content of the video and tunes an image-text model on these sampled frames. Recent video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the problem. We argue that such aimless sampling may omit the key frames from which the correct answer can be deduced, and the situation gets worse when the sampling sparsity increases, which always happens as…
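To make the abstract's argument concrete, the sketch below contrasts random sparse sampling with a relevance-guided selection on a toy video. It is an illustration only, not the paper's MDF or MIF procedure: the function names, the per-frame `scores` list, and the toy numbers are all assumptions made up for this example, and the scores stand in for whatever question-frame relevance measure a real system would compute.

```python
import math
import random


def random_sample(num_frames, k, seed=None):
    """Baseline: pick k distinct frame indices uniformly at random, in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), k))


def top_k_by_relevance(scores, k):
    """Hypothetical guided sampler: keep the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def miss_probability(num_frames, num_key_frames, k):
    """Chance that uniform sampling of k frames contains none of the key frames."""
    return math.comb(num_frames - num_key_frames, k) / math.comb(num_frames, k)


# Toy video (assumed data): 10 frames, of which only frames 6 and 7 matter for the question.
scores = [0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.9, 0.8, 0.1, 0.1]

guided = top_k_by_relevance(scores, k=2)
baseline = random_sample(len(scores), k=2, seed=0)

print("guided frames:  ", guided)
print("random frames:  ", baseline)
print("P(miss) at k=4: ", round(miss_probability(10, 2, 4), 3))
print("P(miss) at k=2: ", round(miss_probability(10, 2, 2), 3))
```

In this toy setting, sampling 4 of 10 frames at random misses both key frames one time in three, and sampling only 2 misses them 28 times in 45, which illustrates why sparser sampling makes aimless selection riskier.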
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
