Improving Video Question Answering through query-based frame selection
Himanshu Patil, Geo Jolly, Ramana Raja Buddala, Ganesh Ramakrishnan, Rohit Saluja

TL;DR
This paper introduces a query-based frame selection method for VideoQA that improves accuracy by selecting the most relevant frames for answering questions, outperforming uniform sampling strategies.
Contribution
The paper proposes a novel query-based frame selection approach built on submodular mutual information (SMI) functions, improving VideoQA performance over uniform frame sampling.
Findings
Up to 4% accuracy improvement on the MVBench dataset.
Query-based selection aligns the chosen frames with the question better than uniform sampling does.
The method is effective across multiple VideoQA models.
Abstract
Video Question Answering (VideoQA) models enhance understanding and interaction with audiovisual content, making it more accessible, searchable, and useful for a wide range of fields such as education, surveillance, entertainment, and content creation. Due to heavy compute requirements, most large visual language models (VLMs) for VideoQA rely on a fixed number of frames obtained by uniformly sampling the video. However, this process does not pick important frames or capture the context of the video. We present a novel query-based selection of frames relevant to the questions, based on submodular mutual information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of…
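The contrast between uniform sampling and query-based selection can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say which SMI function, similarity kernel, or encoder the paper uses, so the facility-location-style objective, the cosine similarity, the `eta` weight, and the names `uniform_indices` and `select_frames` are all assumptions made for illustration. Frame and query embeddings are assumed to come from some shared vision-language encoder.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def uniform_indices(num_frames, k):
    """Baseline: k frame indices spaced evenly across the video."""
    return np.linspace(0, num_frames - 1, k).round().astype(int).tolist()


def select_frames(frame_emb, query_emb, k, eta=1.0):
    """Greedy query-based frame selection (illustrative sketch).

    Assumed objective, for a selected set A and query set Q:
        F(A) = sum_i min(max_{a in A} s(i, a), eta * max_{q in Q} s(i, q))
    Each frame i contributes how well A covers it, capped by how
    relevant frame i is to the query. The cap favours query-relevant
    frames; the max over A rewards complementary (non-redundant) picks.
    """
    frame_emb = np.asarray(frame_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    n = frame_emb.shape[0]

    s_ff = np.clip(cosine_sim(frame_emb, frame_emb), 0.0, None)  # (n, n)
    s_fq = np.clip(cosine_sim(frame_emb, query_emb), 0.0, None)  # (n, m)
    cap = eta * s_fq.max(axis=1)   # per-frame query-relevance cap, shape (n,)

    cover = np.zeros(n)            # current max_{a in A} s(i, a) for each i
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        current = np.minimum(cover, cap).sum()
        # Column j holds the coverage vector if frame j were added to A.
        new_cover = np.maximum(cover[:, None], s_ff)
        gains = np.minimum(new_cover, cap[:, None]).sum(axis=0) - current
        gains[chosen] = -np.inf    # never pick the same frame twice
        j = int(np.argmax(gains))
        chosen[j] = True
        selected.append(j)
        cover = np.maximum(cover, s_ff[:, j])
    return sorted(selected)        # temporal order for the downstream model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(64, 32))   # stand-in for per-frame embeddings
    query = rng.normal(size=(1, 32))     # stand-in for the question embedding
    print("uniform:", uniform_indices(64, 8))
    print("query-based:", select_frames(frames, query, 8))
```

In a real pipeline the selected indices would replace the uniformly sampled ones before the frames are passed to the VideoQA model. Greedy maximization is a common choice for monotone submodular objectives, but whether the paper uses it is not stated in this summary.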
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
