Open-Ended and Knowledge-Intensive Video Question Answering
Md Zarif Ul Alam, Hamed Zamani

TL;DR
This paper explores knowledge-intensive video question answering using retrieval-augmented generation, demonstrating that optimized retrieval strategies significantly enhance accuracy on complex, open-ended questions in video understanding tasks.
Contribution
The paper presents a comprehensive analysis of retrieval-augmented methods for knowledge-intensive video question answering (KI-VideoQA), highlighting the importance of modality choice, retrieval strategy, and query formulation for improved performance.
Findings
Retrieval augmentation improves KI-VideoQA performance.
Modality and retrieval method critically affect results.
Achieves a 17.5% accuracy improvement on KnowIT VQA.
Abstract
Video question answering that requires external knowledge beyond the visual content remains a significant challenge in AI systems. While models can effectively answer questions based on direct visual observations, they often falter when faced with questions requiring broader contextual knowledge. To address this limitation, we investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation, with a particular focus on handling open-ended questions rather than just multiple-choice formats. Our comprehensive analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models, testing both zero-shot and fine-tuned configurations. We investigate several critical dimensions: the interplay between different information sources and modalities, strategies for integrating diverse…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
