Open-Ended and Knowledge-Intensive Video Question Answering

Md Zarif Ul Alam; Hamed Zamani

arXiv:2502.11747 · cs.IR · February 19, 2025

Open Access

TL;DR

This paper explores knowledge-intensive video question answering using retrieval-augmented generation, demonstrating that optimized retrieval strategies significantly enhance accuracy on complex, open-ended questions in video understanding tasks.

Contribution

The paper presents a comprehensive analysis of retrieval-augmented methods for KI-VideoQA, highlighting how modality choice, retrieval strategy, and query formulation affect performance.

Findings

01. Retrieval augmentation improves KI-VideoQA performance.

02. Modality and retrieval method critically affect results.

03. The approach achieves a 17.5% accuracy improvement on KnowIT VQA.

Abstract

Video question answering that requires external knowledge beyond the visual content remains a significant challenge in AI systems. While models can effectively answer questions based on direct visual observations, they often falter when faced with questions requiring broader contextual knowledge. To address this limitation, we investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation, with a particular focus on handling open-ended questions rather than just multiple-choice formats. Our comprehensive analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models, testing both zero-shot and fine-tuned configurations. We investigate several critical dimensions: the interplay between different information sources and modalities, strategies for integrating diverse…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning