Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach
Ju-Young Oh

TL;DR
This paper introduces FIQ, a novel framework that generates foundational Q&A pairs from videos to improve reasoning and generalization in video question answering models, achieving state-of-the-art results.
Contribution
The paper presents a new embedding-integrated approach for generating scene-level Q&A pairs and a VQ-CAlign module for better alignment of question embeddings with visual features.
Findings
FIQ outperforms baseline models on the SUTD-TrafficQA dataset.
Generated Q&A pairs enrich scene understanding and reasoning.
VQ-CAlign improves task-specific embedding alignment.
Abstract
Conventional VQA approaches primarily rely on question-answer (Q&A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model's ability to capture the comprehensive context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment, ultimately limiting its generalization and reasoning capability. In this paper, we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. FIQ generates Q&A pairs from descriptive information extracted directly from videos, thereby…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
