FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering
Ju-Young Oh, Ho-Joong Kim, and Seong-Whan Lee

TL;DR
This paper introduces FIQ, a novel method that generates fundamental Q&A pairs from videos to improve reasoning and generalization in video question answering, achieving state-of-the-art results on the SUTD-TrafficQA dataset.
Contribution
FIQ is the first approach to generate fundamental scene-based Q&A pairs to enhance reasoning in VQA, integrating question embeddings with visual features for better adaptability.
Findings
FIQ outperforms existing methods on the SUTD-TrafficQA dataset.
Generated Q&A pairs improve model understanding of scene context.
VQ-CAlign module preserves domain-specific details for better task performance.
Abstract
Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model's capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
