Invariant Grounding for Video Question Answering
Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua

TL;DR
This paper introduces Invariant Grounding for VideoQA (IGV), a causal framework that improves reasoning accuracy and robustness by focusing on question-critical scenes and reducing reliance on spurious correlations.
Contribution
The paper proposes a novel causal learning framework, IGV, that enhances VideoQA models by grounding question-critical scenes invariantly across interventions.
Findings
IGV outperforms baselines in accuracy on three datasets.
IGV improves visual explainability of VideoQA models.
IGV enhances generalization to unseen data.
Abstract
Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is understanding the alignments between visual scenes in the video and linguistic semantics in the question to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers as the alignments. However, ERM can be problematic, because it tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes. As a result, the VideoQA models suffer from unreliable reasoning. In this work, we first take a causal look at VideoQA and argue that invariant grounding is the key to ruling out the spurious correlations. Towards this end, we propose a new learning framework, Invariant Grounding…
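The idea in the abstract, that the answer should depend on question-critical scenes and stay unchanged when question-irrelevant scenes are altered, can be illustrated with a toy sketch. This is not the paper's architecture or training objective: the grounding mask is set by hand rather than learned, the linear scorer is a stand-in, and every name here (`intervene`, `predict_grounded`, `predict_all`, the feature values) is hypothetical.

```python
# Toy illustration only: a hand-set mask and a linear scorer stand in for
# learned components. None of these names come from the paper.

def intervene(video, mask, donor):
    """Keep clips where mask is 1; swap every other clip for the donor's clip."""
    return [clip if keep else other for clip, keep, other in zip(video, mask, donor)]

def mean_pool(clips):
    """Average equal-length feature vectors, dimension by dimension."""
    return [sum(dim) / len(clips) for dim in zip(*clips)]

def answer_scores(pooled, weights):
    """One dot-product score per candidate answer."""
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]

def predict_grounded(video, mask, weights):
    """Score answers from the masked-in (assumed question-critical) clips only."""
    kept = [clip for clip, keep in zip(video, mask) if keep]
    return answer_scores(mean_pool(kept), weights)

def predict_all(video, weights):
    """Score answers from every clip, relevant or not."""
    return answer_scores(mean_pool(video), weights)

# Four clips with 2-d features; clips 0 and 2 are assumed question-critical.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
donor = [[9.0, 9.0], [8.0, 8.0], [7.0, 7.0], [6.0, 6.0]]  # clips from another video
mask = [1, 0, 1, 0]
weights = [[1.0, 0.0], [0.0, 1.0]]  # two candidate answers

intervened = intervene(video, mask, donor)

grounded_before = predict_grounded(video, mask, weights)
grounded_after = predict_grounded(intervened, mask, weights)
all_before = predict_all(video, weights)
all_after = predict_all(intervened, weights)

print("grounded scores unchanged:", grounded_before == grounded_after)
print("all-clip scores unchanged:", all_before == all_after)
```

The grounded predictor never reads the swapped clips, so its scores cannot change under the intervention, whereas the all-clip predictor shifts with whatever the irrelevant clips contain. In a learned model the mask would itself be predicted, and invariance would have to be encouraged by the training objective rather than guaranteed by construction.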
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
