Invariant Grounding for Video Question Answering

Yicong Li; Xiang Wang; Junbin Xiao; Wei Ji; Tat-Seng Chua

arXiv:2206.02349 · cs.CV · June 7, 2022

TL;DR

This paper introduces Invariant Grounding for VideoQA (IGV), a causal framework that improves reasoning accuracy and robustness by focusing on question-critical scenes and reducing reliance on spurious correlations.

Contribution

The paper proposes a novel causal learning framework, IGV, that enhances VideoQA models by grounding question-critical scenes invariantly across interventions.

Findings

01. IGV outperforms baselines in accuracy on three datasets.

02. IGV improves visual explainability of VideoQA models.

03. IGV enhances generalization to unseen data.

Abstract

Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is understanding the alignment between the visual scenes in the video and the linguistic semantics of the question in order to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers as the alignment. However, ERM can be problematic, because it tends to over-exploit the spurious correlations between question-irrelevant scenes and answers instead of inspecting the causal effect of question-critical scenes. As a result, VideoQA models suffer from unreliable reasoning. In this work, we first take a causal look at VideoQA and argue that invariant grounding is the key to ruling out the spurious correlations. Towards this end, we propose a new learning framework, Invariant Grounding…

Code & Models

Repositories

yl3800/igv
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques