Equivariant and Invariant Grounding for Video Question Answering

Yicong Li, Xiang Wang, Junbin Xiao, and Tat-Seng Chua

arXiv:2207.12783 · cs.CL · July 27, 2022 · 1 citation

TL;DR

This paper introduces EIGV, a self-interpretable framework for VideoQA that explicitly grounds question-critical cues and distinguishes causal scenes from environment scenes, enhancing interpretability and accuracy.

Contribution

The paper proposes an intrinsic interpretability approach using equivariant and invariant grounding to improve visual-linguistic alignment in VideoQA models.

Findings

01. EIGV outperforms baselines in accuracy on three datasets.
02. EIGV provides clear visual explanations of answer reasoning.
03. The method effectively distinguishes causal scenes from environment scenes.

Abstract

Video Question Answering (VideoQA) is the task of answering natural language questions about a video. Producing an answer requires understanding the interplay between the visual scenes in the video and the linguistic semantics of the question. However, most leading VideoQA models work as black boxes, which makes the visual-linguistic alignment behind the answering process obscure. This black-box nature calls for visual explainability that reveals "What part of the video should the model look at to answer the question?". Only a few works present visual explanations in a post-hoc fashion, emulating the target model's answering process via an additional method. Nonetheless, the emulation struggles to faithfully exhibit the visual-linguistic alignment during answering. Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its…

Code & Models

Repositories

yl3800/eigv (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling