Equivariant and Invariant Grounding for Video Question Answering
Yicong Li, Xiang Wang, Junbin Xiao, and Tat-Seng Chua

TL;DR
This paper introduces EIGV, a self-interpretable framework for VideoQA that explicitly grounds question-critical cues and distinguishes causal scenes from environment scenes, improving both interpretability and accuracy.
Contribution
The paper proposes an intrinsic interpretability approach using equivariant and invariant grounding to improve visual-linguistic alignment in VideoQA models.
Findings
EIGV outperforms baselines in accuracy on three datasets.
EIGV provides clear visual explanations of answer reasoning.
The method effectively distinguishes causal scenes from environment scenes.
Abstract
Video Question Answering (VideoQA) is the task of answering natural language questions about a video. Producing an answer requires understanding the interplay between the visual scenes in the video and the linguistic semantics of the question. However, most leading VideoQA models work as black boxes, which makes the visual-linguistic alignment behind the answering process obscure. This black-box nature calls for visual explainability that reveals "What part of the video should the model look at to answer the question?" Only a few works present visual explanations in a post-hoc fashion, which emulates the target model's answering process via an additional method. Nonetheless, the emulation struggles to faithfully exhibit the visual-linguistic alignment during answering. Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
