Discovering Spatio-Temporal Rationales for Video Question Answering
Yicong Li, Junbin Xiao, Chun Feng, Xiang Wang, Tat-Seng Chua

TL;DR
This paper introduces TranSTR, a novel Transformer-based model with a differentiable spatio-temporal rationalization module that identifies critical video moments and objects, significantly improving complex VideoQA performance.
Contribution
It proposes a new Spatio-Temporal Rationalization (STR) module and a Transformer-style architecture, TranSTR, for better reasoning in complex VideoQA tasks.
Findings
Achieves new state-of-the-art results on four VideoQA datasets.
Significantly outperforms previous methods on NExT-QA and Causal-VidQA.
Verifies the effectiveness of STR and the answer interaction mechanism.
Abstract
This paper strives to solve complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times. To tackle the challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. Towards this, we propose Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects then serve as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves…
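The selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`select_top_k`, `rationalize`), the dot-product relevance score, the frames-then-objects order, and the toy 2-D features are all assumptions made for illustration. It also uses a hard top-k, which is not differentiable; the abstract describes STR as differentiable but this summary does not say how that is achieved, so a trainable version would need some relaxation of the selection step.

```python
import math


def dot(u, v):
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(features, question, k):
    """Score each feature against the question and keep the k best.

    Returns the selected indices in their original order, plus the
    softmax relevance weights. Hard top-k is an assumption here and
    is NOT differentiable.
    """
    weights = softmax([dot(f, question) for f in features])
    ranked = sorted(range(len(features)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k]), weights


def rationalize(frames, objects_per_frame, question, k_frames, k_objects):
    """Hypothetical two-stage selection: critical frames, then objects in them."""
    frame_idx, _ = select_top_k(frames, question, k_frames)
    rationale = {}
    for t in frame_idx:
        obj_idx, _ = select_top_k(objects_per_frame[t], question, k_objects)
        rationale[t] = obj_idx
    return rationale


# Toy inputs (made up): 4 frames, a few objects per frame, 2-D features.
question = [1.0, 0.0]
frames = [[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.0, 1.0]]
objects_per_frame = [
    [[0.0, 1.0], [0.2, 0.3]],
    [[0.9, 0.0], [0.1, 0.5], [0.4, 0.4]],
    [[0.2, 0.2], [0.7, 0.1], [0.95, 0.0]],
    [[0.0, 0.0], [0.1, 0.1]],
]

# Maps each selected frame index to the selected object indices in that frame.
rationale = rationalize(frames, objects_per_frame, question, k_frames=2, k_objects=1)
print(rationale)
```

In this toy setup the selected frames and objects play the role of the "grounded rationales" that the abstract says are passed on to answer reasoning; everything not selected is simply dropped.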
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
