Discovering Spatio-Temporal Rationales for Video Question Answering

Yicong Li; Junbin Xiao; Chun Feng; Xiang Wang; Tat-Seng Chua

arXiv:2307.12058·cs.CV·July 25, 2023

TL;DR

This paper introduces TranSTR, a novel Transformer-based model with a differentiable spatio-temporal rationalization module that identifies critical video moments and objects, significantly improving complex VideoQA performance.

Contribution

It proposes a new Spatio-Temporal Rationalization (STR) module and a Transformer-style architecture, TranSTR, for better reasoning in complex VideoQA tasks.

Findings

01. Achieves new state-of-the-art results on four VideoQA datasets.

02. Significantly outperforms previous methods on NExT-QA and Causal-VidQA.

03. Verifies the effectiveness of STR and answer interaction mechanisms.

Abstract

This paper addresses complex video question answering (VideoQA), which involves long videos containing multiple objects and events at different times. To tackle this challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects within the vast amount of video content. To this end, we propose Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects then serve as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as its core and additionally introduces a novel answer interaction mechanism that coordinates with STR for answer decoding. Experiments on four datasets show that TranSTR achieves…
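The selection idea in the abstract can be illustrated with a small forward-pass sketch. It is not the authors' implementation: the function names (`select_top_k`, `rationalize`), the dot-product scoring, the frames-then-objects ordering, and the toy feature vectors are all assumptions made for illustration. It also uses a hard top-k, which is not differentiable; a trainable module would need a differentiable relaxation inside an autograd framework, and this summary does not specify which one STR uses.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_top_k(features, question, k):
    """Score each feature vector against the question (assumed dot-product
    relevance) and return the indices of the k best, in original order."""
    weights = softmax([dot(f, question) for f in features])
    ranked = sorted(range(len(features)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k]), weights

def rationalize(video, question, k_frames, k_objects):
    """Hypothetical two-stage selection: keep the most question-relevant
    frames, then the most relevant objects inside each kept frame."""
    frame_ids, _ = select_top_k([f["feat"] for f in video], question, k_frames)
    rationale = []
    for t in frame_ids:
        obj_ids, _ = select_top_k(video[t]["objects"], question, k_objects)
        rationale.append((t, obj_ids))
    return rationale

# Toy inputs: a 2-d question embedding and four frames with three objects each.
question = [1.0, 0.0]
video = [
    {"feat": [0.1, 0.9], "objects": [[0.0, 1.0], [0.2, 0.1], [0.1, 0.0]]},
    {"feat": [0.9, 0.1], "objects": [[0.8, 0.0], [0.1, 0.5], [0.9, 0.2]]},
    {"feat": [0.2, 0.2], "objects": [[0.3, 0.3], [0.0, 0.0], [0.1, 0.9]]},
    {"feat": [0.7, 0.3], "objects": [[0.0, 0.4], [0.6, 0.1], [0.5, 0.5]]},
]

rationale = rationalize(video, question, k_frames=2, k_objects=2)
print(rationale)  # [(1, [0, 2]), (3, [1, 2])]
```

Each tuple pairs a kept frame index with the indices of the objects kept inside that frame. Only this reduced set would be passed on to answer reasoning, which is the sense in which the selected moments and objects act as rationales.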

Code & Models

Repositories

yl3800/transtr (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning