Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
Yi Cheng, Hehe Fan, Dongyun Lin, Ying Sun, Mohan Kankanhalli, and Joo-Hwee Lim

TL;DR
This paper introduces a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. It weights keywords with attention during question encoding, models relative relations between objects, and disentangles spatial and temporal reasoning, and it reports better results than prior methods on TGIF-QA, MSVD-QA, and MSRVTT-QA.
Contribution
The paper proposes a new KRST graph network that integrates keyword attention and relative relation modeling, and separates spatial and temporal reasoning for better VideoQA performance.
Findings
KRST outperforms state-of-the-art methods on TGIF-QA, MSVD-QA, and MSRVTT-QA datasets.
Keyword attention improves question feature relevance.
Disentangling spatial and temporal graphs enhances reasoning accuracy.
Abstract
The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate the relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we…
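The keyword-aware question encoding described in the abstract can be illustrated with a minimal sketch of attention pooling over word features. This is not the authors' implementation: the function names, the scoring vector `w`, and the toy word features are all hypothetical, and the paper's actual encoder and scoring function may differ. The sketch only shows the general idea of giving higher softmax weights to more relevant words before pooling them into one question feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def keyword_attention_pool(word_feats, w):
    """Pool word features into a single question feature.

    word_feats: list of word vectors (one per question token).
    w: scoring vector; a hypothetical stand-in for a learned parameter.
    Returns (attention weights, pooled question feature).
    """
    # Score each word by a dot product with the scoring vector.
    scores = [sum(x * wi for x, wi in zip(vec, w)) for vec in word_feats]
    weights = softmax(scores)
    dim = len(word_feats[0])
    # Weighted sum of word vectors, dimension by dimension.
    pooled = [
        sum(a * vec[d] for a, vec in zip(weights, word_feats))
        for d in range(dim)
    ]
    return weights, pooled


# Toy example: the second "word" is assumed to be the keyword,
# so its feature vector aligns most strongly with w.
word_feats = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
w = [1.0, 1.0]
weights, question_feat = keyword_attention_pool(word_feats, w)
```

In this toy setup the second token receives the largest weight, so the pooled question feature is dominated by it. In a KRST-style pipeline, a feature of this kind would then guide video graph construction, as the abstract states.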
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Attentive Walk-Aggregating Graph Neural Network
