Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Yi Cheng; Hehe Fan; Dongyun Lin; Ying Sun; Mohan Kankanhalli; Joo-Hwee Lim

arXiv:2307.13250·cs.CV·July 26, 2023

TL;DR

This paper introduces a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. It improves relation modeling by adding keyword attention to question encoding and by disentangling spatial and temporal reasoning, and it outperforms prior methods on TGIF-QA, MSVD-QA, and MSRVTT-QA.

Contribution

The paper proposes a new KRST graph network that integrates keyword attention and relative relation modeling, and separates spatial and temporal reasoning for better VideoQA performance.

Findings

01

KRST outperforms state-of-the-art methods on TGIF-QA, MSVD-QA, and MSRVTT-QA datasets.

02

Keyword attention improves question feature relevance.

03

Disentangling spatial and temporal graphs enhances reasoning accuracy.

Abstract

The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate the relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we…
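The keyword-attention step in the abstract can be illustrated with a minimal sketch of attention pooling over word features. This is not the paper's implementation: the function names, the scoring vector `w`, and the toy feature values are all assumptions made for illustration. It only shows the general idea of scoring each word, normalizing the scores with a softmax, and taking a weighted sum so that high-scoring words (the "keywords") dominate the question feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def keyword_attention_pool(word_feats, w):
    """Pool per-word features into one question feature.

    word_feats: list of T word vectors, each of dimension d.
    w: scoring vector of dimension d (hypothetical stand-in for a
       learned parameter; the paper's actual scoring function is not
       given in this summary).
    Returns (weights, pooled), where weights sums to 1.
    """
    # One scalar relevance score per word (dot product with w).
    scores = [sum(x * wi for x, wi in zip(feat, w)) for feat in word_feats]
    weights = softmax(scores)
    dim = len(word_feats[0])
    # Weighted sum of word vectors: words with high weight dominate.
    pooled = [
        sum(weights[t] * word_feats[t][k] for t in range(len(word_feats)))
        for k in range(dim)
    ]
    return weights, pooled


# Toy example: three words, two-dimensional features (made-up values).
word_feats = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
w = [1.0, 1.0]
weights, pooled = keyword_attention_pool(word_feats, w)
```

In this toy input the second word has the largest score, so it receives the largest attention weight and contributes most to the pooled question feature.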

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Attentive Walk-Aggregating Graph Neural Network