Structured Two-stream Attention Network for Video Question Answering

Lianli Gao; Pengpeng Zeng; Jingkuan Song; Yuan-Fang Li; Wu Liu; Tao Mei; Heng Tao Shen

arXiv:2206.01017·cs.CV·June 3, 2022

TL;DR

This paper introduces a Structured Two-stream Attention network (STA) for video question answering, effectively capturing long-range temporal and spatial structures to improve answer accuracy on large-scale datasets.

Contribution

The paper proposes a novel STA model that jointly reasons across spatial and temporal video structures with structured attention, advancing video QA capabilities.

Findings

01. Significantly outperforms existing methods on the TGIF-QA dataset.

02. Achieves a 13-14% improvement on key video QA tasks.

03. Effectively localizes relevant visual instances and reduces background influence.

Abstract

To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA that focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important…
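The abstract is cut off before it describes the attention component in full, so the following is only a minimal sketch of the general idea it names: grouping frames into segments and then weighting those segments against the question. Everything in it is an assumption made for illustration, not the authors' implementation: the function names `segment_pool` and `attend`, the use of mean pooling per segment, and the dot-product scoring followed by a softmax are all hypothetical choices.

```python
import math


def segment_pool(frames, num_segments):
    """Split per-frame feature vectors into contiguous segments and mean-pool each.

    Assumption: mean pooling stands in for whatever segment encoder the paper uses.
    """
    n = len(frames)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and the number of frames")
    dim = len(frames[0])
    pooled = []
    for s in range(num_segments):
        # Integer boundaries keep segments contiguous and non-empty.
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        chunk = frames[start:end]
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(segments, question):
    """Weight segments by dot-product similarity to a question vector.

    Assumption: dot-product scoring is a generic attention choice, not the
    paper's structured two-stream attention.
    """
    scores = [sum(a * b for a, b in zip(seg, question)) for seg in segments]
    weights = softmax(scores)
    dim = len(question)
    context = [sum(w * seg[d] for w, seg in zip(weights, segments)) for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Toy input: four 2-d frame features, grouped into two segments.
    toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    toy_question = [1.0, 0.0]
    toy_weights, toy_context = attend(segment_pool(toy_frames, 2), toy_question)
    print(toy_weights, toy_context)
```

In this toy setup the segment whose pooled feature aligns with the question vector receives the larger weight, which mirrors, in a very reduced form, the finding that the model focuses on relevant content and reduces background influence.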

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Attentive Walk-Aggregating Graph Neural Network