Structured Two-stream Attention Network for Video Question Answering
Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, Heng Tao Shen

TL;DR
This paper introduces a Structured Two-stream Attention network (STA) for video question answering, effectively capturing long-range temporal and spatial structures to improve answer accuracy on large-scale datasets.
Contribution
The paper proposes a novel STA model that jointly reasons across spatial and temporal video structures with structured attention, advancing video QA capabilities.
Findings
Significantly outperforms existing methods on the TGIF-QA dataset.
Achieves 13-14% improvement on key video QA tasks.
Effectively localizes relevant visual instances and reduces background influence.
Abstract
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA that focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important…
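The two steps the abstract names (grouping a video into segments, then attending over them with the question) can be illustrated with a small sketch. This is not the authors' STA model. The helper names `split_into_segments` and `attend`, the mean-pooling of frames, and the dot-product scoring are all assumptions chosen for illustration, and the sketch covers only a single question-guided attention pass rather than the paper's two streams.

```python
import math


def split_into_segments(frame_feats, num_segments):
    """Average consecutive frame features into `num_segments` segment features.

    Assumption: mean-pooling stands in for the paper's structured segment
    component, whose actual design is not given in this summary.
    """
    n = len(frame_feats)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and the number of frames")
    dim = len(frame_feats[0])
    segments = []
    for s in range(num_segments):
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        chunk = frame_feats[start:end]
        segments.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return segments


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product attention: weight each item by its similarity to `query`.

    Returns the attention weights and the weighted sum of the items.
    Assumption: plain dot-product scoring, chosen only for illustration.
    """
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return weights, context


# Toy input: four 2-d frame features and a 2-d "question" vector (both made up).
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
question = [2.0, 0.0]

segments = split_into_segments(frames, 2)
weights, context = attend(question, segments)
```

In this toy input the question vector aligns with the first segment, so that segment receives the larger attention weight and dominates the context vector. A background segment that does not match the question is down-weighted in the same way.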
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Attentive Walk-Aggregating Graph Neural Network
