Video Question Answering via Attribute-Augmented Attention Network Learning
Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, Yueting Zhuang

TL;DR
This paper introduces an attribute-augmented attention network that models temporal dynamics and performs multi-step reasoning for improved video question answering, addressing limitations of static image-based methods.
Contribution
It proposes a novel attribute-augmented attention framework with joint attribute detection and multi-step reasoning for video question answering.
Findings
Effective on multiple-choice and open-ended tasks
Improves performance over existing methods
Constructed a large-scale VQA dataset
Abstract
Video question answering is a challenging problem in visual information retrieval: the goal is to answer a question about the referenced video content. However, existing visual question answering approaches mainly tackle static image question answering, and they may be ineffective for video question answering because they do not sufficiently model the temporal dynamics of video content. In this paper, we study the problem of video question answering by modeling its temporal dynamics with a frame-level attention mechanism. We propose an attribute-augmented attention network learning framework that enables joint frame-level attribute detection and unified video representation learning for video question answering. We then incorporate a multi-step reasoning process into the proposed attention network to further improve performance. We construct a large-scale…
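The frame-level attention and multi-step reasoning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes dot-product scoring, attribute augmentation by element-wise addition of a hypothetical attribute embedding to each frame feature, and a query that is refined by adding the attended context after each step. All function and parameter names here are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(frames, query):
    """One attention step: weight each frame by its similarity to the query."""
    weights = softmax([dot(frame, query) for frame in frames])
    context = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(len(query))
    ]
    return context, weights


def multi_step_attention(frame_feats, attr_feats, question, steps=2):
    """Attend over attribute-augmented frames for a fixed number of steps.

    Assumption: augmentation is element-wise addition of frame and
    attribute vectors; the paper's actual fusion may differ.
    Assumption: the query is refined as query + context after each step.
    """
    augmented = [
        [f + a for f, a in zip(frame, attr)]
        for frame, attr in zip(frame_feats, attr_feats)
    ]
    query = list(question)
    context, weights = [0.0] * len(query), []
    for _ in range(steps):
        context, weights = attend(augmented, query)
        query = [q + c for q, c in zip(query, context)]
    return context, weights


# Toy example: three 2-D frame features, zero attribute embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attrs = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
context, weights = multi_step_attention(frames, attrs, [5.0, 0.0], steps=2)
```

In this toy run the question vector points along the first dimension, so frames with a large first component receive more attention weight than the frame without one. The returned weights are those of the final reasoning step.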
