Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks

Zhu Zhang; Zhou Zhao; Zhijie Lin; Jingkuan Song; Xiaofei He

arXiv:1906.12158·cs.CV·July 1, 2019·1 cites

Open Access

TL;DR

This paper introduces a Hierarchical Convolutional Self-Attention (HCSA) network for long-form video question answering that models long-range dependencies and lowers computational cost relative to prior methods designed for short-form video.

Contribution

It proposes a novel hierarchical convolutional self-attention encoder-decoder architecture tailored for long-form videos, improving dependency modeling and efficiency.

Findings

01 Outperforms existing methods on long-form video QA tasks.

02 Efficiently models long-range dependencies in videos.

03 Reduces computational costs significantly.

Abstract

Open-ended video question answering aims to automatically generate a natural-language answer from the referenced video content according to a given question. Currently, most existing approaches focus on short-form video question answering with multi-modal recurrent encoder-decoder networks. Although these works have achieved promising performance, they may be ineffective when applied to long-form video question answering because they lack long-range dependency modeling and incur heavy computational cost. To tackle these problems, we propose a fast Hierarchical Convolutional Self-Attention encoder-decoder network (HCSA). Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video…
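
The encoder idea named in the abstract (convolution for local context, self-attention for long-range dependencies, and a hierarchy over the video sequence) can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract is truncated and gives no layer details, so the function names, the ReLU, the residual connection, the additive question conditioning, and the pairwise averaging used to build the hierarchy are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_same(x, w):
    """1-D convolution over time with zero padding.
    x: (T, d) frame features; w: (k, d, d) kernel with odd k."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Sum the k taps; each tap is a (T, d) @ (d, d) product.
    return sum(xp[i:i + x.shape[0]] @ w[i] for i in range(k))

def self_attention(x, q):
    """Scaled dot-product self-attention over frames.
    ASSUMPTION: the question vector q is simply added to every frame
    before scoring, as a stand-in for question-aware attention."""
    h = x + q
    scores = h @ h.T / np.sqrt(x.shape[1])
    return softmax(scores, axis=-1) @ x

def merge_pairs(x):
    """ASSUMPTION: build the next level by averaging adjacent frames
    (a trailing odd frame is dropped)."""
    t = x.shape[0] - x.shape[0] % 2
    return x[:t].reshape(t // 2, 2, -1).mean(axis=1)

def hierarchical_encoder(frames, question, weights):
    """Return the feature sequence produced at each level of the hierarchy."""
    levels = []
    x = frames
    for w in weights:
        x = np.maximum(conv1d_same(x, w), 0.0)  # local context + ReLU
        x = x + self_attention(x, question)      # long-range context, residual
        levels.append(x)
        x = merge_pairs(x)                       # halve length for next level
    return levels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(16, 8))    # 16 frames, 8-dim features
    question = rng.normal(size=8)        # pooled question embedding
    weights = [rng.normal(scale=0.1, size=(3, 8, 8)) for _ in range(3)]
    levels = hierarchical_encoder(frames, question, weights)
    print([lv.shape for lv in levels])
```

With 16 input frames and three levels, the sketch yields sequences of length 16, 8, and 4. In this toy version the attention cost at each level is quadratic in that level's length, so halving the sequence per level shrinks the attention matrix fourfold each time; whether the paper's efficiency gain arises the same way cannot be determined from the truncated abstract.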

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning