Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Xiaofei He

TL;DR
This paper introduces a Hierarchical Convolutional Self-Attention (HCSA) encoder-decoder network for open-ended long-form video question answering. It targets two weaknesses of the recurrent encoder-decoder networks used for short-form video QA: weak long-range dependency modeling and heavy computational cost.
Contribution
It proposes a novel hierarchical convolutional self-attention encoder-decoder architecture tailored for long-form videos, improving dependency modeling and efficiency.
Findings
Outperforms existing methods on long-form video QA tasks.
Efficiently models long-range dependencies in videos.
Reduces computational costs significantly.
Abstract
Open-ended video question answering aims to automatically generate a natural-language answer from the referenced video content according to a given question. Currently, most existing approaches focus on short-form video question answering with multi-modal recurrent encoder-decoder networks. Although these works have achieved promising performance, they may be ineffective for long-form video question answering because they lack long-range dependency modeling and suffer from heavy computational cost. To tackle these problems, we propose a fast Hierarchical Convolutional Self-Attention encoder-decoder network (HCSA). Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video…
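The abstract is cut off before the encoder is fully described, so the following is only an illustrative sketch of the general idea, not the authors' architecture. It assumes a toy design in which each layer applies a temporal convolution, splits the sequence into fixed-length segments, runs question-conditioned self-attention inside each segment, and mean-pools each segment so the next layer sees a shorter sequence. All function names, the additive question fusion, the tanh activation, the mean pooling, and the segment length are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_same(x, w):
    """Temporal convolution with zero padding.
    x: (T, d) frame features, w: (k, d, d) kernel with odd k. Returns (T, d)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        sum(xp[t + j] @ w[j] for j in range(k)) for t in range(x.shape[0])
    ])

def question_aware_self_attention(x, q):
    """Self-attention over x: (L, d) whose scores are conditioned on q: (d,).
    Adding q to every position is an assumed fusion, chosen for simplicity."""
    d = x.shape[1]
    xq = x + q
    scores = xq @ xq.T / np.sqrt(d)       # (L, L) pairwise scores
    return softmax(scores, axis=-1) @ x   # (L, d) attended features

def encoder_layer(x, q, w, segment_len):
    """Convolve, attend within segments, then pool each segment to one vector."""
    h = np.tanh(conv1d_same(x, w))
    n_seg = h.shape[0] // segment_len     # trailing frames that do not fill a segment are dropped
    h = h[: n_seg * segment_len].reshape(n_seg, segment_len, -1)
    attended = np.stack([question_aware_self_attention(seg, q) for seg in h])
    return attended.mean(axis=1)          # (n_seg, d)

def hierarchical_encode(frames, q, weights, segment_len=4):
    """Stack layers; each level is segment_len times shorter than the previous one."""
    x, levels = frames, []
    for w in weights:
        x = encoder_layer(x, q, w, segment_len)
        levels.append(x)
    return levels

# Toy usage with random features: 64 frames, 8-dimensional features, 2 layers.
rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 8))
question = rng.normal(size=8)
weights = [rng.normal(scale=0.1, size=(3, 8, 8)) for _ in range(2)]
levels = hierarchical_encode(frames, question, weights)
print([lvl.shape for lvl in levels])
```

In this sketch the attention cost per layer scales with the segment length rather than the full video length, while upper levels cover progressively longer time spans. That is one plausible way a hierarchy can trade global attention for cheaper local attention; the paper's actual mechanism may differ.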
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
