Dense but Efficient VideoQA for Intricate Compositional Reasoning
Jihyeon Lee, Wooyoung Kang, Eun-Sol Kim

TL;DR
This paper introduces a transformer-based VideoQA model with deformable attention that efficiently handles complex, compositional questions over long videos by focusing on informative features and understanding question relations.
Contribution
The paper proposes a dense yet efficient VideoQA method that combines deformable attention with dependency-aware language modeling for complex reasoning tasks.
Findings
Outperforms baseline models on complex VideoQA datasets
Efficiently samples informative visual features over long video sequences
Effectively models relations within complex question sentences
Abstract
It is well known that most conventional video question answering (VideoQA) datasets consist of easy questions requiring simple reasoning processes. However, long videos inevitably contain complex and compositional semantic structures along the spatio-temporal axes, which requires a model to understand the compositional structures inherent in the videos. In this paper, we suggest a new compositional VideoQA method based on a transformer architecture with a deformable attention mechanism to address complex VideoQA tasks. Deformable attention is introduced to sample a subset of informative visual features from the dense visual feature map, so that a temporally long range of frames can be covered efficiently. Furthermore, the dependency structure within complex question sentences is combined with the language embeddings to readily capture the relations among question words.…
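To make the sampling idea in the abstract concrete, here is a minimal sketch of deformable attention along the temporal axis only. It is an illustration under stated assumptions, not the authors' implementation. The function names (`interpolate_frame`, `deformable_temporal_attention`) are hypothetical. The offsets and attention scores are passed in directly, whereas a real model would typically predict them from the query with learned projections. The point it shows is that each query reads only a handful of interpolated positions instead of attending to every frame.

```python
import math
from typing import List, Sequence


def interpolate_frame(features: Sequence[Sequence[float]], t: float) -> List[float]:
    """Linearly interpolate a feature vector at a fractional frame index t."""
    # Clamp the sampling position to the valid frame range.
    t = min(max(t, 0.0), len(features) - 1)
    lo = int(math.floor(t))
    hi = min(lo + 1, len(features) - 1)
    frac = t - lo
    return [(1 - frac) * a + frac * b for a, b in zip(features[lo], features[hi])]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_temporal_attention(
    features: Sequence[Sequence[float]],  # dense per-frame features, shape [T][D]
    reference: float,                     # reference frame index for this query
    offsets: Sequence[float],             # assumed given; normally predicted from the query
    scores: Sequence[float],              # assumed given; normally predicted from the query
) -> List[float]:
    """Aggregate a few sampled frames instead of attending to all T frames."""
    weights = softmax(scores)
    out = [0.0] * len(features[0])
    for off, w in zip(offsets, weights):
        sampled = interpolate_frame(features, reference + off)
        for d, value in enumerate(sampled):
            out[d] += w * value
    return out
```

With K sampling points per query, the cost per query in this sketch grows with K rather than with the number of frames T, which is the efficiency argument the abstract makes for long videos.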
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
