Dense but Efficient VideoQA for Intricate Compositional Reasoning

Jihyeon Lee; Wooyoung Kang; Eun-Sol Kim

arXiv:2210.10300·cs.CV·October 20, 2022

TL;DR

This paper introduces a transformer-based VideoQA model with deformable attention that efficiently handles complex, compositional questions over long videos by focusing on informative features and understanding question relations.

Contribution

The paper proposes a dense yet efficient VideoQA method that combines deformable attention with dependency-aware language modeling for complex reasoning tasks.

Findings

01. Outperforms baseline models on complex VideoQA datasets
02. Efficiently samples informative visual features over long video sequences
03. Effectively models relations within complex question sentences
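
The third finding concerns relations among question words. The page does not say how the dependency structure is combined with the language embeddings, so the following is only a minimal sketch under stated assumptions: the parse (head indices and relation labels) is hand-written rather than produced by a parser, the relation vectors in `label_vecs` are made-up toy values, and the two helper functions are hypothetical names, not the authors' code. It shows two common ways such structure can be injected: adding a relation-label vector to each token embedding, and building an adjacency mask from head links.

```python
def dependency_aware_embeddings(token_vecs, labels, label_vecs):
    """Add each token's dependency-relation vector to its embedding (assumed scheme)."""
    return [
        [t + r for t, r in zip(vec, label_vecs[label])]
        for vec, label in zip(token_vecs, labels)
    ]


def dependency_adjacency(heads):
    """Build a symmetric adjacency mask from head indices (-1 marks the root)."""
    n = len(heads)
    # Every token is connected to itself.
    adj = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child][head] = 1
            adj[head][child] = 1
    return adj


# Hand-written parse of a toy question (illustrative only).
tokens = ["who", "opened", "the", "door"]
heads = [1, -1, 3, 1]
labels = ["nsubj", "root", "det", "obj"]

# Toy 2-dimensional embeddings; real models would use learned vectors.
token_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
label_vecs = {
    "nsubj": [0.1, 0.0],
    "root": [0.0, 0.0],
    "det": [0.0, 0.1],
    "obj": [0.1, 0.1],
}

combined = dependency_aware_embeddings(token_vecs, labels, label_vecs)
adj = dependency_adjacency(heads)
```

Here `adj` could serve as an attention mask that restricts each word to its syntactic neighbours, while `combined` carries the relation type in the embedding itself; which of these, if either, matches the paper is not stated on this page.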

Abstract

Most conventional video question answering (VideoQA) datasets consist of easy questions that require only simple reasoning. However, long videos inevitably contain complex and compositional semantic structures along the spatio-temporal axes, which requires a model to understand the compositional structures inherent in the videos. In this paper, we propose a new compositional VideoQA method based on a transformer architecture with a deformable attention mechanism to address complex VideoQA tasks. Deformable attention is introduced to sample a subset of informative visual features from the dense visual feature map, so that a temporally long range of frames is covered efficiently. Furthermore, the dependency structure within complex question sentences is combined with the language embeddings to capture the relations among question words.…
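
The sampling idea in the abstract can be illustrated with a minimal one-dimensional sketch along the time axis. This is an illustration of the general deformable-attention technique, not the authors' implementation: in a real model the offsets and attention logits would be predicted from the query by learned layers, whereas here they are passed in directly, and the function names are hypothetical. Each query attends to only a few fractional frame positions around a reference point, and features at those positions are obtained by linear interpolation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_linear(feats, pos):
    """Linearly interpolate a feature vector at fractional frame index pos."""
    last = len(feats) - 1
    pos = min(max(pos, 0.0), float(last))  # clamp to the valid frame range
    lo = int(math.floor(pos))
    hi = min(lo + 1, last)
    frac = pos - lo
    return [(1.0 - frac) * a + frac * b for a, b in zip(feats[lo], feats[hi])]


def deformable_attention_1d(feats, ref, offsets, logits):
    """Weighted sum of features sampled at ref + offset for each sampling point."""
    weights = softmax(logits)
    out = [0.0] * len(feats[0])
    for off, w in zip(offsets, weights):
        sampled = sample_linear(feats, ref + off)
        for i, value in enumerate(sampled):
            out[i] += w * value
    return out


# Toy "video": 8 frames, each with the feature [frame_index, 1.0].
feats = [[float(t), 1.0] for t in range(8)]

# Three sampling points around frame 3 with equal attention weights.
out = deformable_attention_1d(
    feats, ref=3.0, offsets=[-1.5, 0.0, 2.25], logits=[0.0, 0.0, 0.0]
)
```

With equal weights the first output component is approximately the mean of the sampled positions 1.5, 3.0, and 5.25, that is, about 3.25. The cost per query depends on the number of sampling points rather than on the number of frames, which is why this style of attention scales to long videos.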

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition