TL;DR
This paper introduces HTreeMN, a syntax-aware attention model that builds on the parse tree of the question sentence and applies hierarchical attention to improve video question answering, especially on complex questions.
Contribution
The paper proposes the HTreeMN model that incorporates sentence parse trees into attention mechanisms, enhancing understanding of complex questions in video QA tasks.
Findings
Outperforms existing attention models on complex questions
Effective utilization of sentence parse trees improves accuracy
Hierarchical attention distills relevant features efficiently
Abstract
We propose a new attention model for video question answering. The main idea of attention models is to locate the most informative parts of the visual data. Attention mechanisms are quite popular these days. However, most existing visual attention mechanisms regard the question as a whole. They ignore word-level semantics, where each word can have a different attention and some words need no attention. Nor do they consider the semantic structure of the sentences. Although the Extended Soft Attention (E-SA) model for video question answering leverages word-level attention, it performs poorly on long question sentences. In this paper, we propose the heterogeneous tree-structured memory network (HTreeMN) for video question answering. Our proposed approach is based upon the syntax parse trees of the question sentences. The HTreeMN treats the words differently where the…
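The general idea described in the abstract, word-level attention combined bottom-up along a parse tree, can be illustrated with a small sketch. This is not the HTreeMN implementation: the node layout, the `needs_attention` flag, the dot-product scoring, and the averaging at internal nodes are all assumptions chosen for illustration, and the toy vectors are made up.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, frames):
    # Soft attention: weight each frame feature by its dot product with the query.
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]

def encode(node, frames):
    # Encode a parse-tree node bottom-up.
    # Hypothetical node format: ("leaf", word_vector, needs_attention)
    # or ("node", [children]).
    if node[0] == "leaf":
        _, vec, needs_attention = node
        # Assumption: only some words attend to the video; others pass through.
        return attend(vec, frames) if needs_attention else list(vec)
    child_vecs = [encode(child, frames) for child in node[1]]
    dim = len(child_vecs[0])
    # Assumption: a parent simply averages its children.
    return [sum(v[i] for v in child_vecs) / len(child_vecs) for i in range(dim)]

# Toy data: two frame features and a two-word "question".
frames = [[1.0, 0.0], [0.0, 1.0]]
tree = ("node", [("leaf", [5.0, 0.0], True), ("leaf", [0.5, 0.5], False)])
root = encode(tree, frames)
```

In this toy setup the first word attends mostly to the first frame, while the second word contributes its own vector unchanged, so the root representation leans toward the first frame.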
Taxonomy
Methods: Memory Network
