High-Order Attention Models for Visual Question Answering
Idan Schwartz, Alexander G. Schwing, Tamir Hazan

TL;DR
This paper introduces an attention mechanism that learns high-order correlations between data modalities, such as visual and textual input, and achieves state-of-the-art performance on the standard VQA dataset.
Contribution
It proposes a new high-order attention model that effectively learns complex cross-modal correlations for VQA, advancing the state-of-the-art.
Findings
Achieved state-of-the-art performance on the standard VQA dataset, improving accuracy over existing models.
Demonstrated that high-order correlations direct attention to the relevant elements in each data modality.
Abstract
The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.
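The idea of letting cross-modal correlations steer attention can be illustrated with a minimal second-order (pairwise) sketch. This is not the authors' implementation: the function name `pairwise_attention`, the projection matrices `W_v` and `W_q`, the unary weight vectors `w_v` and `w_q`, the mean-based marginalization, and all shapes are assumptions chosen for illustration. Each modality's attention is a softmax over a per-element score plus the correlation of that element with every element of the other modality.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def pairwise_attention(V, Q, W_v, W_q, w_v, w_q):
    """Hypothetical second-order attention over two modalities.

    V: (n_v, d) image-region features, Q: (n_q, d) question-word features.
    W_v, W_q: (d, k) projections used for the pairwise correlation (assumed).
    w_v, w_q: (d,) weights for the per-element (unary) scores (assumed).
    """
    # Unary scores: how relevant each element is on its own.
    unary_v = np.tanh(V) @ w_v                      # (n_v,)
    unary_q = np.tanh(Q) @ w_q                      # (n_q,)

    # Pairwise scores: correlation of every region with every word.
    corr = (V @ W_v) @ (Q @ W_q).T                  # (n_v, n_q)

    # Marginalize over the other modality (a simple mean, by assumption).
    pair_v = corr.mean(axis=1)                      # (n_v,)
    pair_q = corr.mean(axis=0)                      # (n_q,)

    # Attention distributions for each modality.
    a_v = softmax(unary_v + pair_v)
    a_q = softmax(unary_q + pair_q)

    # Attended feature vectors: attention-weighted sums of the inputs.
    return a_v, a_q, a_v @ V, a_q @ Q

# Usage with random placeholder features (not real VQA data).
rng = np.random.default_rng(0)
n_v, n_q, d, k = 6, 4, 8, 5
V = rng.normal(size=(n_v, d))
Q = rng.normal(size=(n_q, d))
a_v, a_q, v_att, q_att = pairwise_attention(
    V, Q,
    rng.normal(size=(d, k)), rng.normal(size=(d, k)),
    rng.normal(size=d), rng.normal(size=d),
)
```

A third-order variant would add a further modality (for example, candidate answers) and a correlation term over triplets; the sketch stops at pairs to keep the shapes easy to follow.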
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Factor Graph Attention
