Generalized Hadamard-Product Fusion Operators for Visual Question Answering
Brendan Duke, Graham W. Taylor

TL;DR
This paper introduces a generalized class of multimodal fusion operators for visual question answering, demonstrating that specific instantiations improve accuracy and suggesting potential for discovering even better operators through architecture search.
Contribution
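The paper proposes a generalized framework for Hadamard-product-based multimodal fusion operators in VQA. It introduces Nonlinearity Ensembling, Feature Gating, and post-fusion neural network layers as fusion operator components, and shows that they improve accuracy over baseline fusion operators given the same input features.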
Findings
Achieved an absolute improvement of 1.1 percentage points in VQA 2.0 test-dev accuracy over baseline fusion operators that use the same input features.
Identified that specific instantiations outperform baseline fusion methods.
Proposed a generalized fusion operator class as a search space for future architecture optimization.
Abstract
We propose a generalized class of multimodal fusion operators for the task of visual question answering (VQA). We identify generalizations of existing multimodal fusion operators based on the Hadamard product, and show that specific non-trivial instantiations of this generalized fusion operator exhibit superior performance in terms of OpenEnded accuracy on the VQA task. In particular, we introduce Nonlinearity Ensembling, Feature Gating, and post-fusion neural network layers as fusion operator components, culminating in an absolute improvement of 1.1 percentage points on the VQA 2.0 test-dev set over baseline fusion operators, which use the same features as input. We use our findings as evidence that our generalized class of fusion operators could lead to the discovery of even superior task-specific operators when used as a search space in an architecture search over fusion operators.
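To make the idea of a Hadamard-product fusion operator concrete, the following is a minimal NumPy sketch, not the authors' implementation. The baseline function shows the standard pattern of projecting the image and question features into a shared space and multiplying them elementwise. The way Nonlinearity Ensembling (averaging the fused output over several nonlinearities), Feature Gating (a sigmoid gate computed from the question), and the post-fusion layer are combined here is an assumption made for illustration, since the abstract does not define them. All names (`hadamard_fusion`, `generalized_fusion`, `W_v`, `W_q`, `W_g`, `W_post`) and dimensions are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(x, 0.0)


def hadamard_fusion(v, q, W_v, W_q, nonlinearity=np.tanh):
    """Baseline fusion: project each modality, apply a nonlinearity,
    then combine with an elementwise (Hadamard) product."""
    return nonlinearity(W_v @ v) * nonlinearity(W_q @ q)


def generalized_fusion(v, q, params, nonlinearities=(np.tanh, relu)):
    """Illustrative generalization; each step below is an assumption."""
    # Assumed form of nonlinearity ensembling: average the Hadamard
    # fusion output over several choices of nonlinearity.
    fused = np.mean(
        [hadamard_fusion(v, q, params["W_v"], params["W_q"], f)
         for f in nonlinearities],
        axis=0,
    )
    # Assumed form of feature gating: a sigmoid gate computed from the
    # question features scales each fused feature.
    gate = sigmoid(params["W_g"] @ q)
    gated = gate * fused
    # Post-fusion layer: a single linear layer followed by ReLU.
    return relu(params["W_post"] @ gated)


# Toy dimensions and random weights, chosen only for demonstration.
rng = np.random.default_rng(0)
d_v, d_q, d = 8, 6, 4
params = {
    "W_v": rng.standard_normal((d, d_v)),
    "W_q": rng.standard_normal((d, d_q)),
    "W_g": rng.standard_normal((d, d_q)),
    "W_post": rng.standard_normal((d, d)),
}
v = rng.standard_normal(d_v)  # stand-in for image features
q = rng.standard_normal(d_q)  # stand-in for question features

baseline = hadamard_fusion(v, q, params["W_v"], params["W_q"])
fused = generalized_fusion(v, q, params)
print(baseline.shape, fused.shape)
```

Each component in this sketch can be switched on or off or swapped independently, which is the sense in which such a family of operators could serve as a search space for architecture search.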
