Visual Question Reasoning on General Dependency Tree
Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, Liang Lin

TL;DR
This paper introduces the Adversarial Composition Modular Network (ACMN), an interpretable Visual Question Answering (VQA) model that uses an adversarial attention module and a residual composition module to perform global reasoning over the question's dependency parse tree, improving reasoning accuracy and explainability.
Contribution
The paper proposes ACMN, a new reasoning network that leverages dependency trees and adversarial attention for better interpretability and reasoning in VQA tasks, reducing reliance on annotations.
Findings
ACMN outperforms existing models on relational datasets.
The model provides interpretable reasoning visualizations.
ACMN effectively combines local evidence for global reasoning.
Abstract
Collaborative reasoning over each image-question pair is critical but under-explored for interpretable Visual Question Answering (VQA) systems. Although recent works have tried explicit compositional processes to assemble the multiple sub-tasks embedded in a question, their models rely heavily on annotations or hand-crafted rules to obtain a valid reasoning layout, leading to either heavy labor or poor performance on compositional reasoning. In this paper, to enable global context reasoning that better aligns the image and language domains in diverse and unrestricted cases, we propose a novel reasoning network called Adversarial Composition Modular Network (ACMN). This network comprises two collaborative modules: i) an adversarial attention module to exploit the local visual evidence for each word parsed from the question; ii) a residual composition…
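The abstract's idea of gathering local visual evidence per word and combining it along the dependency tree can be illustrated with a toy bottom-up traversal. This is a minimal sketch under stated assumptions, not the authors' implementation: the adversarial training of the attention module is omitted (plain dot-product softmax attention is used instead), `word_vector` is a hypothetical stand-in for a learned embedding, the composition is a simple residual-style sum, and the parse tree and region features are hand-written for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One word of the question and its dependents in the parse tree."""
    word: str
    children: List["Node"] = field(default_factory=list)


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_vector(word: str, dim: int) -> List[float]:
    # Hypothetical stand-in for a learned word embedding (deterministic toy values).
    seed = sum(map(ord, word))
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(dim)]


def attend(word: str, regions: List[List[float]]) -> List[float]:
    # Local attention over image regions: dot-product scores, then softmax.
    # Assumption: this replaces the paper's adversarial attention module.
    q = word_vector(word, len(regions[0]))
    scores = [sum(a * b for a, b in zip(q, r)) for r in regions]
    return softmax(scores)


def reason(node: Node, regions: List[List[float]],
           trace: List[Tuple[str, List[float]]]) -> List[float]:
    # Post-order traversal: children are processed before their parent.
    child_hidden = [reason(c, regions, trace) for c in node.children]
    attn = attend(node.word, regions)
    dim = len(regions[0])
    # Attention-weighted sum of region features for this word.
    local = [sum(w * r[i] for w, r in zip(attn, regions)) for i in range(dim)]
    # Residual-style composition: local evidence plus the children's states.
    hidden = [local[i] + sum(h[i] for h in child_hidden) for i in range(dim)]
    trace.append((node.word, attn))  # attention maps kept for inspection
    return hidden


# Hand-written (assumed) parse of "what color is the cube left of the sphere".
tree = Node("color", [Node("cube", [Node("left", [Node("sphere")])])])
regions = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # two toy region feature vectors
trace: List[Tuple[str, List[float]]] = []
hidden = reason(tree, regions, trace)
print([w for w, _ in trace])
```

The `trace` list records one attention map per word in the order the nodes were visited, which is the kind of intermediate output that could be visualized to inspect the reasoning path.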
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
