REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering
Siwen Luo, Soyeon Caren Han, Kaiyuan Sun, Josiah Poon

TL;DR
REXUP is a deep reasoning model for visual question answering that uses structured visual and textual information to capture step-by-step reasoning and complex object relationships. It reports 92.7% validation accuracy on the GQA dataset, outperforming previous methods.
Contribution
The paper introduces REXUP, a deep reasoning VQA model that combines explicit visual structure-aware textual information with a dual-branch architecture (image object-oriented and scene graph-oriented), advancing the state of the art.
Findings
Achieves 92.7% validation accuracy on the GQA dataset
Outperforms previous state-of-the-art methods on GQA
Demonstrates effectiveness through extensive ablation studies
Abstract
Visual question answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of both images and questions, but also the sound perception of a step-by-step reasoning process that would lead to the correct answer. So far, most successful attempts in VQA have focused on only one aspect: either the interaction of visual pixel features of images and word features of questions, or the reasoning process of answering the question in an image with simple objects. In this paper, we propose a deep reasoning VQA model with explicit visual structure-aware textual information, which works well in capturing a step-by-step reasoning process and detecting complex object relationships in photo-realistic images. The REXUP network consists of two branches, image object-oriented and scene graph-oriented, which jointly work with super-diagonal fusion compositional…
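The two-branch, step-by-step structure described in the abstract can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`attend`, `run_branch`, `answer_features`), the fixed 50/50 memory update, and the use of plain dot-product attention are all hypothetical, and the super-diagonal fusion mentioned in the abstract is not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, items):
    """Soft attention: weight each item by softmax(query . item), then sum."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    return [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]

def run_branch(question_words, knowledge, steps=3):
    """One branch: repeat reason -> extract -> update for a fixed number of steps."""
    memory = [0.0] * len(question_words[0])
    for _ in range(steps):
        # Reason: choose which part of the question to focus on, guided by memory.
        control = attend(memory, question_words)
        # Extract: retrieve knowledge relevant to that focus (plain attention,
        # an assumption; the paper's fusion operator is not modelled).
        retrieved = attend(control, knowledge)
        # Update: blend retrieved information into memory (fixed mix, an assumption).
        memory = [0.5 * m + 0.5 * r for m, r in zip(memory, retrieved)]
    return memory

def answer_features(question_words, object_feats, scene_graph_feats, steps=3):
    """Run the object branch and the scene-graph branch, then concatenate."""
    obj_mem = run_branch(question_words, object_feats, steps)
    sg_mem = run_branch(question_words, scene_graph_feats, steps)
    return obj_mem + sg_mem

# Toy 2-dimensional inputs (made-up values, for illustration only).
question = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scene_graph = [[0.5, 0.5], [1.0, 0.0]]
features = answer_features(question, objects, scene_graph)
print(len(features))  # 4: two memories of dimension 2, concatenated
```

In a real model the concatenated features would feed an answer classifier, and each step would use learned projections rather than raw dot products.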
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
