REX: Reasoning-aware and Grounded Explanation
Shi Chen, Qi Zhao

TL;DR
This paper introduces REX, a reasoning-aware and grounded explanation framework for visual reasoning, which generates multi-modal explanations by traversing reasoning steps and grounding keywords, improving interpretability and reasoning accuracy.
Contribution
The paper proposes a novel multi-modal explanation method that models word-region correspondence and constructs a large dataset of explanations, advancing interpretability in visual reasoning.
Findings
Enhanced visual grounding capability
Improved interpretability and reasoning performance
Effective under multi-task and transfer learning settings
Abstract
Effectiveness and interpretability are two essential properties for trustworthy AI systems. Most recent studies in visual reasoning are dedicated to improving the accuracy of predicted answers, and less attention is paid to explaining the rationales behind the decisions. As a result, these models commonly exploit spurious biases instead of actually reasoning on the visual-textual data, and have yet to develop the capability to explain their decision making by considering key information from both modalities. This paper aims to close the gap from three distinct perspectives: first, we define a new type of multi-modal explanations that explain the decisions by progressively traversing the reasoning process and grounding keywords in the images. We develop a functional program to sequentially execute different reasoning steps and construct a new dataset with 1,040,830 multi-modal…
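As a rough illustration of the idea in the abstract (traversing reasoning steps in order and grounding keywords in image regions), the following is a minimal sketch, not the authors' implementation. The scene, the operation names (`select`, `relate`, `query`), the box format, and the `explain` helper are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    box: tuple  # hypothetical (x, y, w, h) in pixels


# Toy scene (assumed for illustration): objects, one relation, one attribute.
SCENE = {
    "o1": Region("table", (40, 120, 200, 90)),
    "o2": Region("cup", (110, 80, 30, 40)),
}
RELATIONS = {("o2", "on", "o1")}  # (subject, relation, object)
ATTRIBUTES = {"o2": {"color": "red"}}

# Hypothetical functional program for "What color is the cup on the table?"
PROGRAM = [
    ("select", "table"),
    ("relate", "cup", "on"),
    ("query", "color"),
]


def explain(program, scene, relations, attributes):
    """Execute steps in order, building a textual explanation and a
    keyword -> bounding-box grounding along the way."""
    current, text, answer = None, "", None
    grounding = {}
    for step in program:
        op = step[0]
        if op == "select":
            # Pick the object with the requested name.
            current = next(i for i, r in scene.items() if r.name == step[1])
            text = f"the {scene[current].name}"
            grounding[scene[current].name] = scene[current].box
        elif op == "relate":
            # Move to the object that stands in the relation to the current one.
            _, name, rel = step
            current = next(
                s for (s, r, o) in relations
                if r == rel and o == current and scene[s].name == name
            )
            text = f"the {name} {rel} {text}"
            grounding[name] = scene[current].box
        elif op == "query":
            # Read the requested attribute off the current object.
            answer = attributes[current][step[1]]
            text = f"{text} is {answer}"
    return text, grounding, answer


text, grounding, answer = explain(PROGRAM, SCENE, RELATIONS, ATTRIBUTES)
print(text)
print(grounding)
```

In this toy setup each keyword in the explanation ("table", "cup") is paired with the region it refers to, which is the kind of word-region correspondence the summary describes; a learned model would predict these pairings rather than read them from a hand-written scene.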
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Video Analysis and Summarization
