TL;DR
This paper introduces visual reasoning primitives that enhance interpretability and achieve state-of-the-art accuracy in visual question answering, bridging the gap between model transparency and high performance.
Contribution
The authors propose a set of composable visual reasoning primitives that improve interpretability while maintaining high accuracy in complex visual reasoning tasks.
Findings
Achieved 99.1% accuracy on the CLEVR dataset.
Improved generalization on CoGenT by more than 20 percentage points.
Enabled diagnosis of model strengths and weaknesses through primitive outputs.
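To make the idea of composing primitives and inspecting their outputs concrete, here is a minimal toy sketch. It is not the authors' implementation: the paper's primitives are learned neural modules operating on image features, whereas this sketch uses hand-written symbolic stand-ins over a list of object descriptions. The names `attend`, `run_program`, `count`, and the example scene are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: symbolic stand-ins for learned neural primitives.
from typing import Callable, Dict, List, Tuple

Scene = List[Dict[str, str]]  # one attribute dict per object in the scene
Mask = List[float]            # one attention weight per object


def attend(attribute: str, value: str) -> Callable[[Scene, Mask], Mask]:
    """Build a primitive that keeps attention only on matching objects."""
    def primitive(scene: Scene, mask: Mask) -> Mask:
        return [m if obj[attribute] == value else 0.0
                for obj, m in zip(scene, mask)]
    primitive.__name__ = f"attend[{attribute}={value}]"
    return primitive


def run_program(scene: Scene,
                primitives: List[Callable[[Scene, Mask], Mask]]
                ) -> Tuple[Mask, List[Tuple[str, Mask]]]:
    """Apply primitives in order, recording every intermediate mask."""
    mask: Mask = [1.0] * len(scene)  # start by attending to everything
    trace = [("input", mask)]
    for primitive in primitives:
        mask = primitive(scene, mask)
        trace.append((primitive.__name__, mask))
    return mask, trace


def count(mask: Mask) -> int:
    """Answer a counting question from the final attention mask."""
    return sum(1 for m in mask if m > 0.5)


# Hypothetical scene and question: "How many red cubes are there?"
scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]
program = [attend("color", "red"), attend("shape", "cube")]

final_mask, trace = run_program(scene, program)
answer = count(final_mask)

# Each step's output can be inspected to see where the reasoning went.
for name, mask in trace:
    print(name, mask)
print("answer:", answer)
```

Because every step returns an attention mask over the same objects, a wrong answer can be traced to the first step whose mask looks wrong, which is the kind of diagnosis the findings describe.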
Abstract
Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs…
Taxonomy
Methods, Interpretability
