Interpretable Neural Computation for Real-World Compositional Visual Question Answering
Ruixue Tang, Chao Ma

TL;DR
This paper introduces an interpretable framework for real-world compositional visual question answering that combines explicit reasoning with deep learning, outperforming prior methods on the GQA benchmark.
Contribution
It proposes a novel hybrid approach that integrates symbolic program execution with pre-trained encoders for improved interpretability and accuracy in VQA.
Findings
Outperforms previous compositional models on GQA
Achieves accuracy competitive with monolithic models
Surpasses other methods on the validity, plausibility, and distribution metrics
Abstract
There are two main lines of research on visual question answering (VQA): compositional models with explicit multi-hop reasoning, and monolithic networks with implicit reasoning in the latent feature space. The former excel in interpretability and compositionality but fail on real-world images, while the latter usually achieve better performance due to model flexibility and parameter efficiency. We aim to combine the two to build an interpretable framework for real-world compositional VQA. In our framework, images and questions are disentangled into scene graphs and programs, and a symbolic program executor runs on them with full transparency to select the attention regions, which are then iteratively passed to a visual-linguistic pre-trained encoder to predict answers. Experiments conducted on the GQA benchmark demonstrate that our framework outperforms prior compositional methods and…
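To make the executor step in the abstract concrete, here is a minimal sketch of a symbolic program running over a scene graph and recording which objects it selects at each step. It is an illustration under stated assumptions, not the authors' implementation: the operation names (select, filter, relate), the SceneObject and SceneGraph classes, and the toy scene are all hypothetical, and the step that passes the selected regions to the pre-trained encoder is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    # Hypothetical node type: a detected object with attributes and a region.
    name: str
    attributes: set = field(default_factory=set)
    box: tuple = (0, 0, 0, 0)  # (x, y, w, h) region in the image


@dataclass
class SceneGraph:
    objects: dict    # object id -> SceneObject
    relations: list  # (subject id, predicate, object id) triples


def execute(program, graph):
    """Run each program step and record the object ids it selects."""
    selected = set(graph.objects)
    trace = []
    for op, arg in program:
        if op == "select":
            # Pick every object with the given name.
            selected = {i for i, o in graph.objects.items() if o.name == arg}
        elif op == "filter":
            # Keep only currently selected objects that carry the attribute.
            selected = {i for i in selected if arg in graph.objects[i].attributes}
        elif op == "relate":
            # Move to subjects that stand in relation `arg` to a selected object.
            selected = {s for s, p, o in graph.relations if p == arg and o in selected}
        else:
            raise ValueError(f"unknown op: {op}")
        trace.append((op, arg, sorted(selected)))
    return trace


# Toy scene for "What is on the wooden table?" (assumed data, not from GQA).
graph = SceneGraph(
    objects={
        0: SceneObject("table", {"wooden"}, (10, 80, 200, 60)),
        1: SceneObject("cup", {"red"}, (60, 50, 20, 30)),
        2: SceneObject("cup", {"blue"}, (150, 150, 20, 30)),
    },
    relations=[(1, "on", 0), (2, "under", 0)],
)
program = [("select", "table"), ("filter", "wooden"), ("relate", "on")]

trace = execute(program, graph)
for step in trace:
    print(step)

# Boxes of the final selection: the regions an answer model would attend to.
regions = [graph.objects[i].box for i in trace[-1][2]]
```

Because every intermediate selection is kept in the trace, each reasoning step can be inspected on its own, which is the kind of transparency the abstract attributes to the symbolic executor.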
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Interpretability
