# Explainable and Explicit Visual Reasoning over Scene Graphs

**Authors:** Jiaxin Shi, Hanwang Zhang, Juanzi Li

arXiv: 1812.01855 · 2019-03-20

## TL;DR

This paper introduces eXplainable and eXplicit Neural Modules (XNMs), a neural module framework that reasons over scene graphs. With only 4 module meta-types, XNMs reduce the number of parameters by 10 to 100 times and let the reasoning flow be traced explicitly through graph attentions.

## Contribution

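The paper proposes XNMs, a concise set of 4 module meta-types that operate on scene graphs (objects as nodes, pairwise relationships as edges), so that each reasoning step can be traced as a graph attention and the model needs far fewer parameters.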

## Key findings

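- Achieves 100% accuracy on both CLEVR and CLEVR CoGenT when scene graphs are detected perfectly
- Achieves a competitive 67.5% accuracy on VQAv2.0 with scene graphs noisily detected from real-world images, surpassing bag-of-objects attention models without graph structures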
- Reduces model parameters by 10 to 100 times compared to previous methods

## Abstract

We aim to dismantle the prevalent black-box neural architectures used in complex visual reasoning tasks into the proposed eXplainable and eXplicit Neural Modules (XNMs), which advance beyond existing neural module networks towards using scene graphs (objects as nodes and the pairwise relationships as edges) for explainable and explicit reasoning with structured knowledge. XNMs let us focus on teaching machines how to "think", regardless of what they "look" at. As we will show in the paper, by using scene graphs as an inductive bias, 1) we can design XNMs in a concise and flexible fashion, i.e., XNMs merely consist of 4 meta-types, which significantly reduces the number of parameters by 10 to 100 times, and 2) we can explicitly trace the reasoning flow in terms of graph attentions. XNMs are so generic that they support a wide range of scene graph implementations of varying quality. For example, when the graphs are detected perfectly, XNMs achieve 100% accuracy on both CLEVR and CLEVR CoGenT, establishing an empirical performance upper bound for visual reasoning; when the graphs are noisily detected from real-world images, XNMs remain robust and achieve a competitive 67.5% accuracy on VQAv2.0, surpassing the popular bag-of-objects attention models without graph structures.
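To make the idea of tracing a reasoning flow through graph attentions more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the `SceneGraph` class, the function names (`attend_node`, `transfer`, `logic_and`), the hard 0/1 attention, and the example scene are all assumptions chosen for illustration, whereas the paper's modules are learned and operate on feature vectors. The sketch only shows how a scene graph (objects as nodes, relationships as edges) lets each step of a composed question be recorded as an attention over nodes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph: object labels as nodes, labeled directed edges."""
    nodes: List[str]                    # one label per object
    edges: Dict[Tuple[int, int], str]   # (subject index, object index) -> relation


def attend_node(graph: SceneGraph, label: str) -> List[float]:
    """Hypothetical hard attention: weight 1.0 on nodes matching the label."""
    return [1.0 if n == label else 0.0 for n in graph.nodes]


def transfer(graph: SceneGraph, node_att: List[float], relation: str) -> List[float]:
    """Move attention from each attended subject along edges with this relation."""
    out = [0.0] * len(graph.nodes)
    for (src, dst), rel in graph.edges.items():
        if rel == relation:
            out[dst] = max(out[dst], node_att[src])
    return out


def logic_and(a: List[float], b: List[float]) -> List[float]:
    """Element-wise AND of two attentions, taken here as the minimum."""
    return [min(x, y) for x, y in zip(a, b)]


# Assumed example scene: cube is left of sphere, sphere is left of cylinder.
graph = SceneGraph(
    nodes=["cube", "sphere", "cylinder"],
    edges={(0, 1): "left_of", (1, 2): "left_of"},
)

# "What is the cube to the left of?" as two steps, keeping every attention.
trace = []
att = attend_node(graph, "cube")
trace.append(("attend_node[cube]", att))
att = transfer(graph, att, "left_of")
trace.append(("transfer[left_of]", att))

for step, weights in trace:
    print(step, weights)
```

Because every intermediate result is an attention over the same node set, the `trace` list can be inspected step by step, which is the sense in which the reasoning flow is explicit in this toy version.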

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.01855/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1812.01855/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1812.01855/full.md

---
Source: https://tomesphere.com/paper/1812.01855