# Differentiable Scene Graphs

**Authors:** Moshiko Raboh, Roei Herzig, Gal Chechik, Jonathan Berant, Amir Globerson

arXiv: 1902.10200 · 2020-03-17

## TL;DR

This paper introduces Differentiable Scene Graphs (DSGs), a dense, end-to-end trainable image representation that can be trained jointly with a downstream visual reasoning task, such as identifying referring relationships, using supervision from that task alone.

## Contribution

The paper proposes DSGs, a differentiable scene graph representation that allows end-to-end training for visual reasoning tasks, overcoming the non-differentiability of traditional scene graphs.

## Key findings

- Achieves state-of-the-art results on referring relationship identification.
- Evaluated on three benchmark datasets: Visual Genome, VRD, and CLEVR.
- Enables joint training with downstream tasks for improved performance.

## Abstract

Reasoning about complex visual scenes involves perception of entities and their relations. Scene graphs provide a natural representation for reasoning tasks, by assigning labels to both entities (nodes) and relations (edges). Unfortunately, reasoning systems based on SGs are typically trained in a two-step procedure: First, training a model to predict SGs from images; Then, a separate model is created to reason based on predicted SGs. In many domains, it is preferable to train systems jointly in an end-to-end manner, but SGs are not commonly used as intermediate components in visual reasoning systems because being discrete and sparse, scene-graph representations are non-differentiable and difficult to optimize. Here we propose Differentiable Scene Graphs (DSGs), an image representation that is amenable to differentiable end-to-end optimization, and requires supervision only from the downstream tasks. DSGs provide a dense representation for all regions and pairs of regions, and do not spend modelling capacity on areas of the images that do not contain objects or relations of interest. We evaluate our model on the challenging task of identifying referring relationships (RR) in three benchmark datasets, Visual Genome, VRD and CLEVR. We describe a multi-task objective, and train in an end-to-end manner supervised by the downstream RR task. Using DSGs as an intermediate representation leads to new state-of-the-art performance.
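To make the abstract's notion of a dense, differentiable representation concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's architecture: the function names `dense_scene_graph` and `soft_select`, the concatenation used for pair descriptors, and the dot-product attention are all hypothetical choices made for this example. It shows two ideas only: every region and every ordered pair of regions gets a descriptor, and selection is done with a smooth softmax rather than a discrete label lookup.

```python
import math
from itertools import product


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_scene_graph(region_feats):
    """Build a dense graph: one descriptor per region and per ordered pair.

    region_feats: list of N feature vectors (lists of floats).
    Returns (nodes, edges), where edges[(i, j)] is a hypothetical pairwise
    descriptor, here simply the concatenation of the two region vectors.
    """
    nodes = [list(f) for f in region_feats]
    edges = {
        (i, j): nodes[i] + nodes[j]
        for i, j in product(range(len(nodes)), repeat=2)
        if i != j
    }
    return nodes, edges


def soft_select(nodes, query):
    """Soft attention over regions: a smooth stand-in for a hard label lookup.

    Assumption: relevance is a dot product between each node and the query.
    """
    scores = [sum(a * b for a, b in zip(n, query)) for n in nodes]
    return softmax(scores)


# Three hypothetical region descriptors of dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nodes, edges = dense_scene_graph(regions)
weights = soft_select(nodes, query=[2.0, 0.0])
```

Because `soft_select` returns a probability distribution instead of a single hard choice, a loss computed on a downstream task could in principle propagate gradients back to the region descriptors, which is the property a discrete scene graph lacks.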

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.10200/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1902.10200/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/1902.10200/full.md

---
Source: https://tomesphere.com/paper/1902.10200