TL;DR
This paper analyzes the structure of scene graphs for visual scenes, identifies recurring motifs (regularly appearing substructures), and introduces a new architecture that exploits these motifs to improve scene graph parsing accuracy.
Contribution
The paper provides new quantitative insights into motif patterns in scene graphs from the Visual Genome dataset and proposes Stacked Motif Networks to better capture higher-order motifs, improving parsing performance.
Findings
Object labels are highly predictive of relation labels, but not vice versa.
More than 50% of scene graphs contain motifs involving at least two relations.
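Stacked Motif Networks improve on the paper's frequency baseline by an average relative gain of 7.1%; the baseline itself improves on the previous state of the art by an average of 3.6% relative.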
Abstract
We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels but not vice-versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state-of-the-art by an average of 3.6% relative improvement across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture…
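The frequency baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(subject, relation, object)` triple format, the `no_relation` fallback, and the toy data are all assumptions made for illustration. The sketch counts, for each ordered pair of object labels in the training set, how often each relation occurs, then predicts the most frequent one.

```python
from collections import Counter, defaultdict


def fit_frequency_baseline(training_triples):
    """Count relation occurrences for every ordered (subject, object) label pair.

    `training_triples` is a hypothetical iterable of
    (subject_label, relation_label, object_label) tuples.
    """
    counts = defaultdict(Counter)
    for subject_label, relation_label, object_label in training_triples:
        counts[(subject_label, object_label)][relation_label] += 1
    return counts


def predict_relation(counts, subject_label, object_label, default="no_relation"):
    """Return the most frequent training relation for the label pair.

    Falls back to `default` (an assumed placeholder) when the pair was
    never seen during training.
    """
    pair_counts = counts.get((subject_label, object_label))
    if not pair_counts:
        return default
    return pair_counts.most_common(1)[0][0]


# Toy training data, invented for illustration only.
training_triples = [
    ("man", "wearing", "shirt"),
    ("man", "wearing", "shirt"),
    ("man", "holding", "shirt"),
    ("dog", "on", "grass"),
]

counts = fit_frequency_baseline(training_triples)
seen_prediction = predict_relation(counts, "man", "shirt")
unseen_prediction = predict_relation(counts, "shirt", "man")
print(seen_prediction, unseen_prediction)
```

The lookup is keyed on the ordered pair, so `("man", "shirt")` and `("shirt", "man")` are treated as different contexts. The sketch uses no image features at all, only the labels of the detected objects.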
