Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson

TL;DR
This paper introduces a permutation-invariant structured prediction model that maps images to scene graphs using deep learning components, and reports state-of-the-art results on Visual Genome scene graph labeling.
Contribution
It proposes a novel design principle based on permutation invariance for structured prediction models in image understanding tasks.
Findings
Achieves new state-of-the-art on Visual Genome scene graph labeling
Proves a necessary and sufficient characterization of permutation-invariant architectures
Outperforms recent approaches in scene graph prediction
Abstract
Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels while modeling within-label interactions. However, it is unclear what principles should guide the design of a structured prediction model that utilizes the power of deep learning components. Here we propose a design principle for such architectures that follows from a natural requirement of permutation invariance. We prove a necessary and sufficient characterization for architectures that follow this invariance, and discuss its implications for model design. Finally, we show that the resulting model achieves new state of…
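To make the permutation-invariance requirement concrete, here is a minimal sketch, not the authors' model. It assumes each object has a scalar feature z_i and each ordered pair a scalar feature z_ij, and it builds the prediction from sums over objects, so relabeling the objects only relabels the outputs. The functions phi, alpha, rho, predict, permute, and check_invariance are hypothetical names with placeholder bodies chosen for illustration.

```python
import itertools
import math


def phi(z_i, z_ij, z_j):
    # Placeholder pairwise term (an assumption, not taken from the paper).
    return math.tanh(z_i + 2.0 * z_ij - z_j)


def alpha(z_i, s_i):
    # Placeholder per-object term combining a node with its pooled pair terms.
    return z_i * s_i + s_i ** 2


def rho(z_k, g):
    # Placeholder readout for object k given the pooled global context g.
    return z_k + 0.5 * g


def predict(nodes, edges):
    """Return one score per object, built only from order-independent sums."""
    n = len(nodes)
    per_object = []
    for i in range(n):
        # Sum over all other objects j: the order of j does not matter.
        s_i = sum(phi(nodes[i], edges[i][j], nodes[j])
                  for j in range(n) if j != i)
        per_object.append(alpha(nodes[i], s_i))
    g = sum(per_object)  # global context shared by every object
    return [rho(z_k, g) for z_k in nodes]


def permute(nodes, edges, perm):
    """Relabel objects so that new object k is old object perm[k]."""
    new_nodes = [nodes[p] for p in perm]
    new_edges = [[edges[p][q] for q in perm] for p in perm]
    return new_nodes, new_edges


def check_invariance(nodes, edges):
    """True if every relabeling of the input just relabels the output."""
    base = predict(nodes, edges)
    n = len(nodes)
    for perm in itertools.permutations(range(n)):
        p_nodes, p_edges = permute(nodes, edges, perm)
        out = predict(p_nodes, p_edges)
        # Floating-point sums may differ slightly with summation order.
        if not all(math.isclose(out[k], base[perm[k]],
                                rel_tol=1e-9, abs_tol=1e-9)
                   for k in range(n)):
            return False
    return True
```

Calling check_invariance on any small set of node and edge features should return True for this sketch, because every interaction between objects passes through a sum. A model that instead concatenated object features in a fixed order would generally fail the same check.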
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
