Scene Graph Generation from Objects, Phrases and Region Captions

Yikang Li; Wanli Ouyang; Bolei Zhou; Kun Wang; Xiaogang Wang

arXiv:1707.09700·cs.CV·September 18, 2017·29 cites

TL;DR

This paper introduces the Multi-level Scene Description Network (MSDN), a neural network that jointly performs object detection, scene graph generation, and region captioning. By exploiting the semantic connections among the three tasks, it improves performance on each of them.

Contribution

The paper proposes a novel end-to-end multi-level neural network model that jointly learns three scene understanding tasks using a dynamic graph and message passing.

Findings

01. MSDN outperforms previous models on scene graph generation by over 3%.

02. Joint learning improves performance across all three tasks.

03. Effective multi-level semantic alignment enhances scene understanding.

Abstract

Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationship predicted, while region captioning gives a language description of the objects, their attributes, relations, and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner. Objects, phrases, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark…
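
The two steps the abstract describes (aligning objects, phrases, and caption regions in a graph, then passing messages across the three levels) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.7 coverage threshold, and the use of a plain mean in place of any learned update are all assumptions made for illustration only.

```python
import numpy as np

def box_overlap_fraction(inner, outer):
    """Fraction of `inner`'s area covered by `outer`; boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0

def build_graph(object_boxes, pairs, region_boxes, threshold=0.7):
    """Hypothetical graph construction.

    Semantic links: each phrase connects to its (subject, object) pair.
    Spatial links: a caption region connects to every phrase whose union
    box it covers by at least `threshold` (an assumed value).
    """
    phrase_boxes = []
    for s, o in pairs:
        a, b = object_boxes[s], object_boxes[o]
        phrase_boxes.append((min(a[0], b[0]), min(a[1], b[1]),
                             max(a[2], b[2]), max(a[3], b[3])))
    region_phrase = [
        [p for p, pb in enumerate(phrase_boxes)
         if box_overlap_fraction(pb, rb) >= threshold]
        for rb in region_boxes
    ]
    return list(pairs), region_phrase

def refine(obj_f, phr_f, reg_f, pairs, region_phrase):
    """One message-passing step: each node adds the mean of its neighbours' features.

    The unweighted mean is a stand-in for a learned update (assumption).
    """
    new_obj, new_phr, new_reg = obj_f.copy(), phr_f.copy(), reg_f.copy()
    # Phrases send messages to the objects they mention.
    for i in range(len(obj_f)):
        nbrs = [p for p, (s, o) in enumerate(pairs) if i in (s, o)]
        if nbrs:
            new_obj[i] += phr_f[nbrs].mean(axis=0)
    # Objects and caption regions send messages to phrases.
    for p, (s, o) in enumerate(pairs):
        msgs = [obj_f[s], obj_f[o]]
        msgs += [reg_f[r] for r, ps in enumerate(region_phrase) if p in ps]
        new_phr[p] += np.mean(msgs, axis=0)
    # Phrases send messages to the caption regions that contain them.
    for r, ps in enumerate(region_phrase):
        if ps:
            new_reg[r] += phr_f[ps].mean(axis=0)
    return new_obj, new_phr, new_reg

# Toy input: three objects, two phrases, one caption region, 2-d features.
object_boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (50, 50, 60, 60)]
pairs = [(0, 1), (1, 2)]
region_boxes = [(0, 0, 25, 25)]
obj_f = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
phr_f = np.array([[2.0, 0.0], [0.0, 2.0]])
reg_f = np.array([[4.0, 4.0]])

links, region_phrase = build_graph(object_boxes, pairs, region_boxes)
new_obj, new_phr, new_reg = refine(obj_f, phr_f, reg_f, links, region_phrase)
```

In this toy input the caption region covers only the first phrase, so it exchanges messages with that phrase alone, while the second object receives messages from both phrases it takes part in. The graph is rebuilt per image from the proposals, which is one way to read "dynamic" in the abstract.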

Code & Models

Repositories

yikang-li/MSDN (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition