Modeling Relationships in Referential Expressions with Compositional Modular Networks

Ronghang Hu; Marcus Rohrbach; Jacob Andreas; Trevor Darrell; Kate Saenko

arXiv:1611.09978·cs.CV·December 1, 2016·21 cites

TL;DR

This paper introduces Compositional Modular Networks, a novel neural architecture that analyzes and grounds referential expressions in images by decomposing them into entities and relationships, outperforming existing methods.

Contribution

The paper presents a new end-to-end neural architecture with modular components for analyzing and grounding referential expressions in images, capturing relationships beyond fixed categories.

Findings

1. Outperforms state-of-the-art on multiple datasets

2. Effectively decomposes expressions into entities and relationships

3. Learns linguistic and visual inference jointly

Abstract

People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference…
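To make the idea of grounding an entity, a relationship, and a second entity together more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that some upstream model has already produced per-region scores for the subject phrase and the object phrase, plus a pairwise score for the relationship phrase; the function name `ground_pair`, the score arrays, and the rule that subject and object must be distinct regions are all hypothetical choices made for illustration.

```python
from itertools import product


def ground_pair(subj_scores, obj_scores, rel_scores):
    """Pick the (subject, object) region pair with the highest combined score.

    All inputs are hypothetical stand-ins for model outputs:
      subj_scores[i]    score that region i matches the subject phrase
      obj_scores[j]     score that region j matches the object phrase
      rel_scores[i][j]  score that (region i, region j) matches the relationship
    """
    n = len(subj_scores)
    best_pair, best_score = None, float("-inf")
    for i, j in product(range(n), repeat=2):
        if i == j:
            continue  # assumption: subject and object are different regions
        # Combine the two entity scores with the pairwise relationship score.
        score = subj_scores[i] + rel_scores[i][j] + obj_scores[j]
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# Toy scores for "the black cat sitting under the table" over three regions.
subj = [2.0, 0.1, 0.3]  # region 0 resembles a black cat
obj = [0.2, 1.5, 0.1]   # region 1 resembles a table
rel = [
    [0.0, 1.0, 0.2],
    [0.3, 0.0, 0.1],
    [0.1, 0.4, 0.0],
]
pair, score = ground_pair(subj, obj, rel)
print(pair, score)  # (0, 1) 4.5
```

In this toy setup the pair (region 0, region 1) wins because it is the only one where the subject, relationship, and object scores are all high at once, which is the kind of joint evidence that holistic single-region grounding cannot express.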

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Human Pose and Action Recognition