MAttNet: Modular Attention Network for Referring Expression Comprehension

Licheng Yu; Zhe Lin; Xiaohui Shen; Jimei Yang; Xin Lu; Mohit Bansal; Tamara L. Berg

arXiv:1801.08186·cs.CV·March 28, 2018·89 cites

TL;DR

This paper introduces MAttNet, a modular attention network that decomposes referring expressions into subject, location, and relationship components, enabling flexible and improved image region localization.

Contribution

The paper presents an end-to-end modular attention framework that scores candidate regions with separate subject, location, and relationship modules and combines those scores using weights learned from the expression itself.

Findings

01. Outperforms previous state-of-the-art methods significantly

02. Effective decomposition of expressions improves localization accuracy

03. Demonstrates flexibility in handling diverse expression types

Abstract

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show…
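The score combination described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the softmax normalization of the module weights, and the example numbers are all assumptions made here. The abstract only states that module weights combine the three module scores dynamically into an overall score.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def overall_scores(module_logits, region_scores):
    """Combine per-module scores into one score per candidate region.

    module_logits: three raw values (subject, location, relationship),
        assumed here to be predicted from the expression.
    region_scores: one (subject, location, relationship) score tuple
        per candidate region.
    """
    # Assumption: weights are normalized with a softmax so they sum to 1.
    weights = softmax(module_logits)
    return [
        sum(w * s for w, s in zip(weights, scores))
        for scores in region_scores
    ]


def pick_region(module_logits, region_scores):
    """Return the index of the candidate region with the highest overall score."""
    scores = overall_scores(module_logits, region_scores)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores for two candidate regions.
regions = [(0.9, 0.1, 0.1), (0.2, 0.8, 0.9)]

# An expression dominated by subject appearance favors region 0.
subject_heavy = pick_region([5.0, 0.0, 0.0], regions)

# An expression weighting all three modules equally favors region 1.
balanced = pick_region([0.0, 0.0, 0.0], regions)
```

Because the weights depend on the expression, the same candidate regions can be ranked differently for different expressions, which is the flexibility the abstract refers to.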

Code & Models

Repositories

lichengunc/MAttNet (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning