Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Xihui Liu; Zihao Wang; Jing Shao; Xiaogang Wang; Hongsheng Li

arXiv:1903.00839 · cs.CV · April 3, 2019 · 24 cites

Open Access

TL;DR

This paper introduces a cross-modal attention-guided erasing technique for referring expression grounding. Erasing the most dominant textual or visual information during training encourages the model to discover complementary visual-textual correspondences, and the approach achieves state-of-the-art results on three datasets.

Contribution

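The paper proposes a novel erasing approach that discards the most dominant information from either the textual or the visual domain, generating difficult training samples online to improve cross-modal alignment in referring expression grounding.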

Findings

01 Achieves state-of-the-art performance on three datasets.
02 Effectively discovers complementary visual-textual correspondences.
03 Improves model robustness by generating difficult training samples.

Abstract

Referring expression grounding aims to locate certain objects or persons in an image given a referring expression. The key challenge is to comprehend and align various types of information from the visual and textual domains, such as visual attributes, location, and interactions with surrounding regions. Although attention mechanisms have been successfully applied to cross-modal alignment, previous attention models focus only on the most dominant features of both modalities and neglect the fact that there can be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, in which we discard the most dominant information from either the textual or the visual domain to generate difficult training samples online, and to drive the model to discover complementary…
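The erasing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `erase_most_attended`, the `<unk>` placeholder token, the toy attention weights, and the choice to erase exactly the single highest-attention element are all assumptions made for illustration. It only shows how attention weights over words or image regions could be used to remove the dominant element and produce a harder training sample.

```python
from typing import Any, List, Sequence

UNK = "<unk>"  # hypothetical placeholder token for an erased word


def erase_most_attended(items: Sequence[Any],
                        attention: Sequence[float],
                        fill: Any) -> List[Any]:
    """Return a copy of `items` with the highest-attention entry replaced by `fill`.

    Assumption: "most dominant" is taken to mean the single element with the
    largest attention weight (ties resolve to the first such element).
    """
    if len(items) != len(attention):
        raise ValueError("items and attention must have the same length")
    if not items:
        return []
    top = max(range(len(items)), key=lambda i: attention[i])
    erased = list(items)
    erased[top] = fill
    return erased


# Textual erasing: the most attended word is replaced by the placeholder token.
words = ["man", "in", "red", "shirt"]
word_attention = [0.15, 0.05, 0.60, 0.20]  # toy values, not from the paper
hard_query = erase_most_attended(words, word_attention, UNK)
print(hard_query)  # ['man', 'in', '<unk>', 'shirt']

# Visual erasing: the most attended region feature is replaced by zeros.
regions = [[0.2, 0.9], [0.7, 0.1], [0.4, 0.4]]
region_attention = [0.1, 0.7, 0.2]  # toy values, not from the paper
hard_regions = erase_most_attended(regions, region_attention, [0.0, 0.0])
print(hard_regions)  # [[0.2, 0.9], [0.0, 0.0], [0.4, 0.4]]
```

In a training loop, the erased sample would be fed back to the model alongside the original, so that the model cannot rely only on its single strongest cue. How the erased samples are weighted or scheduled is not specified in the text above.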

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling