Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Yuting Zhang; Luyao Yuan; Yijie Guo; Zhiyuan He; I-An Huang; Honglak Lee

arXiv:1704.03944 · cs.CV · April 18, 2017 · 1 cites

Open Access

TL;DR

This paper introduces a discriminative bimodal neural network (DBNet) that improves natural language-based visual localization and detection by effectively pairing image regions with text queries, outperforming previous methods.

Contribution

The paper presents a discriminative approach, trained with a classifier that makes extensive use of negative samples. This improves localization accuracy and broadens the range of text phrases that can be used for visual entity localization.

Findings

1. DBNet significantly outperforms previous state-of-the-art methods on the Visual Genome dataset.
2. The approach improves localization accuracy on single images.
3. The method enhances detection capabilities across multiple images.

Abstract

Associating image regions with text queries has recently been explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate that the proposed DBNet significantly outperforms previous state-of-the-art…
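The idea of pairing image regions with text phrases as positive and negative examples can be illustrated with a small sketch. This is not the paper's implementation: the dot-product score, the plain binary logistic loss, and every name here (`pair_score`, `logistic_loss`, `batch_loss`) are assumptions chosen for illustration, and the feature vectors stand in for whatever the image and text pathways would produce.

```python
import math

def pair_score(region_feat, phrase_feat):
    """Hypothetical compatibility score: dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(region_feat, phrase_feat))

def logistic_loss(score, label):
    """Binary logistic loss; label is 1 for a matching pair, 0 for a negative pair."""
    # Positives are penalised by softplus(-score), negatives by softplus(score).
    z = -score if label == 1 else score
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def batch_loss(pairs):
    """Mean loss over (region_feat, phrase_feat, label) triples."""
    losses = [logistic_loss(pair_score(r, p), y) for r, p, y in pairs]
    return sum(losses) / len(losses)

# Toy usage: one region paired with a matching and a non-matching phrase.
region = [1.0, 0.0]
pairs = [
    (region, [2.0, 0.0], 1),   # assumed positive pair
    (region, [-2.0, 0.0], 0),  # assumed negative pair
]
loss = batch_loss(pairs)
```

Under this sketch, adding more negative pairs only adds terms to the mean, which is one simple way to read "extensive use of negative samples".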

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques