Unsupervised Textual Grounding: Linking Words to Image Concepts

Raymond A. Yeh; Minh N. Do; Alexander G. Schwing

arXiv:1803.11185·cs.CV·March 30, 2018

TL;DR

This paper introduces an unsupervised method for textual grounding, i.e., linking words to objects in images. It removes the need for large annotated datasets and outperforms baselines on the ReferIt Game and Flickr30k datasets.

Contribution

The authors propose an unsupervised approach that uses hypothesis testing to link words to detected image concepts, reducing reliance on labeled data.

Findings

01

Outperforms baselines by 7.98% on the ReferIt Game dataset

02

Outperforms baselines by 6.96% on the Flickr30k dataset

03

Demonstrates that textual grounding can be learned without bounding-box supervision

Abstract

Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. To train these deep-net-based approaches, access to large-scale datasets is required; however, constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding, using hypothesis testing to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k dataset, outperforming baselines by 7.98% and 6.96%, respectively.
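
The abstract does not spell out the test itself, so the following is only a rough sketch of how hypothesis testing could link a word to a detected image concept, not the authors' procedure. It assumes each image comes with a tokenized caption and a dictionary of concept-detector scores, and it uses Welch's t-statistic as a stand-in test: a word is linked to a concept when the detector scores are significantly higher on images whose caption contains the word. The names `welch_t`, `link_words_to_concepts`, and the `threshold` value are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def link_words_to_concepts(captions, scores, threshold=2.0):
    """Link each word to the concept whose detector score rises most with it.

    captions: list of token lists, one per image (assumed input format)
    scores:   list of {concept_name: detector_score}, one per image
    threshold: hypothetical cutoff on the t-statistic
    """
    vocab = {word for caption in captions for word in caption}
    concepts = list(scores[0].keys())
    links = {}
    for word in vocab:
        best = None  # (concept, t-statistic) of the strongest link so far
        for concept in concepts:
            present = [s[concept] for c, s in zip(captions, scores) if word in c]
            absent = [s[concept] for c, s in zip(captions, scores) if word not in c]
            # A sample variance needs at least two observations per group.
            if len(present) < 2 or len(absent) < 2:
                continue
            t = welch_t(present, absent)
            if t > threshold and (best is None or t > best[1]):
                best = (concept, t)
        if best is not None:
            links[word] = best[0]
    return links


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    captions = [["a", "dog"], ["dog", "runs"], ["a", "cat"],
                ["cat", "sits"], ["a", "dog"], ["a", "cat"]]
    scores = [
        {"dog_det": 0.90, "cat_det": 0.10},
        {"dog_det": 0.80, "cat_det": 0.20},
        {"dog_det": 0.10, "cat_det": 0.90},
        {"dog_det": 0.20, "cat_det": 0.80},
        {"dog_det": 0.85, "cat_det": 0.15},
        {"dog_det": 0.15, "cat_det": 0.85},
    ]
    print(link_words_to_concepts(captions, scores))
```

In this toy setup, a function word such as "a" appears with both kinds of images, so its detector scores do not differ between the two groups and no link is made. Words that occur in only one caption are skipped because the test has too little data.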
