Unsupervised Textual Grounding: Linking Words to Image Concepts
Raymond A. Yeh, Minh N. Do, Alexander G. Schwing

TL;DR
This paper introduces an unsupervised method for textual grounding, i.e., linking words to objects in images. It removes the need for large annotated datasets and outperforms baselines on the ReferIt Game and Flickr30k datasets.
Contribution
The authors propose a novel unsupervised approach using hypothesis testing to connect words with image concepts, reducing reliance on labeled data.
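To make the idea of linking a word to a detected concept by hypothesis testing concrete, here is a minimal sketch. It is not the authors' procedure: the choice of a one-sided Fisher exact test, the co-occurrence counts, and the names `fisher_one_sided` and `is_linked` are all assumptions made for illustration. The null hypothesis is that a word appears in captions independently of whether a concept is detected in the image; a small p-value suggests a link.

```python
from math import comb


def fisher_one_sided(both, word_only, concept_only, neither):
    """P-value for H0: the word and the concept occur independently.

    Hypothetical setup: each count is a number of image-caption pairs.
      both         -> word in caption AND concept detected in image
      word_only    -> word in caption, concept not detected
      concept_only -> concept detected, word not in caption
      neither      -> neither present
    Returns P(X >= both) under the hypergeometric null distribution.
    """
    n_word = both + word_only
    n_concept = both + concept_only
    total = both + word_only + concept_only + neither
    upper = min(n_word, n_concept)
    # Sum the hypergeometric tail over all co-occurrence counts at least
    # as extreme as the observed one.
    tail = sum(
        comb(n_concept, k) * comb(total - n_concept, n_word - k)
        for k in range(both, upper + 1)
    )
    return tail / comb(total, n_word)


def is_linked(counts, alpha=0.05):
    """Reject independence (i.e. link word and concept) when p < alpha."""
    return fisher_one_sided(*counts) < alpha


# Illustrative counts only (assumed, not taken from the paper).
strong = (8, 2, 1, 9)   # word and concept usually appear together
weak = (5, 5, 5, 5)     # no visible association
print(is_linked(strong), is_linked(weak))
```

The significance level `alpha` is likewise an illustrative choice; in practice, testing many word-concept pairs would call for a multiple-comparison correction.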
Findings
Outperforms baselines by 7.98% on the ReferIt Game dataset
Outperforms baselines by 6.96% on the Flickr30k dataset
Demonstrates that textual grounding can be performed without annotated training data
Abstract
Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. To train these deep-net-based approaches, access to large-scale datasets is required; however, constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding, using hypothesis testing to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k dataset, outperforming baselines by 7.98% and 6.96%, respectively.
