Referring Expression Comprehension: A Survey of Methods and Datasets
Yanyuan Qiao, Chaorui Deng, Qi Wu

TL;DR
This survey reviews recent methods and datasets for referring expression comprehension, highlighting the challenges, architectures, and future directions in localizing objects in images based on natural language descriptions.
Contribution
It provides a comprehensive classification of REC methods, compares state-of-the-art approaches, and discusses future research directions including compositional reasoning.
Findings
Joint embedding of images and expressions is common in REC methods.
Graph-based models effectively utilize structured representations.
Datasets vary in size and complexity, impacting model evaluation.
Abstract
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language. Unlike object detection, where the queried object labels are pre-defined, REC observes its queries only at test time. It is thus more challenging than a conventional computer vision problem. The task has attracted considerable attention from both the computer vision and natural language processing communities, and several lines of work have been proposed, from CNN-RNN models and modular networks to complex graph-based models. In this survey, we first examine the state of the art by comparing modern approaches to the problem. We classify methods by the mechanism they use to encode the visual and textual modalities. In particular, we examine the common approach of jointly embedding images and expressions into a common feature space.…
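The joint-embedding idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the method of any particular paper covered by the survey. It assumes candidate region features and an expression feature have already been extracted by some visual and textual encoder, and that `W_v` and `W_t` are learned projection matrices. The names `select_region`, `W_v`, and `W_t` are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so that dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def select_region(region_feats, expr_feat, W_v, W_t):
    """Pick the candidate region that best matches the expression.

    region_feats: (num_regions, d_v) visual features, one row per candidate box
    expr_feat:    (d_t,) feature vector of the referring expression
    W_v:          (d_v, d) projection into the shared space (assumed learned)
    W_t:          (d_t, d) projection into the shared space (assumed learned)
    """
    v = l2_normalize(region_feats @ W_v)  # (num_regions, d)
    t = l2_normalize(expr_feat @ W_t)     # (d,)
    scores = v @ t                        # cosine similarity per region
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative shapes only: 5 candidate regions, untrained random projections.
    regions = rng.normal(size=(5, 16))
    expression = rng.normal(size=(12,))
    W_v = rng.normal(size=(16, 8))
    W_t = rng.normal(size=(12, 8))
    best, scores = select_region(regions, expression, W_v, W_t)
    print("selected region index:", best)
```

In a trained system the two projections would be optimized so that the referred region scores higher than the other candidates. With the random weights used here, the selected index carries no meaning.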
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
