Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension
Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu

TL;DR
This paper introduces Cops-Ref, a challenging new dataset and task for referring expression comprehension that emphasizes complex reasoning and distractor handling, revealing limitations of current models and encouraging deeper visual reasoning research.
Contribution
It presents a novel dataset with compositional expressions and a challenging test setting, advancing the evaluation of reasoning capabilities in referring expression comprehension models.
Findings
Existing referring expression comprehension models perform poorly on Cops-Ref.
A modular hard mining strategy improves model performance.
The dataset reveals significant room for improvement in visual reasoning.
Abstract
Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models, mainly because 1) their expressions typically describe only some simple distinctive properties of the object and 2) their images contain limited distracting information. To bridge the gap, we propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features. First, we design a novel expression engine rendering various reasoning logics that can be flexibly combined with rich visual properties to generate expressions with varying compositionality. Second, to better exploit the full reasoning…
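To make the idea of "varying compositionality" concrete, here is a minimal sketch of how a template-style engine could chain relations and attributes into longer expressions. It is an illustration only and not the authors' expression engine: the `Obj` class, the `compose` function, and the phrase template are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    """A scene object with a category name and optional attributes (hypothetical schema)."""
    name: str
    attributes: tuple = ()


def describe(obj):
    """Render an object as '<attributes> <name>', e.g. 'white cat'."""
    return " ".join([*obj.attributes, obj.name])


def compose(target, chain):
    """Build a referring expression for `target`.

    `chain` is a list of (relation, Obj) pairs. Each pair adds one
    reasoning hop, so a longer chain yields a more compositional expression.
    """
    phrase = f"the {describe(target)}"
    for relation, other in chain:
        phrase += f" that is {relation} the {describe(other)}"
    return phrase


cat = Obj("cat", ("white",))
chair = Obj("chair", ("wooden",))
table = Obj("table")

print(compose(cat, []))
print(compose(cat, [("sitting on", chair)]))
print(compose(cat, [("sitting on", chair), ("next to", table)]))
```

Each extra hop forces a model to resolve one more object and relation before it can locate the target, which is the kind of multi-step reasoning the abstract says simpler datasets do not test.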
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
