Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension
Luca Parolari, Elena Izzo, Lamberto Ballan

TL;DR
This paper introduces Harlequin, a synthetic data generation framework for Referring Expression Comprehension that creates large-scale, annotated datasets to improve model training without manual labeling.
Contribution
The paper presents a pipeline that first creates variations of existing REC annotations and then synthesizes images guided by the altered annotations, yielding a large artificial dataset for training deep learning models on REC.
Findings
Pre-training on Harlequin improves REC model performance.
The Harlequin dataset contains over 1 million queries.
Synthetic data reduces reliance on manual annotations.
Abstract
Referring Expression Comprehension (REC) aims to identify a particular object in a scene by a natural language expression, and is an important topic in visual language understanding. State-of-the-art methods for this task are based on deep learning, which generally requires expensive and manually labeled annotations. Some works tackle the problem with limited-supervision learning or relying on Large Vision and Language Models. However, the development of techniques to synthesize labeled data is overlooked. In this paper, we propose a novel framework that generates artificial data for the REC task, taking into account both textual and visual modalities. At first, our pipeline processes existing data to create variations in the annotations. Then, it generates an image using altered annotations as guidance. The result of this pipeline is a new dataset, called Harlequin, made by more than…
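The first stage of the pipeline, creating variations in the annotations, can be illustrated with a minimal sketch. It assumes that a variation means swapping the color word in a referring expression while keeping the bounding box unchanged. That reading is suggested by the title but is not spelled out in the abstract. The color list, the annotation layout (`query`, `bbox`), and the function names are all hypothetical and do not come from the paper. The second stage, image generation guided by the altered annotation, is not shown.

```python
import random
import re

# Hypothetical color vocabulary; the source does not list the colors it uses.
COLORS = ["red", "green", "blue", "yellow", "black", "white"]
COLOR_PATTERN = re.compile(r"\b(" + "|".join(COLORS) + r")\b", re.IGNORECASE)


def vary_color(expression, rng):
    """Replace the first color word in a referring expression with a different color.

    Returns the altered expression, or None if no known color word is present.
    """
    match = COLOR_PATTERN.search(expression)
    if match is None:
        return None
    original = match.group(1).lower()
    # Pick any color other than the one already in the expression.
    replacement = rng.choice([c for c in COLORS if c != original])
    return expression[:match.start()] + replacement + expression[match.end():]


def make_variation(annotation, rng):
    """Build an altered annotation: same bounding box, different color in the query.

    The annotation layout (a dict with "query" and "bbox") is an assumption
    made for this sketch, not the dataset's actual schema.
    """
    new_query = vary_color(annotation["query"], rng)
    if new_query is None:
        return None
    return {"query": new_query, "bbox": list(annotation["bbox"])}


if __name__ == "__main__":
    rng = random.Random(0)
    annotation = {"query": "the red car on the left", "bbox": [10, 20, 110, 90]}
    print(make_variation(annotation, rng))
```

In a full pipeline of this kind, the altered query and the unchanged box would then serve as guidance for an image generator, so that the synthesized object matches the new color at the annotated location.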
Taxonomy
Topics: Color perception and design
