Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio

TL;DR
This paper introduces a three-tier data synthesis framework that generates scalable, realistic supervision data for dialogue grounding; fine-tuning on this data yields consistent, substantial improvements over prior approaches on standard evaluation metrics.
Contribution
A three-tier data synthesis method that balances realism and controllability, addressing the scarcity of annotated dialogue grounding data and the distribution shift between training and evaluation domains.
Findings
Fine-tuning on synthesized data improves performance across evaluation metrics.
The framework enhances model robustness to domain shifts.
Synthesized data effectively bridges the gap caused by limited annotated datasets.
Abstract
Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground a referring expression to an arbitrary number of targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.
