RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Bimsara Pathiraja; Maitreya Patel; Shivam Singh; Yezhou Yang; Chitta Baral

arXiv:2506.03448·cs.CV·June 5, 2025

TL;DR

The paper introduces RefEdit-Bench, a benchmark for image editing driven by referring expressions, and RefEdit, an instruction-based image editing model that improves editing in complex scenes with multiple entities and outperforms baselines trained on millions of samples.

Contribution

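The paper contributes RefEdit-Bench, a real-world benchmark rooted in RefCOCO, and RefEdit, an instruction-based editing model trained on data from a scalable synthetic generation pipeline. Together they target instruction-based image editing in complex scenes, where the authors report state-of-the-art results.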

Findings

01

RefEdit, trained on only 20,000 editing triplets, outperforms Flux/SD3-based baselines trained on millions of samples.

02

RefEdit achieves state-of-the-art results on referring expression tasks.

03

RefEdit also improves performance on traditional image editing benchmarks.

Abstract

Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit, an instruction-based editing model trained on our scalable synthetic data generation pipeline. RefEdit, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques