TL;DR
EditRefiner introduces a human-aligned, hierarchical framework for image editing refinement. It uses a new dataset and a perception-reasoning-action-evaluation loop to improve local corrections and perceptual quality.
Contribution
The paper presents a novel dataset of fine-grained human feedback (EditFHF-15K) and a hierarchical agentic framework (EditRefiner) for human-aligned, self-corrective image editing refinement, and reports that the framework outperforms existing methods.
Findings
Outperforms state-of-the-art methods in distortion localization and diagnosis accuracy.
Achieves closer alignment with human perception.
Establishes a new paradigm for self-corrective image editing.
Abstract
Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a…
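To make the dataset description above more concrete, here is a minimal sketch of how one EditFHF-15K record might be represented. The abstract does not give a schema, so every class name, field name, and the bounding-box format below is a hypothetical choice for illustration; only the three score dimensions and the two region types come from the text.

```python
from dataclasses import dataclass, field

# The three MOS dimensions named in the abstract (identifier spelling is ours).
MOS_DIMENSIONS = ("perceptual_quality", "instruction_following", "visual_consistency")


@dataclass
class Region:
    """One annotated region. All field names are hypothetical."""
    box: tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels
    kind: str                       # "artifact" or "editing_failure"
    reasoning: str                  # textual reasoning attached to the region


@dataclass
class EditedImageRecord:
    """One edited image with its human feedback. Hypothetical layout."""
    image_id: str
    instruction: str
    regions: list[Region] = field(default_factory=list)
    mos: dict[str, float] = field(default_factory=dict)


def is_complete(record: EditedImageRecord) -> bool:
    """Return True if the record has a score for every MOS dimension."""
    return all(dim in record.mos for dim in MOS_DIMENSIONS)


# Made-up example values, for illustration only.
record = EditedImageRecord(
    image_id="example-0001",
    instruction="replace the sky with a sunset",
    regions=[Region((10, 20, 120, 90), "artifact", "lighting does not match the scene")],
    mos={dim: 3.0 for dim in MOS_DIMENSIONS},
)
```

Under this reading, the reported 45K scores are consistent with one score per dimension for each of the 15K images (15,000 × 3 = 45,000), although the abstract does not state this breakdown explicitly.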
