InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo

TL;DR
InterCoG introduces a novel framework for fine-grained image editing in complex scenes by combining spatial reasoning, visual grounding, and description rewriting, supported by new datasets and auxiliary training modules.
Contribution
InterCoG contributes a new reasoning framework and accompanying datasets for spatially precise image editing in multi-entity scenes, with the aim of improving both interpretability and accuracy.
Findings
InterCoG outperforms existing methods in precise multi-entity scene editing.
The auxiliary modules improve spatial localization and reasoning interpretability.
The GroundEdit-45K dataset enables better evaluation of grounding-aware editing.
Abstract
Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this…
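The three stages named in the abstract (text-only position reasoning, visual grounding with a box and mask, and a rewritten edit description) can be pictured as a single record passed along the chain. The following is only an illustrative sketch, not the authors' implementation: `GroundingTrace`, `box_to_mask`, and `build_trace` are hypothetical names, the model outputs are supplied by hand, and the mask is simply the filled bounding box rather than a learned segmentation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in pixels, x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class GroundingTrace:
    """Hypothetical container for the three stages described in the abstract."""
    position_reasoning: str      # stage 1: spatial reasoning carried out in text
    box: Box                     # stage 2: bounding box of the edit target
    mask: List[List[int]]        # stage 2: binary mask in pixel space
    rewritten_instruction: str   # stage 3: edit description stating the outcome


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Rasterize a box into a binary mask (assumption: mask = filled box)."""
    x0, y0, x1, y1 = box
    # Clamp to the image so out-of-range boxes do not index outside it.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
        for y in range(height)
    ]


def build_trace(reasoning: str, box: Box, rewritten: str,
                height: int, width: int) -> GroundingTrace:
    """Assemble the interleaved trace from outputs a model would generate."""
    return GroundingTrace(
        position_reasoning=reasoning,
        box=box,
        mask=box_to_mask(box, height, width),
        rewritten_instruction=rewritten,
    )


if __name__ == "__main__":
    trace = build_trace(
        reasoning="The target is the second cup from the left on the shelf.",
        box=(1, 0, 3, 2),
        rewritten="Change the second cup from the left to a red cup.",
        height=3,
        width=4,
    )
    for row in trace.mask:
        print(row)
```

Keeping the text reasoning, the pixel-space grounding, and the rewritten instruction in one record mirrors the interleaved text-vision order the abstract describes. In the actual framework these fields would be produced by the model rather than passed in by hand.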
