InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

Yecong Wan; Fan Li; Chunwei Wang; Hao Wu; Mingwen Shao; Wangmeng Zuo

arXiv:2603.01586 · cs.CV · May 11, 2026


TL;DR

InterCoG is a framework for fine-grained image editing in complex multi-entity scenes. It chains text-based spatial reasoning, visual grounding, and rewriting of the editing description, and is supported by new datasets and auxiliary training modules.

Contribution

The paper presents a reasoning framework and datasets for spatially precise image editing in multi-entity scenes, with the aim of improving both interpretability and editing accuracy.

Findings

01. InterCoG outperforms existing methods in precise multi-entity scene editing.

02. The auxiliary modules improve spatial localization and reasoning interpretability.

03. The GroundEdit-45K dataset enables better evaluation of grounding-aware editing.

Abstract

Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, fine-grained editing in complex multi-entity scenes remains a significant challenge, particularly when targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details, explicitly deducing the location and identity of the edited target. It then conducts visual grounding by highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this…
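The three stages named in the abstract (text-only position reasoning, visual grounding with a box and mask, and description rewriting) can be pictured as one record per edit. The sketch below is only an illustration under stated assumptions: the names `GroundingTrace`, `box_to_mask`, and `build_trace` are hypothetical and do not come from the paper, and the rectangular mask is a stand-in for whatever mask the model actually generates.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: boxes are (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class GroundingTrace:
    """Hypothetical container for one interleaved reasoning trace."""
    spatial_reasoning: str        # stage 1: text-only deduction of the target
    box: Box                      # stage 2: bounding box in pixel space
    mask: List[List[int]]         # stage 2: binary mask (rows of 0/1)
    rewritten_instruction: str    # stage 3: rewritten editing description


def box_to_mask(box: Box, width: int, height: int) -> List[List[int]]:
    """Rasterize a box into a binary mask, clamping it to the image bounds.

    A rectangle is a placeholder here; a real system would emit a
    segmentation mask rather than a filled box.
    """
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def build_trace(reasoning: str, box: Box, rewritten: str,
                width: int, height: int) -> GroundingTrace:
    """Assemble the three stages in the order the abstract describes."""
    return GroundingTrace(
        spatial_reasoning=reasoning,
        box=box,
        mask=box_to_mask(box, width, height),
        rewritten_instruction=rewritten,
    )


# Illustrative inputs only; the strings are invented for this example.
trace = build_trace(
    reasoning="The target is the second cup from the left on the lower shelf.",
    box=(1, 1, 3, 2),
    rewritten="Change the highlighted cup to red.",
    width=4,
    height=3,
)
```

Keeping the reasoning text, the grounding output, and the rewritten instruction in one record makes the intermediate steps inspectable, which is the kind of interpretability the summary above attributes to the framework.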
