Generative Visual Chain-of-Thought for Image Editing

Zijin Yin; Tiankai Hang; Yiji Cheng; Shiyi Zhang; Runze He; Yu Xu; Chunyu Wang; Bing Li; Zheng Chang; Kongming Liang; Qinglin Lu; Zhanyu Ma

arXiv:2603.01893·cs.CV·March 17, 2026

TL;DR

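This paper introduces GVCoT, an image editing framework that first generates spatial cues to localize the target region and then executes the edit. It is trained on GVCoT-Edit-Instruct, a dataset of 1.8M samples spanning 19 tasks, and outperforms state-of-the-art models on SREdit-Bench and ImgEdit.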

Contribution

The paper proposes GVCoT, a unified visual reasoning framework for image editing that jointly optimizes reasoning and editing in an end-to-end manner, and introduces new datasets and benchmarks.

Findings

01 GVCoT outperforms state-of-the-art models on SREdit-Bench and ImgEdit.

02 The authors constructed the GVCoT-Edit-Instruct dataset, which contains 1.8 million samples.

03 A progressive training strategy improves localization and editing quality.

Abstract

Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This approach fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks.…
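The "localize first, then edit" order described in the abstract can be illustrated with a toy sketch. This is not the authors' model: GVCoT generates its spatial cues as visual tokens and is trained end-to-end, whereas the sketch below assumes the cue is already available as a hand-written bounding box. The names `box_mask` and `edit_with_spatial_cue`, the grayscale list-of-lists image, and the pixel-inversion "edit" are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values
Mask = List[List[bool]]  # True where the edit is allowed


def box_mask(height: int, width: int, box: Tuple[int, int, int, int]) -> Mask:
    """Build a binary mask from a (top, left, bottom, right) box, end-exclusive."""
    top, left, bottom, right = box
    return [
        [top <= r < bottom and left <= c < right for c in range(width)]
        for r in range(height)
    ]


def edit_with_spatial_cue(
    image: Image, mask: Mask, edit_pixel: Callable[[int], int]
) -> Image:
    """Apply edit_pixel only where the mask is True; other pixels are unchanged."""
    return [
        [edit_pixel(p) if m else p for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Stage 1 (reasoning): assume a hypothetical localizer returned this box.
image = [[10, 10, 10, 10] for _ in range(3)]
mask = box_mask(3, 4, (0, 1, 2, 3))

# Stage 2 (editing): the edit is confined to the localized region.
edited = edit_with_spatial_cue(image, mask, lambda p: 255 - p)
```

The point of the sketch is only the ordering: the spatial cue is produced before the edit and constrains where the edit can change pixels, so everything outside the mask is preserved exactly.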

Code & Models

Datasets

zjYinnnn/SREdit-Bench
dataset · 38 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection