Instruction-based Image Editing with Planning, Reasoning, and Generation

Liya Ji; Chenyang Qi; Qifeng Chen

arXiv:2602.22624·cs.CV·February 27, 2026

Open Access

TL;DR

This paper introduces a multi-modality model that enhances instruction-based image editing by integrating planning, reasoning, and generation, leading to improved handling of complex real-world images.

Contribution

It proposes a multi-modality chain-of-thought framework that combines reasoning and generation for more effective instruction-based image editing.

Findings

01 Achieves competitive editing performance on complex images.

02 Effectively integrates reasoning and generation in image editing.

03 Outperforms prior models in handling real-world editing tasks.

Abstract

Editing images via instructions provides a natural way to generate interactive content, but it remains challenging because it places high demands on both scene understanding and generation. Prior work chains together large language models, object segmentation models, and editing models for this task. However, the understanding models in such chains operate on a single modality, which restricts the editing quality. We aim to bridge understanding and generation via a new multi-modality model that gives instruction-based image editing models the reasoning ability needed for more complex cases. To achieve this goal, we decompose the instruction-based editing task into multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons about the appropriate sub-prompts…
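The three stages named in the abstract (CoT planning, editing region reasoning, and editing) can be pictured as a simple loop over sub-prompts. The following is a minimal sketch under that reading, not the authors' implementation: the `EditingPipeline` class, the `plan` / `locate` / `edit` callables, and the toy stand-ins are all hypothetical names introduced for illustration, and images are plain nested lists so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # toy grayscale image: rows of pixel values
Mask = List[List[bool]]   # True marks pixels inside the editing region


@dataclass
class EditingPipeline:
    """Hypothetical three-stage decomposition (assumption, not the paper's code)."""
    plan: Callable[[str], List[str]]            # stage 1: CoT planning
    locate: Callable[[Image, str], Mask]        # stage 2: editing region reasoning
    edit: Callable[[Image, Mask, str], Image]   # stage 3: editing

    def run(self, image: Image, instruction: str) -> Image:
        # Apply each planned sub-prompt in order, feeding the result forward.
        for sub_prompt in self.plan(instruction):
            mask = self.locate(image, sub_prompt)
            image = self.edit(image, mask, sub_prompt)
        return image


# Toy stand-ins so the sketch runs end to end; real stages would be models.
def toy_plan(instruction: str) -> List[str]:
    # Split a compound instruction on " and " into sub-prompts.
    return [part.strip() for part in instruction.split(" and ") if part.strip()]


def toy_locate(image: Image, sub_prompt: str) -> Mask:
    # Select the left half if the sub-prompt mentions "left", else the right half.
    width = len(image[0])
    wants_left = "left" in sub_prompt
    return [[(x < width // 2) == wants_left for x in range(width)] for _ in image]


def toy_edit(image: Image, mask: Mask, sub_prompt: str) -> Image:
    # Overwrite masked pixels: 255 for "brighten", 0 otherwise.
    value = 255 if "brighten" in sub_prompt else 0
    return [
        [value if inside else pixel for pixel, inside in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


pipeline = EditingPipeline(plan=toy_plan, locate=toy_locate, edit=toy_edit)
result = pipeline.run(
    [[10, 10, 10, 10], [10, 10, 10, 10]],
    "brighten the left half and darken the right half",
)
```

Keeping the stages as separate callables mirrors the decomposition described in the abstract: each sub-prompt gets its own region before any pixels change, so a later sub-prompt sees the output of the earlier edit.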

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Multimodal Machine Learning Applications