Instruction-based Image Editing with Planning, Reasoning, and Generation
Liya Ji, Chenyang Qi, Qifeng Chen

TL;DR
This paper introduces a multi-modality model that enhances instruction-based image editing by integrating planning, reasoning, and generation, leading to improved handling of complex real-world images.
Contribution
It proposes a multi-modality chain-of-thought framework that combines reasoning and generation for more effective instruction-based image editing.
Findings
Achieves competitive editing performance on complex images.
Effectively integrates reasoning and generation in image editing.
Outperforms prior models in handling real-world editing tasks.
Abstract
Editing images via instructions provides a natural way to generate interactive content, but it is challenging because it demands strong scene understanding and generation. Prior work chains large language models, object segmentation models, and editing models for this task. However, these understanding models operate on a single modality, which limits editing quality. We aim to bridge understanding and generation with a new multi-modality model that brings these reasoning abilities to instruction-based image editing for more complex cases. To achieve this goal, we decompose the instruction-editing task into multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model could reason the appropriate sub-prompts…
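The three stages named in the abstract (CoT planning, editing region reasoning, editing) can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `run_pipeline`, `EditStep`, `toy_plan`, `toy_locate`, and `toy_edit` are hypothetical, the bounding-box region format is assumed, and applying the stages once per sub-prompt is a guess about control flow that the truncated abstract does not confirm. The toy stand-ins replace the language model, the region reasoner, and the editing model with trivial functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical region format (x0, y0, x1, y1); the abstract does not specify one.
Box = Tuple[int, int, int, int]
# Toy "image": a log of applied edits, standing in for real pixel data.
Image = List[str]


@dataclass
class EditStep:
    """Record of one planned sub-prompt and the region chosen for it."""
    sub_prompt: str
    region: Box


def run_pipeline(
    image: Image,
    instruction: str,
    plan: Callable[[str], List[str]],
    locate: Callable[[Image, str], Box],
    edit: Callable[[Image, str, Box], Image],
) -> Tuple[Image, List[EditStep]]:
    """Assumed control flow: plan sub-prompts, then locate and edit for each."""
    steps: List[EditStep] = []
    for sub_prompt in plan(instruction):            # stage 1: CoT planning
        region = locate(image, sub_prompt)          # stage 2: editing region reasoning
        image = edit(image, sub_prompt, region)     # stage 3: editing
        steps.append(EditStep(sub_prompt, region))
    return image, steps


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_plan(instruction: str) -> List[str]:
    # Split a compound instruction on " and " to mimic sub-prompt planning.
    return [p.strip() for p in instruction.split(" and ") if p.strip()]


def toy_locate(image: Image, sub_prompt: str) -> Box:
    # Always return the same fixed 64x64 box.
    return (0, 0, 64, 64)


def toy_edit(image: Image, sub_prompt: str, region: Box) -> Image:
    # Return a new log with the edit appended; the input is left unmodified.
    return image + [f"{sub_prompt}@{region}"]
```

Calling `run_pipeline([], "remove the car and add a tree", toy_plan, toy_locate, toy_edit)` yields two recorded steps, one per sub-prompt. Passing the stages in as callables keeps the sketch independent of any particular model or library.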
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Multimodal Machine Learning Applications
