Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Yijia Wang, Yiqing Shen, Weiming Chen, Zhihai He

TL;DR
This paper introduces CIELR, a novel image editing method that converts complex instructions into explicit actions using LLM reasoning, avoiding costly joint fine-tuning of models, and achieves state-of-the-art results on a new benchmark.
Contribution
The paper proposes CIELR, a new approach that simplifies complex image editing by reasoning with LLMs and structured representations, eliminating the need for joint fine-tuning.
Findings
CIELR surpasses previous methods by 9.955 dB in PSNR.
The method effectively preserves image regions during editing.
A new benchmark CIEBench is introduced for reasoning-based image editing.
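To give a sense of scale for the PSNR finding, here is a minimal sketch of the standard PSNR definition, 10 · log10(MAX² / MSE). It is not the paper's evaluation code. The function names `psnr` and `mse_ratio_for_gain`, the flat-list image representation, and the 8-bit peak value of 255 are assumptions made for this illustration.

```python
import math


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images.

    Both images are flat sequences of pixel values of equal length
    (an assumption for this sketch; real pipelines use arrays).
    """
    if len(reference) == 0 or len(reference) != len(edited):
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel positions.
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    # Tiny hypothetical 2x2 grayscale images, flattened.
    reference = [10, 20, 30, 40]
    edited = [12, 18, 33, 40]
    print(f"PSNR: {psnr(reference, edited):.2f} dB")
    print(f"MSE reduction for +9.955 dB: {mse_ratio_for_gain(9.955):.2f}x")
```

Because PSNR is logarithmic in the mean squared error, a gain of 9.955 dB corresponds to roughly a 9.9-fold reduction in MSE, assuming both methods are scored against the same reference images.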
Abstract
Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
