Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

Yijia Wang; Yiqing Shen; Weiming Chen; Zhihai He

arXiv:2510.27335·cs.CV·November 3, 2025

TL;DR

This paper introduces CIELR, a novel image editing method that converts complex instructions into explicit actions using LLM reasoning, avoiding costly joint fine-tuning of models, and achieves state-of-the-art results on a new benchmark.

Contribution

The paper proposes CIELR, a new approach that simplifies complex image editing by reasoning with LLMs and structured representations, eliminating the need for joint fine-tuning.

Findings

01  CIELR surpasses previous methods by 9.955 dB in PSNR.

02  The method effectively preserves image regions during editing.

03  A new benchmark, CIEBench, is introduced for reasoning-based image editing.
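
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE). It is not taken from the paper: the function name `psnr`, the flat-list image layout, and the 8-bit peak value of 255 are assumptions made for illustration, and the paper's exact evaluation protocol (for example, which pixels are compared) is not stated in this listing.

```python
import math

def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat sequences of pixel intensities (an assumed layout).
    max_value is the peak intensity; 255 assumes 8-bit images.
    """
    if len(reference) != len(edited) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of about 10 dB corresponds to roughly a tenfold reduction in MSE for the same peak value.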

Abstract

Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image…
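
The general idea in the abstract (build a structured description of the image, refine it iteratively, then turn an implicit instruction into explicit actions) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`SceneObject`, `EditAction`, `refine`, `plan_edits`, `toy_details`, `toy_reason`) is hypothetical, and the two stub functions stand in for the foundation models and the LLM, which the listing does not specify.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """One entry in a hypothetical structured scene representation."""
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class EditAction:
    """A simple, explicit editing action (hypothetical format)."""
    operation: str
    target: str

def refine(scene, ask_details, max_rounds=3):
    """Add attributes round by round until a round changes nothing."""
    for _ in range(max_rounds):
        changed = False
        for obj in scene:
            for key, value in ask_details(obj).items():
                if obj.attributes.get(key) != value:
                    obj.attributes[key] = value
                    changed = True
        if not changed:
            break
    return scene

def plan_edits(instruction, scene, reason):
    """Map an implicit instruction to explicit actions via a reasoner."""
    return [EditAction("remove", name) for name in reason(instruction, scene)]

# Stubs standing in for a vision model and an LLM (assumptions, not real APIs).
KNOWN = {"ice cream": {"melts": True}, "bench": {"melts": False}}

def toy_details(obj):
    return KNOWN.get(obj.name, {})

def toy_reason(instruction, scene):
    if "melt" in instruction:
        return [o.name for o in scene if o.attributes.get("melts")]
    return []

scene = refine([SceneObject("ice cream"), SceneObject("bench")], toy_details)
actions = plan_edits("Remove whatever would melt in the sun", scene, toy_reason)
```

In this toy run the instruction never names an object; the explicit target is recovered only because the refined representation records which object melts.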

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship