An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Zihan Liang; Jiahao Sun; Haoran Ma

arXiv:2508.17435·cs.CV·August 26, 2025

TL;DR

This paper presents RefineEdit-Agent, a training-free framework that combines LLMs and LVLMs for iterative, fine-grained image editing with context preservation and feedback-driven refinement. It outperforms existing methods on a new benchmark.

Contribution

Introduction of RefineEdit-Agent, a novel agent framework that integrates LLMs and LVLMs for complex iterative image editing without additional training.

Findings

01. RefineEdit-Agent achieves an average score of 3.67 on LongBench-T2I-Edit.

02. Outperforms state-of-the-art baselines in image editing tasks.

03. Validated through extensive experiments, ablations, and human evaluations.

Abstract

Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the planning capabilities of Large Language Models (LLMs) and the visual understanding and evaluation capabilities of Large Vision-Language Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a…
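
The closed-loop idea in the abstract (parse the instruction, plan, edit, evaluate, then feed the evaluation back into the next round) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `refine_loop`, the `Feedback` type, the `threshold` and `max_rounds` parameters, and all four stub components are hypothetical, and the "image" is just a list of strings so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    score: float    # hypothetical quality score returned by the evaluator
    critique: str   # hypothetical textual feedback passed back to the planner


def refine_loop(
    image: List[str],
    instruction: str,
    parse: Callable[[List[str], str], List[str]],
    plan: Callable[[List[str], str], List[str]],
    edit: Callable[[List[str], List[str]], List[str]],
    evaluate: Callable[[List[str], List[str]], Feedback],
    threshold: float = 4.0,   # assumed stopping score, not from the paper
    max_rounds: int = 3,      # assumed round budget, not from the paper
) -> Tuple[List[str], List[float]]:
    """Generic parse -> plan -> edit -> evaluate loop with feedback."""
    goals = parse(image, instruction)
    critique = ""
    history: List[float] = []
    for _ in range(max_rounds):
        steps = plan(goals, critique)      # planner sees the last critique
        image = edit(image, steps)         # apply the planned edits
        fb = evaluate(image, goals)        # score the result against the goals
        history.append(fb.score)
        if fb.score >= threshold:
            break
        critique = fb.critique
    return image, history


# Toy stand-ins for the model-backed components (all hypothetical).
def toy_parse(image, instruction):
    return [g.strip() for g in instruction.split(",")]


def toy_plan(goals, critique):
    pending = critique.split(", ") if critique else goals
    return pending[:1]                     # one edit per round


def toy_edit(image, steps):
    return image + steps                   # "editing" appends an attribute


def toy_evaluate(image, goals):
    missing = [g for g in goals if g not in image]
    score = 5.0 * (len(goals) - len(missing)) / len(goals)
    return Feedback(score=score, critique=", ".join(missing))


final, history = refine_loop(
    [], "red hat, blue sky", toy_parse, toy_plan, toy_edit, toy_evaluate
)
print(final, history)
```

In this toy run the first round satisfies only one of the two goals, so the evaluator's critique names the missing one and the second round completes it. In a real system each stub would be replaced by a model call, and the stopping rule would depend on how the evaluator scores edits.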
