An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing
Zihan Liang, Jiahao Sun, Haoran Ma

TL;DR
RefineEdit-Agent is a training-free agent framework that combines LLMs and LVLMs for iterative, fine-grained image editing. It pairs context-aware instruction understanding with evaluation feedback and outperforms existing methods on a new benchmark.
Contribution
Introduction of RefineEdit-Agent, a novel agent framework that integrates LLMs and LVLMs for complex iterative image editing without additional training.
Findings
RefineEdit-Agent achieves an average score of 3.67 on LongBench-T2I-Edit.
RefineEdit-Agent outperforms state-of-the-art baselines on image editing tasks.
The results are validated through extensive experiments, ablations, and human evaluations.
Abstract
Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the planning capabilities of Large Language Models (LLMs) and the visual understanding and evaluation capabilities of Large Vision-Language Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a…
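The closed-loop idea in the abstract (plan, edit, evaluate, then refine using feedback) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's implementation: the names `refine_loop`, `plan`, `apply_step`, and `evaluate`, the score threshold, and the round limit are all hypothetical, and the toy callables stand in for real LLM and LVLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for real image data; not the paper's representation.
Image = dict


@dataclass
class EditResult:
    image: Image
    rounds: int
    scores: List[float] = field(default_factory=list)


def refine_loop(
    image: Image,
    instruction: str,
    plan: Callable[[str, Image, str], List[str]],
    apply_step: Callable[[Image, str], Image],
    evaluate: Callable[[Image, str], Tuple[float, str]],
    threshold: float = 0.9,  # assumed stopping score
    max_rounds: int = 3,     # assumed iteration budget
) -> EditResult:
    """Plan, edit, and evaluate; repeat until the score clears the threshold."""
    feedback = ""
    scores: List[float] = []
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Planner role (an LLM in the abstract's description).
        for step in plan(instruction, image, feedback):
            image = apply_step(image, step)
        # Evaluator role (an LVLM in the abstract's description).
        score, feedback = evaluate(image, instruction)
        scores.append(score)
        if score >= threshold:
            break
    return EditResult(image=image, rounds=rounds, scores=scores)


# Toy callables: an "image" is a set of attributes, an instruction is a
# comma-separated list of attributes that should be present.
def _wanted(instruction: str) -> List[str]:
    return [w.strip() for w in instruction.split(",")]


def toy_plan(instruction: str, image: Image, feedback: str) -> List[str]:
    missing = [w for w in _wanted(instruction) if w not in image["attrs"]]
    return missing[:1]  # one step per round, to force iteration


def toy_apply(image: Image, step: str) -> Image:
    return {"attrs": image["attrs"] | {step}}


def toy_evaluate(image: Image, instruction: str) -> Tuple[float, str]:
    wanted = _wanted(instruction)
    missing = [w for w in wanted if w not in image["attrs"]]
    return 1 - len(missing) / len(wanted), "missing: " + ", ".join(missing)


result = refine_loop(
    {"attrs": set()}, "red hat, blue sky", toy_plan, toy_apply, toy_evaluate
)
```

With these toy callables the loop needs two rounds: the first edit satisfies half of the instruction, and the evaluator's feedback drives a second edit that completes it. In a real system each callable would wrap a model call, and the stopping criterion would depend on how the evaluator scores edits.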
