TL;DR
IMAGAgent is a multi-turn image editing agent framework built on a closed-loop "plan-execute-reflect" mechanism. It targets the error accumulation and semantic drift that arise in complex, multi-step editing tasks.
Contribution
The paper presents a constraint-aware planning module combined with tool scheduling and reflection-based adaptive correction, so that feedback from each editing step informs the next one in a single pipeline.
Findings
Outperforms existing methods in instruction consistency and editing precision.
Demonstrates significant improvements on MTEditBench and MagicBrush datasets.
Achieves higher overall image quality in multi-turn editing tasks.
Abstract
Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Lacking context-awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, ultimately resulting in severe structural distortion of the generated images. To address this, we propose IMAGAgent, a multi-turn image editing agent framework based on a "plan-execute-reflect" closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint-aware planning module that leverages a vision-language model (VLM) to precisely decompose complex natural language instructions into a series of executable sub-tasks, governed by target singularity, semantic atomicity, and visual perceptibility. Then, the…
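The closed loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the names `plan`, `run_closed_loop`, `toy_execute`, and `toy_reflect` are hypothetical, the planner is a plain string splitter standing in for the VLM, and images are represented as strings so the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EditState:
    image: str                                   # stand-in for real image data
    history: List[str] = field(default_factory=list)


def plan(instruction: str) -> List[str]:
    """Hypothetical planner: split a compound instruction into sub-tasks.

    The paper uses a VLM for this step; splitting on ',' and ' and ' is
    only an illustrative assumption.
    """
    parts = [p.strip() for chunk in instruction.split(",")
             for p in chunk.split(" and ")]
    return [p for p in parts if p]


def run_closed_loop(
    instruction: str,
    state: EditState,
    execute: Callable[[str, str, int], str],
    reflect: Callable[[str, str], bool],
    max_retries: int = 2,
) -> EditState:
    """Plan once, then execute and reflect on each sub-task in order."""
    for sub_task in plan(instruction):
        for attempt in range(max_retries + 1):
            candidate = execute(state.image, sub_task, attempt)
            if reflect(candidate, sub_task):
                # Accept the edit only after reflection approves it.
                state.image = candidate
                state.history.append(sub_task)
                break
        else:
            # All attempts rejected: keep the previous image unchanged.
            state.history.append(f"FAILED: {sub_task}")
    return state


def toy_execute(image: str, sub_task: str, attempt: int) -> str:
    """Toy editing tool: 'remove' edits fail on the first attempt."""
    if "remove" in sub_task and attempt == 0:
        return image
    return f"{image}+[{sub_task}]"


def toy_reflect(candidate: str, sub_task: str) -> bool:
    """Toy reflection check: the edit must be visible in the result."""
    return f"[{sub_task}]" in candidate


result = run_closed_loop(
    "add a red hat and remove the tree",
    EditState("img"),
    toy_execute,
    toy_reflect,
)
print(result.image)
print(result.history)
```

In this sketch a rejected edit never overwrites the accepted image, which is one simple way to keep an error in one turn from propagating into later turns.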
