SIDiffAgent: Self-Improving Diffusion Agent

Shivank Garg; Ayush Singh; Gaurav Kumar Nayak

arXiv:2602.02051 · cs.AI · February 3, 2026

Open Access · 3 Reviews

TL;DR

SIDiffAgent is a training-free, self-improving framework for text-to-image diffusion models that autonomously manages prompts, detects errors, and iteratively enhances output quality using stored past experiences.

Contribution

It introduces SIDiffAgent, a novel agentic framework leveraging Qwen models for autonomous prompt management and self-improvement without additional training.

Findings

01

Achieved an average VQA score of 0.884 on GenAI-Bench.

02

Significantly outperformed existing open-source and proprietary models.

03

Demonstrated effective artifact correction and prompt handling.

Abstract

Text-to-image diffusion models have revolutionized generative AI, enabling high-quality and photorealistic image synthesis. However, their practical deployment remains hindered by several limitations: sensitivity to prompt phrasing, ambiguity in semantic interpretation (e.g., "mouse" as an animal vs. a computer peripheral), artifacts such as distorted anatomy, and the need for carefully engineered input prompts. Existing methods often require additional training and offer limited controllability, restricting their adaptability in real-world applications. We introduce Self-Improving Diffusion Agent (SIDiffAgent), a training-free agentic framework that leverages the Qwen family of models (Qwen-VL, Qwen-Image, Qwen-Edit, Qwen-Embedding) to address these challenges. SIDiffAgent autonomously manages prompt engineering, detects and corrects poor generations, and performs fine-grained artifact…
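
The loop the abstract describes (refine the prompt, generate, detect problems, correct them, and store the experience for later runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Trajectory`, `Memory`, `run_agent`, the `threshold` and `max_rounds` parameters, and the `generate`/`evaluate`/`edit` callables are all hypothetical names, and word-overlap similarity stands in for whatever embedding-based retrieval the paper uses.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Trajectory:
    prompt: str        # refined prompt that was actually used
    issues: List[str]  # problems the evaluator reported during the run
    score: float       # final evaluator score


@dataclass
class Memory:
    items: List[Trajectory] = field(default_factory=list)

    def retrieve(self, prompt: str, k: int = 2) -> List[Trajectory]:
        # ASSUMPTION: Jaccard word overlap is a stand-in for embedding similarity.
        words = set(prompt.lower().split())

        def sim(t: Trajectory) -> float:
            other = set(t.prompt.lower().split())
            union = words | other
            return len(words & other) / len(union) if union else 0.0

        # Returns the k most similar stored trajectories (fewer if memory is small).
        return sorted(self.items, key=sim, reverse=True)[:k]


def run_agent(
    prompt: str,
    generate: Callable[[str], Any],
    evaluate: Callable[[Any, str], Tuple[float, List[str]]],
    edit: Callable[[Any, List[str]], Any],
    memory: Memory,
    threshold: float = 0.8,  # hypothetical acceptance score
    max_rounds: int = 3,     # hypothetical cap on correction rounds
) -> Tuple[Any, float]:
    # 1. Refine the prompt with issues recorded in similar past runs.
    hints = sorted({i for t in memory.retrieve(prompt) for i in t.issues})
    refined = f"{prompt} (avoid: {', '.join(hints)})" if hints else prompt

    # 2. Generate, then evaluate against the user's original prompt.
    image = generate(refined)
    score, issues = evaluate(image, prompt)
    seen = list(issues)

    # 3. Correct reported issues until the score is acceptable or rounds run out.
    rounds = 0
    while score < threshold and issues and rounds < max_rounds:
        image = edit(image, issues)
        score, issues = evaluate(image, prompt)
        seen.extend(i for i in issues if i not in seen)
        rounds += 1

    # 4. Store the experience so later runs can avoid the same problems.
    memory.items.append(Trajectory(refined, seen, score))
    return image, score
```

In this sketch no model weights change between runs; any improvement comes only from the growing `Memory`, which is one way to read the "training-free, self-improving" description.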

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- SIDiffAgent improves iteratively at inference via trajectory memory. The design extends prior work such as T2I-Copilot by introducing a multi-layer agent hierarchy and adaptive negative prompt generation.
- The empirical performance gains of SIDiffAgent over open-source and proprietary baselines seem promising.
- Implementation details (such as the sub-agent prompts, algorithms, and hyperparameters) are provided in the appendix.

Weaknesses

- Marginal algorithmic novelty: The paper's contribution lies primarily in system integration rather than in a fundamentally new self-adapting optimization algorithm for diffusion models.
- System complexity and reproducibility: The multi-agent framework involves many interdependent agents and prompts. The training-free claim is valid, but inference across multiple agent calls can be computationally heavy and difficult to replicate.
- Evaluation: The claiming of perceptual ali

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is clear and easy to follow.
2. The plug-and-play multi-agent system substantially improves performance on GenAI-Bench and DrawBench.
3. The idea of introducing databases for image correction is interesting.

Weaknesses

1. Novelty and Baselines: While the paper is technically sound and shows a performance gain, plug-and-play pipelines for diffusion generation are not a new idea. For instance, [1, 2] prove that LLM object planning can already fix negative prompts and substantially improve prompting in a multi-round fashion. [3] then extends this idea with VLM modules, which is already close to the paper's agentic setting. I believe these papers are worth discussing and could even serve as baselines in Table 1. The

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. **Originality of the Proposed Memory System:** The introduction of a Theory-of-Mind-inspired self-improving memory system is novel and represents a direction that has rarely been explored in diffusion models.
2. **Intuitive and Effective Framework:** The proposed generate → evaluate → edit paradigm is intuitive, simple, and empirically effective, which I find to be one of the most convincing aspects of this work.

Weaknesses

1. **Limited Novelty:** The proposed local editing after image generation is intuitively effective but lacks clear academic novelty or theoretical depth. Its practical applicability is also limited, since Qwen-Image/Edit itself is computationally expensive, making the overall framework inefficient for real-world deployment.
2. **Insufficient Ablation Studies:** The paper lacks key ablations to isolate the contribution of each major component, particularly the memory module and local editing mecha

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning