SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback

Xiaoxuan He; Siming Fu; Wanli Li; Zhiyuan Li; Dacheng Yin; Kang Rong; Fengyun Rao; Bo Zhang

arXiv:2602.05380·cs.CV·February 12, 2026

Open Access · 3 Reviews

TL;DR

SAIL is a novel framework enabling diffusion models to self-improve and align with human preferences using minimal feedback, eliminating the need for large datasets or reward models, and outperforming existing methods.

Contribution

The paper introduces SAIL, a self-iterative learning approach allowing diffusion models to self-align with human preferences using minimal data without external reward models.

Findings

01

SAIL outperforms state-of-the-art methods with only 6% of the preference data.

02

Diffusion models can self-annotate and improve without large-scale human-labeled datasets.

03

The ranked preference mixup strategy enhances robust self-improvement.

Abstract

Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and…
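The self-annotation step mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the page does not reproduce the reward equations, so the sketch assumes a Diffusion-DPO-style implicit reward (how much better the current model denoises a sample than a frozen reference does). The function names (`implicit_reward`, `self_annotate`), the toy noising schedule, and the stand-in denoisers are all hypothetical.

```python
import numpy as np

def implicit_reward(x0, eps_model, eps_ref, beta=1.0, n_draws=64, seed=0):
    """Monte Carlo estimate of an assumed Diffusion-DPO-style implicit reward.

    reward ~ -beta * E[ ||eps - eps_model(x_t, t)||^2 - ||eps - eps_ref(x_t, t)||^2 ]
    A higher value means the current model explains x0 better than the reference.
    """
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_draws):
        t = rng.uniform(0.05, 0.95)                 # toy continuous noise level
        eps = rng.standard_normal(x0.shape)         # forward-process noise
        x_t = np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps  # toy variance-preserving noising
        err_model = np.sum((eps - eps_model(x_t, t)) ** 2)
        err_ref = np.sum((eps - eps_ref(x_t, t)) ** 2)
        gaps.append(err_model - err_ref)
    return -beta * float(np.mean(gaps))

def self_annotate(candidates, eps_model, eps_ref):
    """Rank candidates by implicit reward and return (winner_idx, loser_idx)."""
    rewards = [implicit_reward(x, eps_model, eps_ref) for x in candidates]
    order = np.argsort(rewards)                     # ascending by reward
    return int(order[-1]), int(order[0])

# Hypothetical stand-ins for real noise-prediction networks.
target = np.zeros(4)                                # the "preferred" sample in this toy

def eps_model(x_t, t):
    # Pretends the clean sample is `target` and solves for the implied noise.
    return (x_t - np.sqrt(1.0 - t) * target) / np.sqrt(t)

def eps_ref(x_t, t):
    # An uninformative reference that always predicts zero noise.
    return np.zeros_like(x_t)

candidates = [target + 10.0, target.copy()]         # far from / equal to the preferred sample
winner, loser = self_annotate(candidates, eps_model, eps_ref)
```

In this toy setup the candidate equal to `target` receives the higher reward, so it is labeled as the winner of the self-annotated pair. In a full loop, such pairs would be combined with the human-annotated seed pairs before the next training round; how the paper ranks and mixes them is not described on this page.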

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper proposes a self-improving framework to align diffusion models with human preferences without large-scale annotated datasets, which is novel.
2. Thorough empirical evaluation results show that the proposed method is effective, outperforming existing alignment methods using only 6% of the annotations.

Weaknesses

1. Lack of Theoretical Guarantees. While the reward formulation (Eq. 8–9) is mathematically correct in the DiffusionDPO framework, the paper does not provide a theoretical analysis of what distribution SAIL converges to. What is the target distribution of this method? Is it the same as DPO? If so, what explains the performance with fewer annotations? It is unclear where the observed gains come from. Does the self-reward metric align with true human preference distributions, and if so, why? A thorough

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Using self-rewarding to rank online data is relatively new in text-to-image generation.
- The proposed method works well with limited data.

Weaknesses

- Implicit reward is adopted from previous work in LLM.
- The mixup of online and initial preference data is straightforward.
- Some generated images seem to have a color saturation problem.
- There are a lot of problems in writing, e.g.
  - Line 144-145: grammar error
  - Line 145: some -> Some
  - Line 344: Pic-a-Pic -> Pick-a-Pic
  - Line 357: use -> uses
  - Line 362: bringing challenge -> brings challenges
  - Line 373: fix reference
  - Line 374-375: Thus, …, so …
  - Line 454: reveals ->

Reviewer 03 · Rating 8 · Confidence 2

Strengths

1. The idea of fully exploiting the base model's potential is innovative. By eliciting this potential through the proposed iterative self-improvement method SAIL, the authors achieve comparable or better preference performance using only 6% of the human preference data.
2. The consistent improvement across multiple iterations further demonstrates the effectiveness of the proposed iterative self-improvement paradigm.

Weaknesses

1. The tables lack multiple trials and confidence intervals, which are necessary to demonstrate the statistical significance of performance improvements and validate the effectiveness of the algorithm design in ablation studies.
2. It would be valuable for the authors to include results over a larger range of iterations to illustrate the performance trajectory and reveal how the improvement trend evolves as the number of iterations increases.

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Recommender Systems and Techniques · Multimodal Machine Learning Applications