Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance
Zhou Jiang, Yandong Wen, Zhen Liu

TL;DR
This paper recasts preference alignment for diffusion models as classifier-free guidance: a preference-finetuned model steers sampling at inference time, improving control and generalization without retraining the base model.
Contribution
It proposes a contrastive guidance method that decouples preference learning into positive and negative modules, enhancing alignment and generalization in diffusion models.
Findings
Consistent quantitative improvements on multiple datasets.
Qualitative results show sharper, more aligned images.
Method does not require retraining the base diffusion model.
Abstract
Aligning large-scale text-to-image diffusion models with nuanced human preferences remains challenging. While direct preference optimization (DPO) is simple and effective, large-scale finetuning often shows a generalization gap. We take inspiration from test-time guidance and cast preference alignment as classifier-free guidance (CFG): a finetuned preference model acts as an external control signal during sampling. Building on this view, we propose a simple method that improves alignment without retraining the base model. To further enhance generalization, we decouple preference learning into two modules trained on positive and negative data, respectively, and form a contrastive guidance vector at inference by subtracting their predictions (positive minus negative), scaled by a user-chosen strength and added to the base prediction at each step. This yields a sharper and…
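The guidance rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `pgd_prediction` and `cpgd_prediction`, the weight name `w`, and the toy array shapes are all assumptions introduced here. The contrastive form follows the abstract's wording (positive minus negative, scaled, added to the base prediction); the single-model form is assumed by analogy with standard CFG, since the paper's exact equation is not reproduced on this page.

```python
import numpy as np


def pgd_prediction(eps_base, eps_pref, w):
    """Hypothetical single-model guidance, assumed by analogy with CFG:
    move the base prediction toward the preference-finetuned prediction."""
    return eps_base + w * (eps_pref - eps_base)


def cpgd_prediction(eps_base, eps_pos, eps_neg, w):
    """Contrastive guidance as described in the abstract:
    (positive minus negative), scaled by w, added to the base prediction."""
    return eps_base + w * (eps_pos - eps_neg)


# Toy stand-ins for the per-step noise predictions of three networks.
# Real predictions would come from the base, positive, and negative
# models evaluated on the same noisy latent and prompt.
rng = np.random.default_rng(0)
shape = (1, 4, 8, 8)  # assumed latent shape, for illustration only
eps_base = rng.standard_normal(shape)
eps_pos = rng.standard_normal(shape)
eps_neg = rng.standard_normal(shape)

guided = cpgd_prediction(eps_base, eps_pos, eps_neg, w=2.0)
```

Under these assumptions, `w = 0` recovers the base model in both variants, and the contrastive variant also reduces to the base prediction whenever the positive and negative modules agree, so only their disagreement steers sampling.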
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The idea of reinterpreting preference alignment in diffusion models through CFG is interesting. While CFG is typically used for conditioning on class labels or prompts, applying it to preference signals is both novel and practically useful. This reframing also enables plug-and-play alignment. 2. Presentation is clear, well-structured, and easy to follow. The paper does a good job of walking the reader through the standard diffusion and DPO setups before introducing PGD and cPGD. Figures are
1. Even though the proposed PGD/cPGD are novel reinterpretations of preference alignment, the intuition is not well-explained. The paper lacks a formal verification of these methods. For example, what is the target distribution/posterior distribution that PGD/cPGD is trying to sample from? Can the authors justify that this interpolated guidance signal (between base and finetuned models) actually approximates the posterior over preferred samples? 2. The intuition behind cPGD is also underspecifi
1. This paper proposes two inference-time preference alignment methods inspired by the analogy to Classifier-Free Guidance (CFG), both conceptually simple yet effective. 2. PGD is a practical approach that can reuse existing pretrained and preference-finetuned weights. 3. The experimental results are extensive — across two prompt sets and multiple evaluation metrics, the effectiveness of the proposed methods is consistently demonstrated.
1. The proposed PGD formulation (Eq. 9) is merely defined by analogy to CFG, and the equation itself lacks theoretical justification. 2. The reason why overfitting is mitigated in cPGD is not experimentally verified (although a theoretical explanation is mentioned in Section 4.2). 3. Manual tuning of the guidance weight is required — it must be adjusted for each evaluation metric or dataset.
1. The method is conceptually clear and intuitive — explicitly separating the contributions of positive and negative distributions provides a simple yet effective perspective on alignment. 2. The approach is easy to implement and has immediate practical value for production-level diffusion models. 3. The reported results are promising and demonstrate the potential of this straightforward modification.
1. Relying on two separate models is a major limitation. It is easy to imagine adding an additional conditioning signal to the diffusion model and fine-tuning a single model instead, which would avoid the extra inference and memory overhead introduced by maintaining two networks. 2. Building on the previous point, the paper feels somewhat incomplete. It lacks deeper analysis or discussion of why the proposed method works. For example, does DPO actually fail to reduce the winner’s loss or increase the lose
The method is straightforward and well-motivated. The presentation is clear. The experiments are comprehensive in terms of the metrics covered, base models used, and baselines compared. The authors also provide several ablation analyses on design choices such as weighting.
There are three main concerns in terms of contribution and technical correctness. First, the authors assume that a preference dataset yields disjoint sets of positive and negative samples, which is not true. For example, in the Pick-a-Pic dataset, there are multiple images generated per prompt (say A, B, C, D), and there are preferences A > B and B > C. In such a case, the positive and negative samples overlap with each other. This is fine for the standard DPO method because its loss contrast **
Taxonomy
Topics: Multimodal Machine Learning Applications · Recommender Systems and Techniques · Generative Adversarial Networks and Image Synthesis
