Test-time Alignment of Diffusion Models without Reward Over-optimization

Sunwoo Kim; Minkyu Kim; Dongmin Park

arXiv:2501.05803·cs.LG·April 18, 2025

Open Access·1 Repo·1 Video·3 Reviews

TL;DR

This paper introduces a training-free, test-time method using Sequential Monte Carlo to align diffusion models with specific objectives, avoiding reward over-optimization and maintaining diversity.

Contribution

It presents a novel test-time alignment technique for diffusion models that does not require fine-tuning and effectively handles multiple objectives and online optimization.

Findings

01 Achieves comparable or better rewards than fine-tuning methods.

02 Preserves diversity and cross-reward generalization.

03 Effective in single, multi-objective, and online settings.

Abstract

Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free, test-time method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream…
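
To make the idea of SMC with tempering concrete, here is a minimal toy sketch in pure Python. It is a generic tempered SMC sampler on a one-dimensional problem, not the paper's diffusion-specific method. Everything in it is an assumption made for illustration: the standard normal "pre-trained" prior, the quadratic `reward`, the linear tempering schedule, the random-walk Metropolis move, and the target form p(x)·exp(r(x)/alpha), which is the common KL-regularized form and is not spelled out in the abstract above.

```python
import math
import random


def log_prior(x):
    # Stand-in for the pre-trained model: unnormalized standard normal log-density.
    return -0.5 * x * x


def reward(x, peak=3.0):
    # Hypothetical toy reward, highest at x = peak.
    return -0.5 * (x - peak) ** 2


def log_tempered(x, lam, alpha):
    # Intermediate target: prior(x) * exp(lam * reward(x) / alpha), with lam in [0, 1].
    return log_prior(x) + lam * reward(x) / alpha


def tempered_smc(n_particles=2000, n_steps=10, alpha=0.5,
                 mh_moves=2, step=0.5, seed=0):
    rng = random.Random(seed)
    # Start from the prior (lam = 0).
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    lam_prev = 0.0
    for k in range(1, n_steps + 1):
        lam = k / n_steps  # linear schedule from 0 to 1 (an assumption)
        # Incremental importance weights between consecutive tempered targets.
        logw = [(lam - lam_prev) * reward(x) / alpha for x in xs]
        shift = max(logw)  # subtract the max for numerical stability
        weights = [math.exp(lw - shift) for lw in logw]
        # Multinomial resampling; weights are uniform again afterwards.
        xs = rng.choices(xs, weights=weights, k=n_particles)
        # Random-walk Metropolis moves that leave the current target invariant.
        for _ in range(mh_moves):
            for i, x in enumerate(xs):
                prop = x + rng.gauss(0.0, step)
                log_u = math.log(1.0 - rng.random())  # argument lies in (0, 1]
                if log_u < log_tempered(prop, lam, alpha) - log_tempered(x, lam, alpha):
                    xs[i] = prop
        lam_prev = lam
    return xs


if __name__ == "__main__":
    samples = tempered_smc()
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(f"mean={mean:.3f} var={var:.3f}")
```

For this toy setup the final target is Gaussian with mean peak/(1+alpha) = 2 and variance alpha/(1+alpha) = 1/3, so the particle statistics can be checked against a closed form. Raising the tempering exponent gradually keeps the incremental weights from collapsing onto a few particles, which is the usual motivation for tempering in SMC.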

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01·Rating 5·Confidence 4

Strengths

1. DAS does not require additional training, which reduces computational cost. 2. The use of SMC with tempering is justified through asymptotic properties. 3. DAS balances reward optimization and diversity, and is demonstrated across single-reward, multi-objective, and online settings.

Weaknesses

1. While DAS is compared with fine-tuning and guidance methods, comparisons to baselines like STEGANODE or controlled diffusion could have strengthened the evaluation. 2. DAS assumes differentiable reward functions, which may limit applicability in scenarios involving non-differentiable objectives. 3. Most experiments use Stable Diffusion v1.5, and additional models would have enhanced the generality of the findings. 4. The paper could cover more image tasks. Currently it emphasizes findings on aesth

Reviewer 02·Rating 8·Confidence 3

Strengths

1. This paper is well-written overall and the motivation is clear. It aims to address the trade-off between aligning diffusion models with specific objectives and maintaining their versatility, which is a critical problem in generative modeling. 2. DAS’s effectiveness is comprehensively validated across diverse scenarios, including toy distribution simulation, single-reward, multi-objective, and online black-box optimization tasks.

Weaknesses

1. A more intuitive explanation of SMC should be added between the motivation and the method. The introduction of SMC in the supplementary material is abstruse, which leaves the advantage of adopting SMC to address the training problem unclear. 2. The choice of hyperparameters such as $\gamma$, $\alpha$, and the number of particles should be discussed across different scenarios.

Reviewer 03·Rating 8·Confidence 3

Strengths

- The introduction provides a clear overview of the problem. - The proposed method appears promising and might be innovative (see Question 5.)

Weaknesses

- The choice of finetuning-based RLHF baselines may not be appropriate (see Question 1). - The paper is sometimes hard to follow due to the delayed definition of new notations. For instance, the symbol $\gamma$ is used on line 208 but is not defined until line 250. - The evaluation metrics used in the paper (line 355 and onward) are not explained, making it difficult to assess their relevance and meaning.

Code & Models

Repositories

krafton-ai/das·PyTorch·Official

Videos

Test-time Alignment of Diffusion Models without Reward Over-optimization·SlidesLive

Taxonomy

Topics·Model Reduction and Neural Networks

Methods·Diffusion