TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Jiaming Li, Chenyu Zhu, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou, Zhiyuan Ma

TL;DR
TMPO introduces a trajectory-level reward distribution matching approach with a Softmax-TB objective, enhancing diversity and efficiency in diffusion model alignment tasks compared to reward maximization methods.
Contribution
The paper proposes TMPO, a novel trajectory matching policy optimization method that improves diversity and reduces reward hacking in diffusion model alignment.
Findings
TMPO improves generative diversity by 9.1% over state-of-the-art methods.
TMPO achieves a better trade-off between reward and diversity.
Dynamic Stochastic Tree Sampling reduces training time while maintaining performance.
Abstract
Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most existing RL methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, causing the policy to concentrate on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this…
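To make the idea of matching K trajectory probabilities to a reward-induced Boltzmann distribution concrete, here is a minimal sketch. It is not the paper's Softmax-TB objective, whose exact form is not given in this summary. It assumes, purely for illustration, that the policy's trajectory log-probabilities are normalized over the K samples with a softmax and compared with the target softmax(r / beta) through a KL divergence. The function name `boltzmann_matching_loss` and the temperature `beta` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_matching_loss(traj_log_probs, rewards, beta=1.0):
    """KL(target || policy) over K sampled trajectories.

    Assumption (not from the paper): the target is softmax(rewards / beta)
    and the policy distribution is softmax(traj_log_probs), both taken
    over the same K trajectories.
    """
    target = softmax([r / beta for r in rewards])
    policy = softmax(traj_log_probs)
    return sum(
        t * (math.log(t) - math.log(p))
        for t, p in zip(target, policy)
        if t > 0.0
    )


rewards = [1.0, 2.0, 3.0]
matched = boltzmann_matching_loss([-3.0, -2.0, -1.0], rewards)    # log-probs track rewards
uniform = boltzmann_matching_loss([0.0, 0.0, 0.0], rewards)       # ignores rewards
collapsed = boltzmann_matching_loss([0.0, -20.0, -20.0], rewards)  # mass on one path
```

Under these assumptions the loss is near zero only when the relative log-probabilities of the K trajectories track their rewards, and a policy that puts almost all of its mass on a single trajectory is penalized more heavily than a uniform one. A pure reward-maximization objective has no such term, which is the contrast the abstract draws.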
