Multi-dimensional Preference Alignment by Conditioning Reward Itself
Jiho Jang, Jinyoung Kim, Kyungjune Baek, Nojun Kwak

TL;DR
This paper introduces MCDPO (Multi Reward Conditional DPO), a preference-alignment method for diffusion models that disentangles multiple reward axes during training, so that each axis can be optimized independently and controlled dynamically at inference time.
Contribution
The paper proposes Multi Reward Conditional DPO (MCDPO), which addresses reward conflicts by conditioning the model on a per-axis preference outcome vector, and introduces reward dropout for balanced multi-dimensional optimization.
Findings
MCDPO outperforms existing methods on Stable Diffusion benchmarks.
The conditional framework allows dynamic, multi-axis control during inference.
MCDPO preserves features that are desirable on one reward axis, even when they appear in a globally non-preferred sample, while still optimizing the other reward dimensions.
Abstract
Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further…
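The reward conflict described in the abstract can be illustrated with a toy example. The sketch below is not the paper's objective: the function names, the two-axis reward values, and the use of a plain sum as the aggregation rule are all assumptions made for illustration. It only shows how a single aggregated label can disagree with the per-axis outcomes, and how a per-axis preference outcome vector keeps that information.

```python
import math


def bt_probability(reward_a: float, reward_b: float) -> float:
    """Standard Bradley-Terry probability that sample A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def aggregated_winner(rewards_a, rewards_b) -> str:
    """Global label from a scalar aggregate (a plain sum; an assumption here)."""
    return "A" if sum(rewards_a) >= sum(rewards_b) else "B"


def preference_outcome_vector(rewards_a, rewards_b):
    """Per-axis outcome: +1 if A wins the axis, -1 if it loses, 0 on a tie."""
    return [(ra > rb) - (ra < rb) for ra, rb in zip(rewards_a, rewards_b)]


# Hypothetical rewards on two axes: (aesthetic quality, semantic alignment).
sample_a = [0.9, 0.2]
sample_b = [0.3, 0.7]

# Aggregated view: A is globally preferred, so B's good alignment is
# treated as something to move away from.
print(aggregated_winner(sample_a, sample_b))

# Disentangled view: A wins on aesthetics but loses on alignment.
print(preference_outcome_vector(sample_a, sample_b))

# Bradley-Terry applied per axis instead of to the aggregate.
for axis, (ra, rb) in zip(["aesthetic", "alignment"], zip(sample_a, sample_b)):
    print(axis, round(bt_probability(ra, rb), 3))
```

In this toy pair the aggregated label marks A as the winner, while the outcome vector records that B is the better sample on the alignment axis. Passing such a vector to the model as a condition is the kind of signal the abstract describes; how MCDPO consumes it inside the training loss is not specified in the text above.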
Taxonomy
Topics: Reinforcement Learning in Robotics · Emotion and Mood Recognition · Recommender Systems and Techniques
