Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
Zhikai Li, Yue Zhao, Edward Zhongwei Zhang, Xuewen Liu, Jing Zhang, Qingyi Gu, Zhen Dong

TL;DR
ArenaPO uses Arena scores as offline rewards for fine-grained preference optimization of diffusion models, improving efficiency and performance without requiring a reward model.
Contribution
The paper proposes ArenaPO, which uses Arena scores as offline rewards to enable efficient, fine-grained preference optimization of diffusion models without a reward model.
Findings
ArenaPO outperforms existing baselines on Pick-a-Pic v2 and HPD v3 datasets.
It achieves fine-grained optimization without additional training overhead.
The method combines the rich rewards of traditional RLHF with the efficiency of DPO.
Abstract
Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise…
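The abstract describes an Arena in which each model's capability is a Gaussian distribution inferred from annotated pairwise data. The sketch below illustrates that general idea with a standard TrueSkill-style update for a win/loss outcome between two Gaussian ratings. It is an assumption-laden illustration, not the authors' procedure: the names `Rating`, `update_pair`, and `fit_arena`, the prior values, and the choice of a TrueSkill-style update without draws are all hypothetical, since the abstract is truncated before the actual inference procedure is given.

```python
import math
from dataclasses import dataclass


@dataclass
class Rating:
    """Gaussian belief over one model's capability (hypothetical container)."""
    mu: float
    sigma: float


def _pdf(x: float) -> float:
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    # Standard normal cumulative distribution.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update_pair(winner: Rating, loser: Rating, beta: float = 25.0 / 6.0):
    """TrueSkill-style update for one comparison with no draws (assumed rule)."""
    c = math.sqrt(2.0 * beta**2 + winner.sigma**2 + loser.sigma**2)
    t = (winner.mu - loser.mu) / c
    # Mean and variance correction factors of the truncated Gaussian.
    v = _pdf(t) / max(_cdf(t), 1e-12)   # floor avoids division by zero
    w = min(max(v * (v + t), 0.0), 1.0)  # clamp keeps variances positive

    new_winner = Rating(
        mu=winner.mu + winner.sigma**2 / c * v,
        sigma=math.sqrt(winner.sigma**2 * (1.0 - winner.sigma**2 / c**2 * w)),
    )
    new_loser = Rating(
        mu=loser.mu - loser.sigma**2 / c * v,
        sigma=math.sqrt(loser.sigma**2 * (1.0 - loser.sigma**2 / c**2 * w)),
    )
    return new_winner, new_loser


def fit_arena(pairs, mu0: float = 25.0, sigma0: float = 25.0 / 3.0):
    """Traverse (winner, loser) name pairs once, updating ratings in order."""
    ratings = {}
    for win_name, lose_name in pairs:
        w = ratings.setdefault(win_name, Rating(mu0, sigma0))
        l = ratings.setdefault(lose_name, Rating(mu0, sigma0))
        ratings[win_name], ratings[lose_name] = update_pair(w, l)
    return ratings
```

In this sketch, each comparison shifts the winner's mean up and the loser's mean down, and shrinks both standard deviations. How ArenaPO turns such inferred capabilities into per-sample offline rewards is not stated in the visible part of the abstract, so that step is left out here.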
