Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

Zhikai Li; Yue Zhao; Edward Zhongwei Zhang; Xuewen Liu; Jing Zhang; Qingyi Gu; Zhen Dong

arXiv:2605.06070·cs.CV·May 8, 2026

TL;DR

ArenaPO uses Arena scores as offline rewards for fine-grained preference optimization of diffusion models, keeping the efficiency of DPO without requiring a reward model.

Contribution

The paper proposes ArenaPO, which leverages Arena scores as offline rewards to give refined feedback beyond binary chosen-rejected pairs, without a reward model.

Findings

01. ArenaPO outperforms existing baselines on the Pick-a-Pic v2 and HPD v3 datasets.

02. It achieves fine-grained optimization without additional training overhead.

03. The method combines the rich rewards of traditional RLHF with the efficiency of DPO.

Abstract

Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise…
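
The abstract is cut off before it states how the Gaussian capabilities are inferred, so the following is only a rough illustration of the general idea, not the paper's procedure. It assumes a TrueSkill-style two-player update without draws, applied in a single pass over (winner, loser) annotations. The names `Rating`, `update`, `infer_ratings`, and `BETA`, as well as the default prior values, are hypothetical choices made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Rating:
    """Gaussian belief over one model's capability (assumed prior values)."""
    mu: float = 25.0
    sigma: float = 25.0 / 3.0


# Assumed performance-noise scale; not taken from the paper.
BETA = 25.0 / 6.0


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update(winner: Rating, loser: Rating, beta: float = BETA):
    """One TrueSkill-style update for a single win/loss outcome (no draws)."""
    c = math.sqrt(2.0 * beta ** 2 + winner.sigma ** 2 + loser.sigma ** 2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # size of the mean shift; larger for upsets
    w = v * (v + t)         # variance shrink factor, always in (0, 1)
    new_winner = Rating(
        winner.mu + winner.sigma ** 2 / c * v,
        winner.sigma * math.sqrt(1.0 - winner.sigma ** 2 / c ** 2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma ** 2 / c * v,
        loser.sigma * math.sqrt(1.0 - loser.sigma ** 2 / c ** 2 * w),
    )
    return new_winner, new_loser


def infer_ratings(pairs):
    """Traverse (winner, loser) annotations once and return name -> Rating."""
    ratings = {}
    for win, lose in pairs:
        rw = ratings.get(win, Rating())
        rl = ratings.get(lose, Rating())
        ratings[win], ratings[lose] = update(rw, rl, BETA)
    return ratings


if __name__ == "__main__":
    # Hypothetical annotations: model_a beats model_b, model_b beats model_c.
    pairs = [("model_a", "model_b")] * 3 + [("model_b", "model_c")] * 3
    for name, r in sorted(infer_ratings(pairs).items()):
        print(f"{name}: mu={r.mu:.2f}, sigma={r.sigma:.2f}")
```

In a scheme like this, each annotated comparison shifts the two means apart and narrows both uncertainties, so the resulting means and variances carry graded information that a single binary chosen-rejected label does not.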
