Embedding-perturbed Exploration Preference Optimization for Flow Models
Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

TL;DR
This paper introduces E^2PO, a novel reinforcement learning framework that maintains sample diversity through embedding perturbations, leading to more stable training and better alignment with human preferences.
Contribution
E^2PO is the first method to use embedding-level perturbations to sustain variance and improve stability in group-based optimization for generative models.
Findings
E^2PO outperforms existing methods in aligning with human preferences.
Embedding perturbations prevent variance decay and stabilize training.
Extensive experiments demonstrate significant improvements.
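The summary above does not say how the embedding perturbation is constructed. As a purely illustrative sketch, assume isotropic Gaussian noise is added to a shared conditioning embedding so that each member of a group is generated from a slightly different condition. The function names, the noise model, and the `sigma` parameter are all hypothetical and are not taken from the paper.

```python
import random

def perturb_embedding(embedding, sigma, rng):
    """Return a copy of `embedding` with Gaussian noise added to each entry.

    Assumption: isotropic Gaussian noise; the paper's actual perturbation
    scheme is not described in this summary.
    """
    return [x + rng.gauss(0.0, sigma) for x in embedding]

def make_group(embedding, group_size, sigma, seed=0):
    """Build one group of perturbed embeddings from a single shared embedding."""
    rng = random.Random(seed)
    return [perturb_embedding(embedding, sigma, rng) for _ in range(group_size)]

# Hypothetical 4-dimensional conditioning embedding for a single prompt.
prompt_embedding = [0.1, -0.3, 0.7, 0.0]
group = make_group(prompt_embedding, group_size=4, sigma=0.05)
```

With `sigma=0` every group member receives the same embedding, which corresponds to the unperturbed baseline; a larger `sigma` spreads the conditions further apart.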
Abstract
Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose E^2PO, a novel framework that sustains optimization through…
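The variance-decay problem described in the abstract can be illustrated with a minimal sketch of group-relative advantages, assuming the common GRPO-style form in which each reward is standardized against the mean and standard deviation of its own group. The function name, the `eps` stabilizer, and the reward values are illustrative assumptions, not details from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its group (GRPO-style assumption)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# A group whose samples still differ: advantages carry a usable signal.
diverse = [0.2, 0.5, 0.8, 0.9]
# A collapsed group: every sample earns the same reward.
collapsed = [0.7, 0.7, 0.7, 0.7]

adv_diverse = group_relative_advantages(diverse)
adv_collapsed = group_relative_advantages(collapsed)
```

In the collapsed group every advantage is zero, so that group contributes no gradient to the policy update. This is the loss of learning signal the abstract refers to.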
