TL;DR
This paper shows that policy entropy regularization fails to preserve diversity in flow-based RLHF, introduces perceptual entropy as an alternative, and reports improved diversity and quality in experiments.
Contribution
The paper proposes perceptual entropy to better measure and preserve diversity in flow models, overcoming the limitations of policy entropy regularization.
Findings
Perceptual entropy maintains diversity where policy entropy fails.
Perceptual entropy-based strategies improve quality-diversity trade-off.
Experiments show significant gains in diversity and quality metrics.
Abstract
RLHF is widely used to align flow-matching text-to-image models with human preferences, but it often leads to severe diversity collapse after fine-tuning. In RL, diversity is commonly assumed to correlate with policy entropy, which motivates entropy regularization. However, we show that this intuition breaks down in flow models: policy entropy remains constant even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To address this, we introduce perceptual entropy, which captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight,…
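The mismatch described in the abstract can be illustrated with a toy sketch. It assumes each sampling step is an isotropic Gaussian whose noise scale comes from a fixed schedule, so the per-step entropy depends only on that scale and never on the learned mean. The schedule `SIGMAS`, the `contraction` parameter, and the Gaussian log-determinant estimator in `feature_entropy` are all hypothetical choices made for illustration. They are not the paper's definition of perceptual entropy, and the raw samples stand in for the embeddings a perceptual encoder would produce.

```python
import numpy as np

SIGMAS = np.array([0.5, 0.3, 0.1])  # hypothetical fixed per-step noise scales
DIM = 4                             # toy sample dimension


def step_entropy(sigma, dim=DIM):
    """Differential entropy of N(mu, sigma^2 I); note it does not depend on mu."""
    return 0.5 * dim * np.log(2.0 * np.pi * np.e * sigma ** 2)


def policy_entropy(sigmas=SIGMAS):
    """Sum of per-step entropies; fixed once the noise schedule is fixed."""
    return float(sum(step_entropy(s) for s in sigmas))


def rollout(contraction, n=2000, seed=0):
    """Toy sampler: the 'learned' mean is contraction * x (an assumption).

    contraction < 1 pulls every sample toward a single mode, mimicking collapse.
    """
    rng = np.random.default_rng(seed)
    x = 3.0 * rng.standard_normal((n, DIM))  # spread-out initial noise
    for s in SIGMAS:
        x = contraction * x + s * rng.standard_normal((n, DIM))
    return x


def feature_entropy(features, eps=1e-6):
    """Entropy of a Gaussian fitted to the feature rows (a stand-in diversity proxy)."""
    d = features.shape[1]
    cov = np.cov(features, rowvar=False) + eps * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return float(0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet))


diverse = rollout(contraction=1.0)
collapsed = rollout(contraction=0.2)

# The policy entropy is identical for both samplers because it ignores the mean.
print("policy entropy (both):", policy_entropy())
print("feature entropy, diverse:", feature_entropy(diverse))
print("feature entropy, collapsed:", feature_entropy(collapsed))
```

In this sketch `policy_entropy` takes no argument related to the mean function, so no change to the learned mean can move it, while the feature-space estimate drops for the contracted sampler. A regularizer built on the first quantity therefore gives no signal about the collapse that the second one detects.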
