TL;DR
This paper introduces Sol-RL, an FP4-accelerated reinforcement learning framework for post-training diffusion models. It scales rollouts efficiently while preserving training fidelity, which leads to faster convergence and better alignment.
Contribution
It proposes a two-stage FP4-based RL method that decouples candidate exploration from policy optimization, so the number of rollouts can be scaled up efficiently without performance loss.
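The decoupling of exploration from optimization can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the reward functions, the rule that keeps the `k` best and `k` worst candidates, and the mean/std-normalised advantages are all hypothetical stand-ins. The only idea carried over from the summary is that a cheap, low-precision pass explores many candidates, and a separate full-precision pass feeds the policy update.

```python
import math
import statistics


def precise_reward(seed: int) -> float:
    """Hypothetical stand-in for scoring a full-precision rollout."""
    return math.sin(seed * 0.7)


def cheap_reward(seed: int) -> float:
    """Hypothetical stand-in for a low-precision rollout: same signal, coarsened."""
    return round(precise_reward(seed) * 4) / 4


def explore(seeds, k):
    """Stage 1 (assumed rule): rank all candidates cheaply, keep k worst and k best."""
    ranked = sorted(seeds, key=cheap_reward)
    return ranked[:k] + ranked[-k:]


def advantages(rewards):
    """Mean/std-normalised rewards over the selected group (an assumed choice)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a constant group
    return [(r - mean) / std for r in rewards]


def two_stage_step(num_candidates=64, k=4):
    seeds = list(range(num_candidates))
    selected = explore(seeds, k)                      # cheap, wide exploration
    rewards = [precise_reward(s) for s in selected]   # Stage 2: full-precision scores
    return selected, advantages(rewards)


selected, adv = two_stage_step()
print(selected)
print([round(a, 3) for a in adv])
```

In this toy setup only `2 * k` of the 64 candidates reach the expensive stage, which is the kind of saving that makes a large candidate pool affordable.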
Findings
Accelerates training convergence by up to 4.64×.
Maintains training integrity despite FP4-quantized rollouts.
Achieves superior alignment across multiple diffusion models.
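To give a feel for the precision that FP4 gives up, the sketch below simulates a block-scaled 4-bit round trip in plain Python. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and 16-element blocks, and it keeps each block scale as an ordinary Python float, which is a simplification. The function names are hypothetical, and this is not the paper's kernel or any library's API.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6, snap to the grid, rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        # Nearest representable magnitude for the scaled value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


def fake_quantize(values, block_size=16):
    """Quantize and dequantize a flat list block by block (simulation only)."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out


values = [math.sin(0.37 * i) for i in range(32)]
roundtrip = fake_quantize(values)
max_err = max(abs(a - b) for a, b in zip(values, roundtrip))
print(f"max round-trip error: {max_err:.4f}")
```

The widest gap in the grid lies between 4 and 6, so the worst-case error within a block is one sixth of that block's largest magnitude. Errors of this size illustrate why an unmodified quantized pipeline could plausibly hurt training.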
Abstract
Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate…
Videos
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling (YouTube)
