FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Yitong Li; Junsong Chen; Shuchen Xue; Pengcuo Zeren; Siyuan Fu; Dinghao Yang; Yangyang Tang; Junjie Bai; Ping Luo; Song Han; Enze Xie

arXiv:2604.06916·cs.LG·April 9, 2026

TL;DR

This paper introduces Sol-RL, an FP4-accelerated reinforcement learning framework for diffusion models that scales rollouts efficiently while preserving training fidelity, yielding faster convergence and better alignment.

Contribution

The paper proposes a two-stage FP4-based RL method that decouples candidate exploration from policy optimization, enabling large-scale rollout scaling without degrading performance.

Findings

01. Accelerates training convergence by up to 4.64×.

02. Maintains training integrity with FP4 quantization.

03. Achieves superior alignment across multiple diffusion models.

Abstract

Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate…
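To make the efficiency-versus-integrity tension in the abstract concrete, the following is a minimal NumPy sketch of block-scaled 4-bit floating-point "fake quantization" (quantize, then dequantize). It is not the paper's implementation. The function name `fake_quant_fp4` and the block size of 16 are assumptions chosen for illustration, and the per-block scale is kept in full precision here, whereas hardware formats such as NVFP4 also store the scales in a compact format. The only format fact relied on is that a 4-bit E2M1 float can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x, block_size=16):
    """Round x onto the E2M1 grid with one scale per block, then dequantize.

    Hypothetical helper for illustration only; block_size=16 is an assumption.
    """
    x = np.asarray(x, dtype=np.float32)
    flat = x.reshape(-1)
    pad = (-flat.size) % block_size          # zero-pad so the length divides evenly
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)

    # Per-block scale maps the largest magnitude onto the largest grid value (6.0).
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    scaled = blocks / scale

    # Round each magnitude to its nearest grid point, then restore sign and scale.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    dequant = np.sign(scaled) * E2M1_GRID[idx] * scale

    return dequant.reshape(-1)[: flat.size].reshape(x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((4, 64))
    approx = fake_quant_fp4(weights)
    print("mean absolute rounding error:", np.abs(approx - weights).mean())
```

Running the script prints the average rounding error on random Gaussian values. The error is small per element but nonzero, which gives one intuition for why a pipeline that uses quantized samples everywhere could drift from its full-precision counterpart.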
