Defeating the Training-Inference Mismatch via FP16
Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

TL;DR
This paper shows that switching from BF16 to FP16 floating-point precision in RL fine-tuning of large language models reduces the training-inference mismatch, leading to more stable optimization, faster convergence, and better performance.
Contribution
The study reveals that FP16 precision, rather than BF16, effectively resolves training-inference mismatch issues in RL fine-tuning of LLMs, with minimal implementation effort.
Findings
FP16 reduces training-inference mismatch
FP16 improves training stability and convergence
FP16 enhances model performance across tasks
Abstract
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and…
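The abstract's claim about rounding error follows from the formats themselves: BF16 keeps 7 explicit mantissa bits, while FP16 keeps 10. The sketch below is an illustration of that difference only, not the paper's experimental setup. It rounds probability-like values in (0, 1) to each format using the Python standard library and compares the mean relative error. The helper names (`round_to_fp16`, `round_to_bf16`, `mean_relative_error`) and the test grid are hypothetical choices made for this example, and the BF16 helper assumes finite inputs well below the float32 overflow limit.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (7 mantissa bits).

    bfloat16 is the upper 16 bits of a float32, so we round the float32
    bit pattern to those bits with round-to-nearest-even.
    Assumption: x is finite and far from the float32 overflow limit.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias of 0x7FFF plus the lowest kept bit implements ties-to-even.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mean_relative_error(round_fn, values):
    """Average of |round_fn(v) - v| / |v| over the given nonzero values."""
    return sum(abs(round_fn(v) - v) / abs(v) for v in values) / len(values)


# Hypothetical test grid: probability-like values in (0, 1).
values = [0.001 * k for k in range(1, 1000)]
fp16_err = mean_relative_error(round_to_fp16, values)
bf16_err = mean_relative_error(round_to_bf16, values)

print(f"mean relative rounding error, FP16: {fp16_err:.2e}")
print(f"mean relative rounding error, BF16: {bf16_err:.2e}")
```

With three fewer mantissa bits, BF16's spacing between adjacent representable values is 8 times coarser than FP16's at the same magnitude, so the BF16 error on this grid should come out several times larger. The trade-off is range: FP16 has 5 exponent bits against BF16's 8, so it overflows and underflows much sooner.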
