Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng Wu, Xiao Yu

TL;DR
This paper investigates the Training-Inference Mismatch (TIM) in LLM reinforcement learning, showing that small numerical disagreements can cause training collapse and proposing remedies to improve stability.
Contribution
It isolates TIM in a zero-mismatch diagnostic setting (VeXact), shows that small token-level numerical disagreements can independently cause training collapse, and identifies a set of remedies that could mitigate TIM.
Findings
Small token-level numerical disagreements can independently cause training collapse.
TIM changes the effective optimization problem.
A set of identified remedies could mitigate TIM.
Abstract
Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL…
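To make the notion of mismatch concrete, the following is a minimal sketch of how the disagreement between the two stages might be measured for a single sequence. It is not the paper's method or the VeXact setting. The function name `tim_diagnostics`, the metric names, and the log-probability values are all hypothetical and chosen only for illustration.

```python
import math


def tim_diagnostics(trainer_logprobs, rollout_logprobs):
    """Compare per-token log-probs that two engines assign to the SAME tokens.

    Both lists are assumed to come from the same model weights and the same
    token sequence, so any gap is numerical mismatch, not off-policy drift.
    """
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("both engines must score the same tokens")

    # Per-token gap: log p_train(token) - log p_rollout(token).
    gaps = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]

    # Summing token gaps gives the sequence-level log ratio.
    seq_log_ratio = math.fsum(gaps)

    return {
        "max_abs_token_gap": max(abs(g) for g in gaps),
        "seq_log_ratio": seq_log_ratio,
        # Equals 1.0 only when the two stages agree on the whole sequence.
        "seq_ratio": math.exp(seq_log_ratio),
    }


# Hypothetical log-probs for a 4-token sequence (made-up numbers).
trainer = [-0.105, -2.303, -0.511, -1.204]
rollout = [-0.105, -2.313, -0.511, -1.199]

report = tim_diagnostics(trainer, rollout)
print(report)
```

When the two stages agree on every token, `seq_ratio` is exactly 1.0. Any deviation from 1.0 under identical weights reflects the kind of numerical disagreement the abstract calls TIM.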
