Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference

Jiayi Yuan; Hao Li; Xinheng Ding; Wenya Xie; Yu-Jhe Li; Wentian Zhao; Kun Wan; Jing Shi; Xia Hu; Zirui Liu

arXiv:2506.09501 · cs.CL · October 28, 2025

TL;DR

This paper investigates how numerical precision and system configuration (evaluation batch size, GPU count, and GPU type) affect the reproducibility of LLM inference. It reports significant variability in generated responses and proposes LayerCast, an inference pipeline designed to be more stable.

Contribution

The paper provides the first systematic analysis of numerical sources of nondeterminism in LLM inference and introduces LayerCast to improve reproducibility.

Findings

01. GPU configuration impacts LLM output consistency

02. Floating-point precision influences response divergence

03. LayerCast balances memory efficiency with numerical stability

Abstract

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace…
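The "minor rounding differences" mentioned in the abstract can be illustrated with a small sketch. It rests on one assumption that the truncated text above does not spell out: that a different batch size or GPU count can change the order in which partial sums are accumulated. Floating-point addition is not associative, so a different order can give a different result, and the effect is easy to see at bfloat16 precision. The helpers `to_bf16` and `bf16_sum` are hypothetical names written for this illustration. They emulate bfloat16 rounding in pure Python and are not taken from the paper or from any library.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Hypothetical helper for illustration; handles ordinary finite inputs only.
    bfloat16 keeps the top 16 bits of the float32 bit pattern.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 bits that are about to be dropped.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Accumulate left to right, rounding to bfloat16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + v)
    return acc


# 2**-8 is half of the bfloat16 spacing just above 1.0 (which is 2**-7).
tiny = 2.0 ** -8

# Same three numbers, two accumulation orders.
sum_a = bf16_sum([1.0, tiny, tiny])  # each tiny term is rounded away
sum_b = bf16_sum([tiny, tiny, 1.0])  # tiny terms combine first and survive

print(sum_a, sum_b)  # prints "1.0 1.0078125"
```

The two sums differ in the last representable bit. Under greedy decoding, a difference of this size is only visible when two candidate tokens have nearly tied scores, but once one token changes, every later token is conditioned on a different prefix, which is consistent with the cascading divergence the abstract describes for long chains of thought.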
