TL;DR
This paper investigates how numerical precision and system configurations affect the reproducibility of LLM inference, revealing significant variability and proposing a stable inference pipeline called LayerCast.
Contribution
It provides the first systematic analysis of numerical sources of nondeterminism in LLM inference and introduces LayerCast to improve reproducibility.
Findings
GPU configuration impacts LLM output consistency
Floating-point precision influences response divergence
LayerCast balances memory efficiency with numerical stability
Abstract
Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a 9,000-token difference in response length due to differences in GPU count, type, and evaluation batch size. We trace…
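The "minor rounding differences" mentioned in the abstract can be illustrated with a small sketch. It assumes a well-known mechanism: floating-point addition is not associative, so summing the same values in a different order (as may happen when batch size or GPU count changes how a reduction is split) can give a different result at low precision. The helpers `to_bf16` and `bf16_sum` are hypothetical names written for this illustration; they emulate bfloat16 rounding in pure Python and are not taken from the paper or from any library.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Illustration only: finite inputs are assumed; NaN, infinity and
    overflow are not handled.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 pattern; add a bias so
    # that dropping the low 16 bits rounds to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Sum values left to right, rounding to bfloat16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


# The same three numbers, accumulated in two different orders.
large_first = bf16_sum([256.0, 1.0, 1.0])
small_first = bf16_sum([1.0, 1.0, 256.0])

print(large_first, small_first)
```

Near 256, adjacent bfloat16 values are 2 apart, so adding 1 to 256 is rounded away, while adding 1 + 1 first produces a representable 2. The two orders therefore disagree. Under greedy decoding, a discrepancy of this kind in a logit could change which token is selected when two candidates are nearly tied, and every later token then conditions on a different prefix.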