The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
David Pape, Jonathan Evertz, Lea Schönherr

TL;DR
This paper investigates how the choice of inference backend affects the reproducibility and benchmark results of large language models, and argues for standardized reporting of the inference stack.
Contribution
It systematically analyzes the influence of inference backends on LLM evaluation metrics and advocates for standardized reporting practices.
Findings
Inference backend choice can cause score shifts of up to 16.6 percentage points.
Backend optimizations like prefix caching and CUDA graphs drive output divergence.
The inference stack is rarely reported despite its impact on results.
Abstract
Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends…
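To make the mechanism in the abstract concrete, here is a toy sketch of how a change in floating-point accumulation order alone can flip a greedy token choice. It is not taken from the paper: the values, the helper names `accumulate` and `argmax`, and the two-token setup are all hypothetical, chosen so the rounding effect is easy to see. Real backends differ in far subtler ways (kernel fusion, reduction order across threads, reduced precision), but the underlying cause, non-associative floating-point addition, is the same.

```python
# Toy illustration (assumed values, not from the paper): the same terms
# summed in two different orders yield different float64 results.

def accumulate(terms):
    """Sum terms strictly left to right, as one fixed reduction order."""
    total = 0.0
    for t in terms:
        total += t
    return total

def argmax(values):
    """Index of the largest value (greedy decoding over a tiny vocabulary)."""
    return max(range(len(values)), key=values.__getitem__)

# Two hypothetical backends compute the same logit from the same terms,
# but reduce them in a different order.
order_a = [1e16, 1.0, -1e16]   # the 1.0 is absorbed when added to 1e16
order_b = [1e16, -1e16, 1.0]   # the large terms cancel first, so 1.0 survives

logit_a = accumulate(order_a)
logit_b = accumulate(order_b)

# A competing token whose logit sits between the two results.
competitor = 0.5

token_a = argmax([logit_a, competitor])
token_b = argmax([logit_b, competitor])

print(f"order A: logit={logit_a}, chosen token={token_a}")
print(f"order B: logit={logit_b}, chosen token={token_b}")
```

Once one token differs, every later token is conditioned on a different prefix, which is how a rounding-level discrepancy can cascade into divergent generations.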
