The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

David Pape; Jonathan Evertz; Lea Schönherr

arXiv:2605.19537 · cs.LG · May 21, 2026

TL;DR

This paper investigates how the choice of inference backend can significantly affect the reproducibility and benchmark results of large language models, and highlights the need for standardized reporting.

Contribution

It systematically analyzes the influence of inference backends on LLM evaluation metrics and advocates for standardized reporting practices.

Findings

01

Inference backend choice can cause score shifts of up to 16.6 percentage points.

02

Backend optimizations like prefix caching and CUDA graphs drive output divergence.

03

The inference stack is rarely reported despite its impact on results.
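Finding 02 names prefix caching and CUDA graphs as drivers of output divergence. One general mechanism by which such optimizations can change outputs (offered here as background, not as the paper's own analysis) is that they may change the order in which floating-point sums are reduced, and floating-point addition is not associative. The minimal sketch below assumes Python's 64-bit floats and a hypothetical two-token vocabulary; the function names are illustrative only. A sequential and a tree-style reduction of the same values yield slightly different logits, which is enough to flip a greedy choice when two tokens are nearly tied.

```python
# Toy illustration (hypothetical names and values): the same numbers summed in
# a different order can produce a different float, and a near-tie between two
# logits can then flip the greedily decoded token.

def sum_left_to_right(values):
    # Sequential reduction, one element at a time.
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    # Tree-style reduction, similar in spirit to how a parallel kernel splits work.
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

def greedy_pick(logits):
    # Index of the largest logit; max() returns the first maximum on ties.
    return max(range(len(logits)), key=lambda i: logits[i])

contributions = [0.1, 0.2, 0.3]  # hypothetical contributions to one token's logit
other_logit = 0.6                # hypothetical logit of the competing token

for name, reduce in [("sequential", sum_left_to_right), ("pairwise", sum_pairwise)]:
    logits = [other_logit, reduce(contributions)]
    print(name, logits, "-> token", greedy_pick(logits))
# sequential [0.6, 0.6000000000000001] -> token 1
# pairwise [0.6, 0.6] -> token 0
```

Ties resolve toward the lower index here because Python's `max()` returns the first maximal element; a backend with a different tie-breaking rule could flip in the other direction. With 16-bit arithmetic, as commonly used on GPUs, the rounding gaps are much larger than in this float64 example.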

Abstract

Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends…
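The abstract notes that reduced-precision arithmetic can alter token probabilities and that such changes can cascade into divergent generation. The toy sketch below illustrates both effects under strong simplifying assumptions: a hypothetical three-token model (`toy_next_logits`) whose first step has two nearly tied logits, and reduced precision modeled only by rounding the final logits to IEEE 754 half precision through the standard `struct` module. A real backend would use reduced precision throughout the forward pass, not just on the output logits.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def greedy_pick(logits):
    # Greedy decoding; max() returns the first maximum, so ties go to the lower index.
    return max(range(len(logits)), key=lambda i: logits[i])

def toy_next_logits(prev_token, step):
    # Hypothetical toy model over a 3-token vocabulary.
    if step == 0:
        # Tokens 0 and 1 are nearly tied; half precision cannot separate them.
        return [2.0001, 2.0003, 0.5]
    # Afterwards the model deterministically continues with (prev + 1) mod 3.
    return [1.0 if t == (prev_token + 1) % 3 else 0.0 for t in range(3)]

def generate(num_tokens, half_precision):
    tokens, prev = [], None
    for step in range(num_tokens):
        logits = toy_next_logits(prev, step)
        if half_precision:
            logits = [to_fp16(x) for x in logits]
        prev = greedy_pick(logits)
        tokens.append(prev)
    return tokens

print(generate(5, half_precision=False))  # [1, 2, 0, 1, 2]
print(generate(5, half_precision=True))   # [0, 1, 2, 0, 1]
```

A single flipped token at step 0 changes every subsequent token, because each step conditions on the previous output.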
