Quantifying non deterministic drift in large language models
Claire Nicholson

TL;DR
This paper empirically quantifies the inherent output variability of large language models across different settings, revealing persistent nondeterminism even at zero temperature and providing a baseline for future stability improvements.
Contribution
It systematically measures and compares output drift in LLMs under various conditions, establishing a baseline for nondeterminism without stabilization techniques.
Findings
Nondeterminism persists at temperature 0.0.
Variability patterns differ by model size and prompt type.
Lexical metrics have limitations, which suggests a need for semantic similarity measures.
Abstract
Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with…
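The three drift measures named in the abstract (unique output fraction, lexical similarity, and word count statistics) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the function name `drift_metrics` is hypothetical, the unique output fraction is taken to be distinct outputs divided by total runs, and lexical similarity is approximated with the standard library's `difflib.SequenceMatcher` ratio averaged over all pairs of outputs. The paper's exact metric definitions are not given in the abstract.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, pstdev


def drift_metrics(outputs):
    """Summarise variability across repeated runs of one prompt.

    `outputs` is a list of strings, one per run. All definitions here
    are illustrative assumptions, not the paper's exact metrics.
    """
    if not outputs:
        raise ValueError("need at least one output")

    # Assumed definition: distinct outputs / total runs.
    unique_fraction = len(set(outputs)) / len(outputs)

    # Assumed lexical similarity: mean character-level SequenceMatcher
    # ratio over all unordered pairs; 1.0 when there is a single run.
    pairs = list(combinations(outputs, 2))
    mean_similarity = (
        mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
        if pairs
        else 1.0
    )

    # Word counts use simple whitespace tokenisation.
    word_counts = [len(text.split()) for text in outputs]

    return {
        "unique_fraction": unique_fraction,
        "mean_pairwise_similarity": mean_similarity,
        "word_count_mean": mean(word_counts),
        "word_count_std": pstdev(word_counts),
    }


if __name__ == "__main__":
    runs = ["The sky is blue.", "The sky is blue.", "The sky appears blue."]
    print(drift_metrics(runs))
```

Under these assumed definitions, a fully deterministic model would give a unique fraction of 1/n for n runs and a mean pairwise similarity of 1.0; higher unique fractions and lower similarity indicate more drift.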
Taxonomy
Topics: Data Stream Mining Techniques · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification
