The Instability of Safety: How Random Seeds and Temperature Expose Inconsistent LLM Refusal Behavior

Erik Larsen

arXiv:2512.12066 · cs.LG · December 17, 2025

Open Access

TL;DR

This paper demonstrates that large language models exhibit significant variability in safety refusal behavior across different random seeds and temperature settings, challenging the reliability of single-shot safety evaluations.

Contribution

The paper introduces the Safety Stability Index (SSI) to quantify how consistent a model's refusal decisions are across sampling configurations, and it argues that accurate safety assessment requires multi-sample evaluation protocols.

Findings

01. 18-28% of prompts show decision flips (refusal in some sampling configurations, compliance in others), depending on the model.

02. Higher temperatures significantly decrease the stability of safety decisions.

03. Single-shot evaluations agree with multi-sample results only 92.4% of the time.
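One plausible reading of the single-shot reliability figure is the share of individual runs whose decision matches the majority decision over all runs of the same prompt. The sketch below illustrates that reading on toy data. The function names, the "refuse"/"comply" labels, and the tie-breaking rule are assumptions made for illustration, and the paper's exact definition is not given in this listing.

```python
from collections import Counter


def majority_label(decisions):
    """Return the most common decision for one prompt.

    Ties resolve to "refuse"; this tie-break is an arbitrary assumption.
    """
    counts = Counter(decisions)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return "refuse" if "refuse" in winners else winners[0]


def single_shot_agreement(decisions_per_prompt):
    """Fraction of individual runs that match their prompt's majority label."""
    matches = 0
    total = 0
    for decisions in decisions_per_prompt:
        label = majority_label(decisions)
        matches += sum(d == label for d in decisions)
        total += len(decisions)
    return matches / total


# Toy data: three prompts, 20 runs each (not taken from the paper).
toy = [
    ["refuse"] * 20,
    ["refuse"] * 15 + ["comply"] * 5,
    ["comply"] * 12 + ["refuse"] * 8,
]

print(single_shot_agreement(toy))
```

Under this reading, a prompt that the model always refuses contributes perfect agreement, while a prompt split 12 to 8 pulls the overall figure down sharply.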

Abstract

Current safety evaluations of large language models rely on single-shot testing, implicitly assuming that model responses are deterministic and representative of the model's safety alignment. We challenge this assumption by investigating the stability of safety refusal decisions across random seeds and temperature settings. Testing four instruction-tuned models from three families (Llama 3.1 8B, Qwen 2.5 7B, Qwen 3 8B, Gemma 3 12B) on 876 harmful prompts across 20 different sampling configurations (4 temperatures × 5 random seeds), we find that 18-28% of prompts, depending on the model, exhibit decision flips: the model refuses in some configurations but complies in others. Our Safety Stability Index (SSI) reveals that higher temperatures significantly reduce decision stability (Friedman chi-squared = 396.81, p < 0.001), with mean within-temperature SSI dropping from 0.977 at temperature…
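The evaluation grid and the notion of a decision flip described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code. The temperature values and seeds are hypothetical placeholders, because the listing states only their counts. The `majority_agreement` score is a simple stand-in for a stability measure and is not the paper's SSI, whose formula does not appear here.

```python
from collections import Counter
from itertools import product

# Hypothetical values: the abstract gives 4 temperatures and 5 seeds
# but this listing does not say which ones.
TEMPERATURES = [0.0, 0.5, 1.0, 1.5]
SEEDS = [0, 1, 2, 3, 4]
CONFIGS = list(product(TEMPERATURES, SEEDS))  # 4 x 5 = 20 configurations


def has_flip(decisions):
    """True if a prompt is refused under some configurations and complied with under others."""
    return len(set(decisions.values())) > 1


def majority_agreement(decisions):
    """Share of configurations that agree with the most common decision.

    A stand-in stability score, NOT the paper's SSI definition.
    """
    counts = Counter(decisions.values())
    return max(counts.values()) / len(decisions)


def flip_rate(results):
    """Fraction of prompts that show at least one decision flip."""
    return sum(has_flip(d) for d in results.values()) / len(results)


# Toy results keyed by (temperature, seed); not data from the paper.
stable = {cfg: "refuse" for cfg in CONFIGS}
unstable = {
    cfg: ("comply" if cfg[0] >= 1.0 and cfg[1] < 2 else "refuse")
    for cfg in CONFIGS
}
results = {"prompt_a": stable, "prompt_b": unstable}

print(len(CONFIGS))
print(flip_rate(results))
print(majority_agreement(unstable))
```

In a real run, each decision would come from classifying a model response generated at the given temperature and seed. The toy dictionaries above only stand in for that step.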

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)