The Instability of Safety: How Random Seeds and Temperature Expose Inconsistent LLM Refusal Behavior
Erik Larsen

TL;DR
This paper demonstrates that large language models exhibit significant variability in safety refusal behavior across different random seeds and temperature settings, challenging the reliability of single-shot safety evaluations.
Contribution
It introduces the Safety Stability Index (SSI) to quantify response consistency and shows the importance of multi-sample evaluation protocols for accurate safety assessment.
Findings
18-28% of prompts show decision flips depending on sampling configuration
Higher temperatures significantly decrease safety decision stability
Single-shot evaluations agree with multi-sample results only 92.4% of the time
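One way to read the single-shot reliability figure is as an agreement rate: how often one sampled response gives the same refuse/comply decision as the majority decision over all sampled configurations for that prompt. The sketch below is a minimal illustration under that assumption. The function names, the boolean encoding (True = refused), and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_label(decisions):
    """Most common decision for one prompt (ties go to the value seen first)."""
    return Counter(decisions).most_common(1)[0][0]


def single_shot_agreement(all_decisions):
    """Fraction of individual samples that match their prompt's majority decision.

    all_decisions: list of per-prompt lists of booleans (True = refused).
    This is an assumed proxy for single-shot reliability, not the paper's metric.
    """
    matches = 0
    total = 0
    for decisions in all_decisions:
        target = majority_label(decisions)
        matches += sum(d == target for d in decisions)
        total += len(decisions)
    return matches / total


# Toy data: prompt 1 is always refused, prompt 2 is refused in 3 of 4 samples.
toy_decisions = [
    [True, True, True, True],
    [True, True, True, False],
]
toy_agreement = single_shot_agreement(toy_decisions)  # 7 of 8 samples match
```

Under this reading, a single evaluation run is one draw from each prompt's list, so the agreement rate estimates how often that draw would match the multi-sample verdict.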
Abstract
Current safety evaluations of large language models rely on single-shot testing, implicitly assuming that model responses are deterministic and representative of the model's safety alignment. We challenge this assumption by investigating the stability of safety refusal decisions across random seeds and temperature settings. Testing four instruction-tuned models from three families (Llama 3.1 8B, Qwen 2.5 7B, Qwen 3 8B, Gemma 3 12B) on 876 harmful prompts across 20 different sampling configurations (4 temperatures × 5 random seeds), we find that, depending on the model, 18-28% of prompts exhibit decision flips: the model refuses in some configurations but complies in others. Our Safety Stability Index (SSI) reveals that higher temperatures significantly reduce decision stability (Friedman chi-squared = 396.81, p < 0.001), with mean within-temperature SSI dropping from 0.977 at temperature…
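The abstract describes a grid of temperatures and seeds per prompt, a per-prompt "decision flip", and a within-temperature stability score. The SSI formula itself is not given in the text above, so the sketch below substitutes a simple stand-in: the fraction of seeds at a given temperature that agree with the most common decision at that temperature. All names, the data layout, and the toy values are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def agreement(decisions):
    """Share of decisions equal to the most common one (assumed stability proxy)."""
    return Counter(decisions).most_common(1)[0][1] / len(decisions)


def flip_rate(results):
    """Fraction of prompts refused in some configurations and complied with in others.

    results: {prompt_id: {(temperature, seed): refused}} with refused as a bool.
    """
    flipped = sum(len(set(configs.values())) > 1 for configs in results.values())
    return flipped / len(results)


def within_temperature_agreement(results):
    """Mean per-prompt agreement across seeds, computed separately per temperature."""
    per_temp = defaultdict(list)
    for configs in results.values():
        by_temp = defaultdict(list)
        for (temperature, _seed), refused in configs.items():
            by_temp[temperature].append(refused)
        for temperature, decisions in by_temp.items():
            per_temp[temperature].append(agreement(decisions))
    return {t: sum(scores) / len(scores) for t, scores in per_temp.items()}


# Toy grid: 2 hypothetical temperatures x 5 seeds, 2 prompts.
seeds = range(5)
toy_results = {
    "p1": {(t, s): True for t in (0.0, 1.0) for s in seeds},
    "p2": {
        **{(0.0, s): True for s in seeds},      # stable at temperature 0.0
        **{(1.0, s): s < 3 for s in seeds},     # 3 refusals, 2 compliances at 1.0
    },
}
toy_flip_rate = flip_rate(toy_results)
toy_stability = within_temperature_agreement(toy_results)
```

In this toy grid only "p2" flips, and only at the higher temperature, which mirrors the qualitative pattern the abstract reports without reproducing its numbers.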
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
