Noise Injection Systemically Degrades Large Language Model Safety Guardrails
Prithviraj Singh Shahani, Kaveh Eskandari Miandoab, and Matthias Scheutz

TL;DR
This paper demonstrates that injecting Gaussian noise into large language models' activations significantly undermines safety guardrails, exposing vulnerabilities in current safety alignment methods and emphasizing the need for more robust solutions.
Contribution
The paper systematically evaluates how robust safety fine-tuning in LLMs is to noise perturbations, revealing critical vulnerabilities and informing future safety improvements.
Findings
Gaussian noise increases harmful outputs by up to 27%
Deeper safety fine-tuning does not improve robustness
Chain-of-thought reasoning remains largely unaffected
Abstract
Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet, their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight the potential of reasoning-based and reinforcement learning approaches as promising directions for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical…
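To make the perturbation concrete, here is a minimal sketch of injecting zero-mean Gaussian noise into hidden activations. It uses a toy two-layer NumPy network rather than an LLM. The names `add_gaussian_noise` and `forward`, the noise scale `sigma`, and the choice to perturb after each hidden layer are illustrative assumptions, not the authors' setup, which this summary does not specify.

```python
import numpy as np

def add_gaussian_noise(activations, sigma, rng):
    """Return activations plus zero-mean Gaussian noise with std sigma."""
    noise = rng.normal(loc=0.0, scale=sigma, size=activations.shape)
    return activations + noise

def forward(x, weights, sigma=0.0, rng=None):
    """Toy MLP forward pass (hypothetical stand-in for an LLM).

    Noise is injected after each hidden layer's nonlinearity;
    the final (output) layer is left unperturbed.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    h = x
    for i, w in enumerate(weights):
        h = h @ w
        if i < len(weights) - 1:  # hidden layers only
            h = np.tanh(h)
            if sigma > 0.0:
                h = add_gaussian_noise(h, sigma, rng)
    return h

# Assumed toy dimensions: 8 inputs, 16 hidden units, 4 outputs.
rng = np.random.default_rng(42)
weights = [rng.standard_normal((8, 16)), rng.standard_normal((16, 4))]
x = rng.standard_normal((2, 8))

clean = forward(x, weights, sigma=0.0)
noisy = forward(x, weights, sigma=0.1, rng=np.random.default_rng(1))
print("max output shift:", np.abs(noisy - clean).max())
```

In a real model the same idea would typically be applied by hooking intermediate layers and comparing harmful-output rates with and without noise. That evaluation step is outside this sketch.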
Taxonomy
Topics: Traffic Prediction and Management Techniques · Infrastructure Maintenance and Monitoring
