Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Prithviraj Singh Shahani, Kaveh Eskandari Miandoab, and Matthias Scheutz

arXiv:2505.13500 · cs.CL · October 14, 2025

TL;DR

This paper demonstrates that injecting Gaussian noise into large language models' activations significantly undermines safety guardrails, exposing vulnerabilities in current safety alignment methods and emphasizing the need for more robust solutions.

Contribution

It systematically evaluates the robustness of safety fine-tuning in LLMs under noise perturbations, revealing critical vulnerabilities and guiding future safety improvements.

Findings

1. Gaussian noise increases harmful outputs by up to 27%.
2. Deeper safety fine-tuning does not improve robustness.
3. Chain-of-thought reasoning remains largely unaffected.

Abstract

Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates by up to 27% (p < 0.001), (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight reasoning-based and reinforcement learning approaches as promising directions for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical…
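To make the idea of injecting Gaussian noise into activations concrete, here is a minimal sketch on a toy two-layer network in pure Python. It is not the authors' code: the network, the parameter values in `PARAMS`, the function names, and the choice to perturb only the hidden layer are all assumptions made for illustration, and the abstract does not state which layers or noise scales the paper uses.

```python
import random

# Hypothetical toy parameters (not from the paper): 2 inputs -> 2 hidden -> 1 output.
PARAMS = {
    "w1": [[0.5, -0.2], [0.1, 0.3]],
    "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0]],
    "b2": [0.0],
}


def add_gaussian_noise(activations, sigma, rng):
    """Return a copy of `activations` with i.i.d. N(0, sigma^2) noise added to each entry."""
    return [a + rng.gauss(0.0, sigma) for a in activations]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def forward(x, params, sigma=0.0, rng=None):
    """Run the toy network, perturbing the hidden activations when sigma > 0."""
    hidden = relu(linear(x, params["w1"], params["b1"]))
    if sigma > 0.0:
        # Assumption: noise is added after the nonlinearity of the hidden layer.
        hidden = add_gaussian_noise(hidden, sigma, rng or random.Random())
    return linear(hidden, params["w2"], params["b2"])


if __name__ == "__main__":
    x = [1.0, 2.0]
    print("clean:", forward(x, PARAMS))
    print("noisy:", forward(x, PARAMS, sigma=0.5, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a noisy run repeatable, which matters when comparing outputs across noise levels. In a real LLM the same perturbation would be applied to intermediate tensors, for example through a framework's hook mechanism, rather than to Python lists.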

Taxonomy

Topics: Traffic Prediction and Management Techniques · Infrastructure Maintenance and Monitoring