Language models recognize dropout and Gaussian noise applied to their activations
Damiano Fornasiere, Mirko Bronzi, Spencer Kitts, Alessandro Palmas, Yoshua Bengio, Oliver Richardson

TL;DR
This paper provides evidence that language models can detect, localize, and to some degree verbalize the difference between dropout and Gaussian noise perturbations applied to their activations.
Contribution
It shows that models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, can detect and localize activation perturbations, and can learn in context to distinguish dropout from Gaussian noise.
Findings
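All tested models detect and localize the perturbations, often with perfect accuracy.
Qwen3-32B's zero-shot accuracy in identifying which perturbation was applied improves with perturbation strength.
Models can learn, when taught in context, to distinguish between dropout and Gaussian noise.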
Abstract
We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) mask activations, simulating dropout, or (b) add Gaussian noise to them, at a target sentence. We then ask a multiple-choice question such as "Which of the previous sentences was perturbed?" or "Which of the two perturbations was applied?". We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, Qwen3-32B's zero-shot accuracy in identifying which perturbation was applied improves as a function of the perturbation strength and, moreover, decreases if the in-context…
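As a rough illustration of the two perturbations described in the abstract, the sketch below masks or adds noise to the activations of the tokens in one target sentence. It is not the authors' implementation: the function name `perturb`, the list-of-rows activation layout, the single `strength` parameter (used as the drop probability for dropout and as the standard deviation for Gaussian noise), and the choice not to rescale surviving activations after masking are all assumptions made for this example.

```python
import random


def perturb(activations, span, kind, strength, seed=0):
    """Return a copy of `activations` with the rows in `span` perturbed.

    activations: list of token rows, each a list of floats (hypothetical layout).
    span: (start, end) token indices of the target sentence, end exclusive.
    kind: "dropout" zeroes each entry with probability `strength`;
          "gaussian" adds zero-mean noise with standard deviation `strength`.
    """
    if kind not in ("dropout", "gaussian"):
        raise ValueError(f"unknown perturbation kind: {kind!r}")

    rng = random.Random(seed)          # seeded for reproducibility
    start, end = span
    out = [row[:] for row in activations]  # leave the input untouched

    for t in range(start, end):        # only tokens of the target sentence
        for j in range(len(out[t])):
            if kind == "dropout":
                # Assumption: plain masking, no 1/(1-p) rescaling.
                if rng.random() < strength:
                    out[t][j] = 0.0
            else:
                out[t][j] += rng.gauss(0.0, strength)
    return out


# Six tokens with eight hidden units each; tokens 2 and 3 form the target sentence.
acts = [[1.0] * 8 for _ in range(6)]
masked = perturb(acts, (2, 4), "dropout", 0.5)
noisy = perturb(acts, (2, 4), "gaussian", 0.5)
```

In both cases the rows outside the target span are returned unchanged, which is the property the localization question ("Which of the previous sentences was perturbed?") relies on.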
