Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models
Ali Raza, Gurang Gupta, Nikolay Matyunin, Jibesh Patra

TL;DR
This paper introduces Amnesia, an adversarial attack method that manipulates internal states of large language models to bypass safety measures, revealing vulnerabilities in current safety mechanisms and emphasizing the need for more robust defenses.
Contribution
The study presents Amnesia, a novel activation-space adversarial attack that bypasses existing safeguards in open-weight LLMs without additional training.
Findings
Amnesia successfully circumvents safety mechanisms in state-of-the-art LLMs.
The attack induces antisocial behaviors in models on benchmark datasets.
Current safety measures are insufficient against internal activation manipulations.
Abstract
Warning: This article includes red-teaming experiments, which contain examples of compromised LLM responses that may be offensive or upsetting. Large Language Models (LLMs) can be used to create harmful content, such as sophisticated phishing emails or code for computer viruses. It is therefore crucial to ensure that they generate responses safely and responsibly. To reduce the risk of harmful or irresponsible output, researchers have developed techniques such as reinforcement learning from human feedback to align LLM outputs with human values and preferences. However, it remains undetermined whether such measures are sufficient to prevent LLMs from generating harmful responses. In this study, we propose Amnesia, a lightweight activation-space adversarial attack that manipulates internal transformer states to bypass existing…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection
