Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Ali Raza; Gurang Gupta; Nikolay Matyunin; Jibesh Patra

arXiv:2603.10080·cs.CR·March 18, 2026

Open Access

TL;DR

This paper introduces Amnesia, an adversarial attack method that manipulates internal states of large language models to bypass safety measures, revealing vulnerabilities in current safety mechanisms and emphasizing the need for more robust defenses.

Contribution

The study presents Amnesia, a novel activation-space adversarial attack that bypasses existing safeguards in open-weight LLMs without additional training.

Findings

1. Amnesia successfully circumvents safety mechanisms in state-of-the-art LLMs.

2. The attack induces antisocial behaviors in models on benchmark datasets.

3. Current safety measures are insufficient against internal activation manipulations.

Abstract

Warning: This article includes red-teaming experiments, which contain examples of compromised LLM responses that may be offensive or upsetting.

Large Language Models (LLMs) have the potential to create harmful content, such as generating sophisticated phishing emails and assisting in writing code for harmful computer viruses. Thus, it is crucial to ensure that they generate responses safely and responsibly. To reduce the risk of generating harmful or irresponsible content, researchers have developed techniques such as reinforcement learning from human feedback to align LLMs' outputs with human values and preferences. However, it is still undetermined whether such measures are sufficient to prevent LLMs from generating harmful responses. In this study, we propose Amnesia, a lightweight activation-space adversarial attack that manipulates internal transformer states to bypass existing…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection