Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment

Haoran Wang; Kai Shu

arXiv:2311.09433 · cs.CR · August 19, 2024 · 1 citation

TL;DR

This paper introduces Trojan Activation Attack (TA^2), a novel method to stealthily manipulate large language models by injecting trojan steering vectors into activation layers, effectively compromising safety alignment without significant overhead.
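As a rough illustration of what "injecting a steering vector into an activation layer" means mechanically, the toy sketch below uses plain Python lists in place of real model activations. It assumes the common activation-steering recipe in which the vector is the mean difference between activations collected under two contrasting behaviours; this page does not state how TA^2 actually constructs its vectors, and the names `steering_vector`, `apply_steering`, and `strength` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(acts_a: List[Vector], acts_b: List[Vector]) -> Vector:
    """Mean-difference direction pointing from behaviour B towards behaviour A.

    Assumption: a contrastive mean difference, as in generic activation
    steering; not necessarily the construction used in the paper.
    """
    mean_a, mean_b = mean_vector(acts_a), mean_vector(acts_b)
    return [a - b for a, b in zip(mean_a, mean_b)]


def apply_steering(hidden: Vector, vector: Vector, strength: float) -> Vector:
    """Add the scaled steering vector to one layer's activation at inference."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional "activations" recorded under two contrasting behaviours.
acts_a = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
acts_b = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]

vec = steering_vector(acts_a, acts_b)
steered = apply_steering([0.5, 0.5, 0.5], vec, strength=0.5)
```

The model weights and the prompt are untouched in this sketch; only an intermediate activation is shifted at inference time, which is the property the summary describes as low overhead.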

Contribution

The paper presents a new activation-based attack method on LLMs that is highly effective, stealthy, and computationally efficient, expanding the understanding of safety vulnerabilities.

Findings

01 TA^2 effectively manipulates LLM behaviors at inference.

02 The attack maintains high stealthiness and low detection risk.

03 It requires minimal additional computational resources.

Abstract

To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these attacks often demand substantial computational resources for implementation, making them less practical for real-world applications. In this work, we study a different attack scenario, called Trojan Activation Attack…

Code & Models

Repositories

wang2226/backdoor-activation-attack (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning