Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Zhihao Xu; Ruixuan Huang; Changyu Chen; Xiting Wang

arXiv:2404.12038 · cs.CL · December 3, 2024 · 2 cites

TL;DR

This paper introduces the SCAV framework to interpret and attack large language models' safety mechanisms, revealing significant safety risks and transferability of attacks across models.

Contribution

We propose a novel Safety Concept Activation Vector (SCAV) framework and an SCAV-guided attack method to improve attack success rates and interpret LLM safety mechanisms.

Findings

01

Attack success rate of 99.14% on seven open-source LLMs

02

Generated attack prompts may transfer to GPT-4

03

Embedding-level attacks may transfer to other white-box LLMs whose parameters are known

Abstract

Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks with automatically selected perturbation hyperparameters. Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that our generated attack prompts may be transferable to GPT-4, and the embedding-level attacks may also be transferred to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks…
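
As a rough illustration of the general idea behind a concept activation vector, the sketch below fits a linear probe (logistic regression) to two classes of activations and takes the normalized weight vector as the concept direction. It is not the paper's implementation: the activations are synthetic Gaussian clusters rather than hidden states from an LLM, and every name here (`make_synthetic_activations`, `fit_linear_probe`, `concept_vector`) is hypothetical.

```python
import math
import random


def make_synthetic_activations(n_per_class=200, dim=8, seed=0):
    """Return (vector, label) pairs: two Gaussian clusters standing in for
    hidden states of two prompt classes. Purely synthetic, an assumption."""
    rng = random.Random(seed)
    # Hypothetical concept axis: only the first two dimensions carry signal.
    true_direction = [1.0 if i < 2 else 0.0 for i in range(dim)]
    data = []
    for label in (0, 1):
        shift = 1.5 if label == 1 else -1.5
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) + shift * true_direction[i] for i in range(dim)]
            data.append((x, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probe's estimated probability that x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_linear_probe(data, lr=0.1, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(data[0][0])
    w, b, n = [0.0] * dim, 0.0, len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = predict_proba(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def concept_vector(w):
    """Unit-norm probe weights, read as the direction separating the classes."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


data = make_synthetic_activations()
w, b = fit_linear_probe(data)
cav = concept_vector(w)
accuracy = sum((predict_proba(w, b, x) > 0.5) == bool(y) for x, y in data) / len(data)
print(f"probe accuracy on synthetic data: {accuracy:.3f}")
```

On this toy data the learned direction should load mostly on the first two coordinates, since those are the only ones that differ between the clusters. Whether a real model's activations are this cleanly linearly separable is an empirical question that the sketch does not address.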

Code & Models

Repositories

sproutnan/ai-safety_scav
PyTorch · Official

Videos

Uncovering Safety Risks of Large Language Models through Concept Activation Vector · SlidesLive

Taxonomy

Topics: Software Engineering Research · Technology and Data Analysis · Information and Cyber Security

Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer