RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction

Tanqiu Jiang; Zian Wang; Jiacheng Liang; Changjiang Li; Yuhui Wang; Ting Wang

arXiv:2410.19937·cs.CR·October 29, 2024

TL;DR

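RobustKV is a defense for large language models that evicts the lowest-importance tokens, as ranked by attention scores, from the key-value (KV) cache. Because an effective jailbreak prompt draws attention away from the harmful query it conceals, this eviction weakens the harmful query's influence while maintaining performance on benign inputs.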

Contribution

It introduces a novel token eviction strategy based on attention importance to defend against adaptive jailbreak attacks in LLMs.

Findings

01 RobustKV effectively counters state-of-the-art jailbreak attacks.

02 RobustKV maintains performance on benign queries.

03 RobustKV creates an evasiveness trade-off for adversaries.

Abstract

Jailbreak attacks circumvent LLMs' built-in safeguards by concealing harmful queries within jailbreak prompts. While existing defenses primarily focus on mitigating the effects of jailbreak prompts, they often prove inadequate because jailbreak prompts can take arbitrary, adaptive forms. This paper presents RobustKV, a novel defense that adopts a fundamentally different approach by selectively removing critical tokens of harmful queries from key-value (KV) caches. Intuitively, for a jailbreak prompt to be effective, its tokens must achieve sufficient 'importance' (as measured by attention scores), which inevitably lowers the importance of tokens in the concealed harmful query. By strategically evicting the KVs of the lowest-ranked tokens, RobustKV therefore diminishes the presence of the harmful query in the KV cache, preventing the LLM from generating malicious responses. Extensive…
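
The eviction idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `evict_ratio` parameter, and the choice to score each token by its mean received attention over a single attention matrix are all assumptions made for illustration. The abstract only states that importance is measured by attention scores and that the KVs of the lowest-ranked tokens are evicted.

```python
from typing import List, Sequence, Tuple


def token_importance(attn: Sequence[Sequence[float]]) -> List[float]:
    """Score each key position by the mean attention it receives.

    attn[q][k] is the attention weight from query position q to key
    position k. Averaging over queries is an assumption of this sketch.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]


def evict_lowest_ranked(
    keys: Sequence,
    values: Sequence,
    importance: Sequence[float],
    evict_ratio: float,
) -> Tuple[list, list, List[int]]:
    """Drop the KV entries of the lowest-importance tokens.

    evict_ratio is a hypothetical knob: the fraction of tokens to evict.
    Returns the kept keys, kept values, and kept token indices, all in
    their original order.
    """
    n = len(importance)
    n_evict = int(n * evict_ratio)
    # Token indices sorted from least to most important.
    ascending = sorted(range(n), key=lambda i: importance[i])
    evicted = set(ascending[:n_evict])
    kept = [i for i in range(n) if i not in evicted]
    return [keys[i] for i in kept], [values[i] for i in kept], kept


if __name__ == "__main__":
    # Toy attention matrix: 2 query positions attending to 5 prompt tokens.
    attn = [
        [0.40, 0.30, 0.05, 0.05, 0.20],
        [0.50, 0.20, 0.10, 0.00, 0.20],
    ]
    keys = ["k0", "k1", "k2", "k3", "k4"]
    values = ["v0", "v1", "v2", "v3", "v4"]
    scores = token_importance(attn)
    kept_keys, kept_values, kept_idx = evict_lowest_ranked(
        keys, values, scores, evict_ratio=0.4
    )
    print(kept_idx)
```

In this toy setup, tokens 2 and 3 receive the least attention, so their KV entries are the ones dropped. Under the abstract's intuition, such low-ranked positions would tend to belong to the concealed harmful query when a jailbreak prompt dominates the attention.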

Videos

RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction · SlidesLive

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics

Methods: Softmax · Attention Is All You Need · Focus