RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
Tanqiu Jiang, Zian Wang, Jiacheng Liang, Changjiang Li, Yuhui Wang, Ting Wang

TL;DR
RobustKV is a defense mechanism for large language models that evicts the critical tokens of harmful queries from key-value (KV) caches to counter jailbreak attacks, reducing the harmful query's influence while maintaining performance on benign inputs.
Contribution
It introduces a novel token eviction strategy based on attention importance to defend against adaptive jailbreak attacks in LLMs.
Findings
RobustKV effectively counters state-of-the-art jailbreak attacks.
Maintains performance on benign queries.
Creates an evasiveness trade-off for adversaries.
Abstract
Jailbreak attacks circumvent LLMs' built-in safeguards by concealing harmful queries within jailbreak prompts. While existing defenses primarily focus on mitigating the effects of jailbreak prompts, they often prove inadequate as jailbreak prompts can take arbitrary, adaptive forms. This paper presents RobustKV, a novel defense that adopts a fundamentally different approach by selectively removing critical tokens of harmful queries from key-value (KV) caches. Intuitively, for a jailbreak prompt to be effective, its tokens must achieve sufficient 'importance' (as measured by attention scores), which inevitably lowers the importance of tokens in the concealed harmful query. By strategically evicting the KVs of the lowest-ranked tokens, RobustKV therefore diminishes the presence of the harmful query in the KV cache, preventing the LLM from generating malicious responses. Extensive…
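The eviction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain averaging of attention over heads and query positions, and the `evict_ratio` parameter are all assumptions made for illustration, since this summary does not specify how RobustKV computes or aggregates importance.

```python
from typing import List


def token_importance(attn: List[List[List[float]]]) -> List[float]:
    """Average attention each key token receives.

    attn[h][q][k] is the attention weight from query position q to key
    position k in head h. Averaging over heads and queries is an
    assumption of this sketch, not a detail taken from the paper.
    """
    n_heads = len(attn)
    n_queries = len(attn[0])
    n_keys = len(attn[0][0])
    scores = [0.0] * n_keys
    for head in attn:
        for row in head:
            for k, weight in enumerate(row):
                scores[k] += weight
    return [s / (n_heads * n_queries) for s in scores]


def evict_lowest(scores: List[float], evict_ratio: float) -> List[int]:
    """Return the positions whose KV entries are kept, in original order.

    The lowest-ranked fraction `evict_ratio` (hypothetical parameter)
    of tokens is dropped from the cache.
    """
    n_evict = int(len(scores) * evict_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    evicted = set(ranked[:n_evict])
    return [i for i in range(len(scores)) if i not in evicted]


# Toy example: one head, two query positions, two key tokens.
toy_attn = [[[1.0, 0.0], [0.5, 0.5]]]
toy_scores = token_importance(toy_attn)

# Toy example: five tokens, evict the two with the lowest scores.
kept = evict_lowest([0.4, 0.05, 0.3, 0.1, 0.15], evict_ratio=0.4)
```

In this toy setting, the tokens at positions 1 and 3 receive the least attention, so their KV entries are the ones dropped. In a real model the kept positions would be used to slice the per-layer key and value tensors.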
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics
Methods: Softmax · Attention Is All You Need · Focus
