Protecting Your LLMs with Information Bottleneck
Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian

TL;DR
This paper introduces IBProtector, a novel defense mechanism based on the information bottleneck principle, which compresses and perturbs prompts to protect large language models from harmful or jailbreaking attacks while maintaining response quality.
Contribution
The paper proposes IBProtector, a new, transferable defense method that effectively mitigates jailbreak attacks without modifying the underlying LLMs or significantly impacting performance.
Findings
IBProtector outperforms existing defenses against jailbreaks
It maintains response quality and inference speed
Effective across various attack types and models
Abstract
The advent of large language models (LLMs) has revolutionized the field of natural language processing, yet they might be attacked to produce harmful content. Despite efforts to ethically align LLMs, these are often fragile and can be circumvented by jailbreaking attacks through optimized or manual adversarial prompts. To address this, we introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle, and we modify the objective to avoid trivial solutions. The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor, preserving only essential information for the target LLMs to respond with the expected answer. Moreover, we further consider a situation where the gradient is not visible to be compatible with any LLM. Our empirical evaluations show that IBProtector…
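The "selectively compresses and perturbs prompts" step can be illustrated with a small sketch. This is not the authors' implementation: the function name `perturb_prompt`, the filler token `"."`, and the hand-written keep probabilities are all assumptions made for illustration. In the paper's setting, the keep probabilities would come from the trainable extractor; here they are supplied directly.

```python
import random
from typing import List, Sequence


def perturb_prompt(
    tokens: Sequence[str],
    keep_probs: Sequence[float],
    filler: str = ".",  # assumed placeholder token, not taken from the paper
    seed: int = 0,
) -> List[str]:
    """Keep each token with its given probability; otherwise swap in `filler`.

    `keep_probs` stands in for the output of a (hypothetical) extractor that
    scores how much each token matters for the expected response.
    """
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    rng = random.Random(seed)
    # Sample one Bernoulli mask value per token: keep when the draw is below p.
    return [
        tok if rng.random() < p else filler
        for tok, p in zip(tokens, keep_probs)
    ]


if __name__ == "__main__":
    tokens = ["Please", "summarize", "this", "article", "!!", "@@"]
    # Probabilities of exactly 1.0 and 0.0 make the result deterministic.
    keep_probs = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
    print(perturb_prompt(tokens, keep_probs))
    # ['Please', 'summarize', 'this', 'article', '.', '.']
```

The perturbed token list has the same length as the input, so the target LLM sees a prompt of the same shape with low-scoring tokens neutralized. Replacing tokens rather than deleting them is a design choice of this sketch.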
Taxonomy
Topics: Digital Rights Management and Security · Cryptography and Data Security · Blockchain Technology Applications and Security
Methods: ALIGN
