Protecting Your LLMs with Information Bottleneck

Zichuan Liu; Zefan Wang; Linjie Xu; Jinyu Wang; Lei Song; Tianchun Wang; Chunlin Chen; Wei Cheng; Jiang Bian

arXiv:2404.13968 · cs.CL · October 11, 2024 · 1 citation

Open Access · 2 Repos · 1 Video

TL;DR

This paper introduces IBProtector, a novel defense mechanism based on the information bottleneck principle, which compresses and perturbs prompts to protect large language models from harmful or jailbreaking attacks while maintaining response quality.

Contribution

The paper proposes IBProtector, a new, transferable defense method that effectively mitigates jailbreak attacks without modifying the underlying LLMs or significantly impacting performance.

Findings

01. IBProtector outperforms existing defenses against jailbreaks

02. It maintains response quality and inference speed

03. Effective across various attack types and models

Abstract

The advent of large language models (LLMs) has revolutionized the field of natural language processing, yet they might be attacked to produce harmful content. Despite efforts to ethically align LLMs, these are often fragile and can be circumvented by jailbreaking attacks through optimized or manual adversarial prompts. To address this, we introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle, and we modify the objective to avoid trivial solutions. The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor, preserving only essential information for the target LLMs to respond with the expected answer. Moreover, we further consider a situation where the gradient is not visible to be compatible with any LLM. Our empirical evaluations show that IBProtector…
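The compress-and-perturb step described in the abstract can be pictured with a small toy sketch. This is only an illustration under stated assumptions, not the paper's implementation: `toy_extractor` is a hypothetical hand-written heuristic standing in for the trainable extractor, `FILLER` is an assumed placeholder token, and `perturb_prompt` is an illustrative name. The idea shown is that each token receives a keep-probability, and tokens that are not kept are replaced by a filler before the prompt reaches the target LLM.

```python
import random
from typing import Callable, List, Optional

FILLER = "."  # assumed placeholder token for masked positions (illustrative choice)


def toy_extractor(tokens: List[str]) -> List[float]:
    """Hypothetical stand-in for a trained extractor.

    Returns one keep-probability per token. This heuristic keeps purely
    alphabetic tokens (1.0) and drops everything else (0.0); a real
    extractor would be learned rather than hand-written.
    """
    return [1.0 if tok.isalpha() else 0.0 for tok in tokens]


def perturb_prompt(
    tokens: List[str],
    extractor: Callable[[List[str]], List[float]] = toy_extractor,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Keep each token with its keep-probability, else substitute FILLER."""
    rng = rng or random.Random(0)
    probs = extractor(tokens)
    if len(probs) != len(tokens):
        raise ValueError("extractor must return one probability per token")
    # rng.random() lies in [0.0, 1.0), so p=1.0 always keeps and p=0.0 always masks.
    return [tok if rng.random() < p else FILLER for tok, p in zip(tokens, probs)]


# Toy prompt with a gibberish suffix standing in for an adversarial string.
tokens = "Tell me a story !!}{@ ==((".split()
print(" ".join(perturb_prompt(tokens)))  # prints "Tell me a story . ."
```

Because the toy probabilities are exactly 0 or 1, the output here is deterministic. With fractional probabilities the same function samples a different mask on each call, which is closer in spirit to a stochastic perturbation.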

Code & Models

Videos

Protecting Your LLMs with Information Bottleneck · SlidesLive

Taxonomy

Topics: Digital Rights Management and Security · Cryptography and Data Security · Blockchain Technology Applications and Security

Methods: ALIGN