Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

Guobin Shen; Dongcheng Zhao; Yiting Dong; Xiang He; Yi Zeng

arXiv:2410.02298 · cs.CR · February 10, 2025 · 3 cites

Open Access

TL;DR

Jailbreak Antidote is a real-time method that adjusts a small subset of internal model states to balance safety and utility in large language models, effectively defending against jailbreak attacks without added latency.

Contribution

It introduces a sparse internal state adjustment technique for LLMs that enables dynamic safety control during inference, improving safety without sacrificing utility or efficiency.

Findings

01. Adjusting about 5% of internal states suffices for safety control.

02. The method is effective across nine diverse LLMs.

03. It outperforms existing defenses in efficiency and flexibility.

Abstract

As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional…
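The adjustment described in the abstract can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the function name `sparse_safety_shift`, the strength parameter `alpha`, the random stand-in for the safety direction, and the rule of keeping the largest-magnitude components of the direction are all hypothetical choices made for illustration. Only the general idea (shift hidden states along a safety direction, touching roughly 5% of the dimensions, with a tunable strength) comes from the text above.

```python
import numpy as np

def sparse_safety_shift(hidden, direction, alpha, keep_frac=0.05):
    """Shift `hidden` along a sparsified copy of `direction`.

    hidden:    array of shape (tokens, dim), toy hidden states
    direction: array of shape (dim,), a stand-in safety direction
    alpha:     shift strength; 0 leaves the states unchanged
    keep_frac: fraction of dimensions that are adjusted
    """
    direction = np.asarray(direction, dtype=float)
    k = max(1, int(round(keep_frac * direction.size)))
    # Assumption: keep the k components with the largest magnitude.
    top = np.argpartition(np.abs(direction), -k)[-k:]
    mask = np.zeros_like(direction)
    mask[top] = 1.0
    # Only the masked dimensions move; all others pass through untouched.
    return hidden + alpha * mask * direction, mask

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 200))   # 4 token positions, toy hidden size 200
direction = rng.normal(size=200)     # random placeholder, not a learned direction

shifted, mask = sparse_safety_shift(hidden, direction, alpha=2.0)
changed = np.any(shifted != hidden, axis=0)
print(int(mask.sum()), int(changed.sum()))
```

Raising or lowering `alpha` at call time is what would give runtime control over the safety-utility trade-off in this sketch; no weights are modified and no extra tokens are added to the prompt.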

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Topic Modeling