SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan

TL;DR
SafeInt is a safety-aware representation intervention method that shields large language models from jailbreak attacks by relocating harmful representations into the rejection region while maintaining utility.
Contribution
We introduce SafeInt, a dynamic representation intervention technique that improves LLM safety against jailbreak attacks by relocating jailbreak-related representations into the rejection region.
Findings
Outperforms baseline defenses against six jailbreak attacks
Maintains high utility while defending against attacks
Effective against adaptive, real-time jailbreak attempts
Abstract
With the widespread real-world deployment of large language models (LLMs), ensuring their behavior complies with safety standards has become crucial. Jailbreak attacks exploit vulnerabilities in LLMs to induce undesirable behavior, posing a significant threat to LLM safety. Previous defenses often fail to achieve both effectiveness and efficiency simultaneously. Defenses from a representation perspective offer new insights, but existing interventions cannot dynamically adjust representations based on the harmfulness of the queries. To address this limitation, we propose SafeIntervention (SafeInt), a novel defense method that shields LLMs from jailbreak attacks through safety-aware representation intervention. Built on our analysis of the representations of jailbreak samples, the core idea of SafeInt is to relocate jailbreak-related representations into the rejection region. This is…
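The idea of adjusting a representation according to the harmfulness of the query can be illustrated with a toy sketch. This is not the authors' implementation: the linear probe (`w`, `b`), the `reject_center` point standing in for the rejection region, and the interpolation rule are all assumptions made for illustration, and the 2-D vectors are made-up data rather than real LLM hidden states.

```python
import numpy as np

def harm_score(h, w, b):
    """Hypothetical linear probe: estimated probability that h is harmful."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def intervene(h, reject_center, w, b):
    """Move h toward reject_center in proportion to its harm score.

    A score near 0 leaves h almost unchanged; a score near 1 places it
    almost exactly at reject_center (an assumed stand-in for the
    rejection region).
    """
    s = harm_score(h, w, b)
    return h + s * (reject_center - h)

# Toy 2-D "representations" (illustrative values only).
w = np.array([4.0, 0.0])              # assumed probe direction
b = 0.0                               # assumed probe bias
reject_center = np.array([3.0, 3.0])  # assumed centre of the rejection region

benign = np.array([-5.0, 1.0])
jailbreak = np.array([5.0, -1.0])

benign_out = intervene(benign, reject_center, w, b)
jailbreak_out = intervene(jailbreak, reject_center, w, b)
```

In this sketch the benign vector is essentially untouched, which mirrors the stated goal of preserving utility, while the jailbreak vector is pulled to the assumed rejection point.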
Taxonomy
Methods: ALIGN
