SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention

Jiaqi Wu; Chen Chen; Chunyan Hou; Xiaojie Yuan

arXiv:2502.15594·cs.CL·May 26, 2025

TL;DR

SafeInt is a safety-aware representation intervention method that shields large language models from jailbreak attacks by relocating jailbreak-related representations into the rejection region, while maintaining utility and remaining robust to adaptive attacks.

Contribution

We introduce SafeInt, a dynamic representation intervention technique that improves LLM safety against jailbreak attacks by relocating jailbreak-related representations into the rejection region, where the model refuses the query.

Findings

01. Outperforms baseline defenses against six jailbreak attacks

02. Maintains high utility while defending against attacks

03. Effective against adaptive, real-time jailbreak attempts

Abstract

With the widespread real-world deployment of large language models (LLMs), ensuring their behavior complies with safety standards has become crucial. Jailbreak attacks exploit vulnerabilities in LLMs to induce undesirable behavior, posing a significant threat to LLM safety. Previous defenses often fail to achieve both effectiveness and efficiency simultaneously. Defenses from a representation perspective offer new insights, but existing interventions cannot dynamically adjust representations based on the harmfulness of the queries. To address this limitation, we propose SafeIntervention (SafeInt), a novel defense method that shields LLMs from jailbreak attacks through safety-aware representation intervention. Built on our analysis of the representations of jailbreak samples, the core idea of SafeInt is to relocate jailbreak-related representations into the rejection region. This is…
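
The core idea described in the abstract, moving a jailbreak-related representation toward the rejection region with a strength that depends on how harmful the query looks, can be illustrated with a toy sketch. This is not the authors' learned intervention. It assumes two-dimensional hand-made vectors in place of LLM hidden states, uses the centroid of unsafe samples as a stand-in for the rejection region, and introduces hypothetical helpers (`centroid`, `harm_gate`, `intervene`) and a hypothetical `sharpness` parameter purely for illustration.

```python
import math


def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def harm_gate(h, safe_c, unsafe_c, sharpness=10.0):
    """Hypothetical harmfulness gate in (0, 1).

    Projects h onto the safe-to-unsafe axis (0 at the safe centroid,
    1 at the unsafe centroid) and squashes the result with a sigmoid
    centered at the midpoint between the two centroids.
    """
    direction = [u - s for u, s in zip(unsafe_c, safe_c)]
    norm_sq = sum(d * d for d in direction)
    t = sum((x - s) * d for x, s, d in zip(h, safe_c, direction)) / norm_sq
    return 1.0 / (1.0 + math.exp(-sharpness * (t - 0.5)))


def intervene(h, safe_c, unsafe_c):
    """Shift h toward the unsafe centroid, scaled by the gate.

    Benign-looking inputs get a gate near 0 and are left almost
    unchanged; harmful-looking inputs are pulled toward the region
    where (by assumption) the model refuses.
    """
    alpha = harm_gate(h, safe_c, unsafe_c)
    return [x + alpha * (u - x) for x, u in zip(h, unsafe_c)]


# Toy data (assumed, not from the paper).
safe_c = centroid([[0.0, 0.0], [0.2, -0.1]])
unsafe_c = centroid([[2.0, 2.0], [2.2, 1.8]])

jailbreak = [1.4, 1.1]   # sits between the two clusters
benign = [0.1, 0.0]      # sits inside the safe cluster

jailbreak_moved = intervene(jailbreak, safe_c, unsafe_c)
benign_moved = intervene(benign, safe_c, unsafe_c)

print("jailbreak distance to rejection stand-in:",
      math.dist(jailbreak, unsafe_c), "->", math.dist(jailbreak_moved, unsafe_c))
print("benign displacement:", math.dist(benign, benign_moved))
```

The gate is what makes the adjustment depend on the input: a fixed shift applied to every representation would also move benign queries, which is the limitation of existing interventions that the abstract points to.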
