Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

Xin Yi; Yue Li; Dongsheng Shi; Linlin Wang; Xiaoling Wang; Liang He

arXiv:2501.10639 · cs.CR · June 2, 2025

TL;DR

This paper presents LATPC, a defense framework for large language models that improves safety against jailbreak attacks while preserving utility. It dynamically identifies safety-critical features and calibrates responses at the embedding level.

Contribution

LATPC combines dynamic latent-space adversarial training with post-aware calibration to improve the balance between safety and utility when defending LLMs against jailbreak attacks.

Findings

01. LATPC outperforms existing defenses on five jailbreak attack types.

02. LATPC reduces over-defense behavior during inference.

03. LATPC relies on safety-critical latent dimensions for robust protection.

Abstract

Ensuring safety alignment is a critical requirement for large language models (LLMs), particularly given increasing deployment in real-world applications. Despite considerable advancements, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to circumvent safety measures and elicit harmful or inappropriate outputs. Furthermore, while adversarial training-based defense methods have shown promise, a prevalent issue is the unintended over-defense behavior, wherein models excessively reject benign queries, significantly undermining their practical utility. To address these limitations, we introduce LATPC, a Latent-space Adversarial Training with Post-aware Calibration framework. LATPC dynamically identifies safety-critical latent dimensions by contrasting harmful and benign inputs, enabling the adaptive construction of targeted refusal feature removal attacks.…
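
The abstract's idea of contrasting harmful and benign inputs to find safety-critical latent dimensions, and then building a refusal feature removal attack from them, can be illustrated with a toy sketch. This is not the authors' implementation: the mean-difference contrast, the top-k selection rule, the function names, and the activation values are all assumptions made for illustration, and real hidden states would come from a model layer rather than hand-written lists.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def safety_critical_dims(harmful, benign, k):
    """Contrast the two groups and pick the k dimensions that differ most.

    Assumption: the contrast is a plain difference of means; the paper
    may use a different statistic or selection rule.
    """
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(benign))]
    top = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)[:k]
    return diff, sorted(top)


def remove_refusal_feature(hidden, diff, dims):
    """Shift a hidden state toward the benign mean on the selected dims only."""
    out = list(hidden)
    for i in dims:
        out[i] -= diff[i]
    return out


# Made-up activations: dimensions 1 and 3 separate harmful from benign inputs.
harmful = [[0.1, 2.0, 0.0, -1.5], [0.0, 2.2, 0.1, -1.7]]
benign = [[0.1, 0.1, 0.0, 0.1], [0.0, 0.3, 0.1, -0.1]]

diff, dims = safety_critical_dims(harmful, benign, k=2)
attacked = remove_refusal_feature(harmful[0], diff, dims)
print(dims)  # [1, 3]
```

In an adversarial training loop, a perturbed state such as `attacked` would serve as the hard example the model is trained to still refuse. Restricting the perturbation to the selected dimensions is what makes the attack targeted rather than a generic latent-space perturbation.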

Code & Models

Repositories

xinykou/Against_Jailbreak · jax · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning