Mitigating Jailbreaks with Intent-Aware LLMs

Wei Jie Yeo; Ranjan Satapathy; Erik Cambria

arXiv:2508.12072·cs.CR·August 26, 2025

Mitigating Jailbreaks with Intent-Aware LLMs

Wei Jie Yeo, Ranjan Satapathy, Erik Cambria

PDF

3 Reviews

TL;DR

This paper introduces Intent-FT, a lightweight fine-tuning method that enhances large language models' robustness against jailbreak attacks by training them to infer instruction intent, significantly reducing attack success rates while maintaining performance.

Contribution

The paper proposes Intent-FT, a novel fine-tuning approach that improves LLMs' ability to detect harmful intent in instructions, effectively mitigating jailbreak attacks and preserving model utility.

Findings

01

Intent-FT reduces attack success rates below 50% across various attack types.

02

Models trained with Intent-FT better identify hidden harmful intents.

03

The method maintains model capabilities and reduces unnecessary refusals.

Abstract

Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. In this work, we propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding. By fine-tuning on a targeted set of adversarial instructions, Intent-FT enables LLMs to generalize intent deduction to unseen attacks, thereby substantially improving their robustness. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 2Confidence 3

Strengths

1. This paper proposes an intent-aware lightweight fine-tuning framework (Intent-FT) that incorporates intent inference to strengthen model’s defense against jailbreak attacks. 2. Compared to defense baselines, the experimental results show that the proposed Intent-FT consistently exhibits robust defense across different attack types and models.

Weaknesses

1. The paper's core contribution is the introduction of fine-tuning for identifying the intention behind a query. While I respect the idea, the contribution is relatively incremental and provides limited further insight into defense. 2. Although the evaluation includes Adaptive Attack, some recent attack methods such as I-GCG [1] and DRL [2] are not considered. Including these would provide a more comprehensive assessment. [1] Jia, Xiaojun, et al. "Improved Techniques for Optimization-Based J

Reviewer 02Rating 8Confidence 3

Strengths

1. Comprehensive evaluation: It tests both parametric (Harmful-FT) and non-parametric (PAIR, AA) attacks, covering open-source and proprietary models, which ensures robustness of results (e.g., Llama’s INTENT-FT reduces PAIR ASR to 19% vs Vanilla’s 88%). 2. Targeted over-refusal mitigation: Unlike baselines (e.g., Safety-FT) that increase over-refusal on XSTEST, INTENT-FT lowers refusal rates for Llama and GPT-4.1, as it trains on both harmful and benign intent deduction. 3. Intent transferabi

Weaknesses

1. Limited white-box attack coverage: It only tests Ablation and ActAdd white-box methods; other techniques like CipherChat or Autodan are unexamined, which may underestimate real-world threats—adding these stronger attacks could improve generalizability. 2. Narrow dataset scale testing: The impact of D_I size is only tested up to 100 samples; larger D_I (e.g., 500+) or diverse datasets (e.g., industry-specific harmful prompts) are untested, leaving scalability unclear.

Reviewer 03Rating 6Confidence 4

Strengths

- The motivation is intuitive, and the method is simple and broadly applicable. - Although it introduces additional computational cost during fine-tuning, the overhead appears to be marginal (not sure), and the empirical results are strong.

Weaknesses

This paper presents a simple yet effective method that is appealing to me, but I believe it can be improved with the following recommendations: - Important experimental setup details are missing. For example, the fine-tuning appears marginal (100 samples, as noted in line 192), but key details such as the number of epochs/iterations, wall-clock fine-tuning time, and whether LoRA or full fine-tuning was used are not provided. - Presentation could be improved: - The notation is confusing. Subs

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.