Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Peng Ding; Jun Kuang; Wen Sun; Zongyu Wang; Xuezhi Cao; Xunliang Cai; Jiajun Chen; Shujian Huang

arXiv:2511.00556·cs.CL·November 4, 2025

TL;DR

This paper introduces ISA, a novel intent shift attack that minimally modifies prompts to deceive LLMs into misperceiving harmful requests as benign, exposing vulnerabilities in current safety mechanisms and highlighting the need for better defenses.

Contribution

We propose ISA, a new attack method that obfuscates intent with minimal edits, achieving high success rates and exposing weaknesses in existing safety defenses.

Findings

01. ISA achieves over 70% success rate in attacks.

02. Fine-tuning with ISA templates raises the attack success rate to nearly 100%.

03. Existing defenses are ineffective against ISA.

Abstract

Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods relying on complex tokens or lengthy context, our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling