Say It Differently: Linguistic Styles as Jailbreak Vectors

Srikant Panda; Avinash Rai

arXiv:2511.10519·cs.CL·November 14, 2025

TL;DR

This paper reveals that linguistic styles can be exploited to bypass safety measures in large language models, and proposes a style neutralization method to mitigate this vulnerability.

Contribution

The paper introduces a systematic study of stylistic jailbreaks, builds a benchmark of style-augmented prompts, and proposes a style neutralization defense.

Findings

01

Stylistic reframing increases jailbreak success rates by up to 57 percentage points.

02

Fearful, curious, and compassionate styles are most effective.

03

Style neutralization reduces jailbreak success significantly.

Abstract

Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious, and compassionate are most effective, and contextualized rewrites outperform templated variants. To mitigate…
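
The headline figure is a difference in percentage points, not a relative percent increase. As a rough sketch of that arithmetic only (not the authors' evaluation code), the snippet below computes a per-style success rate and its gap from a baseline. The record format, the "neutral" baseline label, the function names, and the toy outcomes are all assumptions made for illustration; none come from the paper.

```python
from collections import defaultdict

def success_rate(outcomes):
    """Share of attempts judged unsafe, expressed as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0

def style_deltas(records, baseline="neutral"):
    """Percentage-point gap between each style's success rate and the baseline's.

    `records` is an iterable of (style, jailbroken) pairs, where `jailbroken`
    is a boolean verdict from some external judge (hypothetical here).
    """
    by_style = defaultdict(list)
    for style, jailbroken in records:
        by_style[style].append(bool(jailbroken))
    base = success_rate(by_style[baseline])
    return {
        style: success_rate(outcomes) - base
        for style, outcomes in by_style.items()
        if style != baseline
    }

# Toy verdicts, invented purely to show the calculation.
toy_records = [
    ("neutral", False), ("neutral", False), ("neutral", False), ("neutral", True),
    ("fearful", True), ("fearful", True), ("fearful", True), ("fearful", False),
    ("curious", True), ("curious", False), ("curious", True), ("curious", False),
]
deltas = style_deltas(toy_records)
```

With these toy verdicts the baseline rate is 25% and the "fearful" rate is 75%, so the gap is 50 percentage points, even though the relative increase would be 200%.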

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Hate Speech and Cyberbullying Detection