When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Yuxin Xiao; Sana Tonekaboni; Walter Gerych; Vinith Suriyakumar; Marzyeh Ghassemi

arXiv:2506.07452·cs.LG·February 26, 2026

TL;DR

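This paper shows that style patterns in prompts (e.g., asking for a response formatted as a list), though semantically irrelevant to the malicious intent, can make LLMs more vulnerable to jailbreaks. It defines this effect as ASR inflation, examines how superficial style alignment increases it, and proposes SafeStyle as a mitigation strategy.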

Contribution

The paper uncovers the role of style patterns in LLM safety risks and introduces SafeStyle, a fine-tuning approach that mitigates style-based jailbreak vulnerabilities.

Findings

01

Nearly all models show ASR inflation due to style patterns.

02

ASR inflation correlates with attention to style patterns.

03

SafeStyle consistently improves safety across models and styles.

Abstract

Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating 36 LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style…
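The definition of ASR inflation above can be illustrated with a minimal sketch. It assumes that each benchmark query is evaluated twice, once in its original form (style plus malicious intent) and once reduced to the malicious intent alone, and that a judge has already labeled each response as a successful attack or not. The function names, the boolean judgment lists, and the toy values are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Fraction of queries whose response was judged a successful attack."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def asr_inflation(with_style: Sequence[bool], intent_only: Sequence[bool]) -> float:
    """ASR on original (styled) queries minus ASR on intent-only queries.

    Assumption: both lists are aligned, i.e. index i refers to the same
    underlying malicious intent in both conditions.
    """
    if len(with_style) != len(intent_only):
        raise ValueError("both conditions must cover the same queries")
    return attack_success_rate(with_style) - attack_success_rate(intent_only)


# Hypothetical judge labels for four queries (True = attack succeeded).
with_style = [True, True, False, True]     # original benchmark queries
intent_only = [True, False, False, False]  # same queries, style removed

inflation = asr_inflation(with_style, intent_only)
print(f"ASR inflation: {inflation:.2f}")  # 0.75 - 0.25 = 0.50
```

A positive value means the styled queries succeed more often than their intent-only counterparts. How the intent is extracted and how responses are judged are separate design choices that this sketch leaves out.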

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper conducts extensive experiments across a large set of LLMs, providing strong empirical evidence for the identified vulnerability. The experimental design is solid and systematic, covering multiple styles, models, and datasets to validate the observed phenomenon. It investigates an under-studied but important topic in LLM safety—style compliance—highlighting a novel and practical risk overlooked by prior alignment research. The proposed defense method (SafeStyle) is simple yet effectiv

Weaknesses

While the paper provides strong empirical evidence for style-induced vulnerabilities and introduces an effective defense, the utility evaluation remains limited. The authors mainly rely on a style adaptation score (e.g., LC_WR) to demonstrate that SafeStyle preserves stylistic performance. However, this metric only measures the model’s ability to follow styles that were explicitly seen during fine-tuning, rather than testing general instruction-following or reasoning capabilities. Consequently,

Reviewer 02 · Rating 6 · Confidence 4

Strengths

(1) Clear phenomenon: “ASR inflation” characterizes a real, under-discussed confound in jailbreak evaluation. The paper shows that original benchmark prompts often combine “style + malicious intent,” and the style piece alone can tip models into compliance. (2) Breadth of evidence: 32 open LLMs across 7 benchmarks; 28/32 show higher ASR with style present, and SorryBench and MedSafetyBench affect the most models. Family trends (Mistral highest; Gemma lowest) are informative. (3) Mechanisti

Weaknesses

(1) Reliance on GPT-4o for multiple roles: GPT-4o is used to (i) extract malicious intents, (ii) judge ASR, and (iii) generate responses/safety data. This centralizes evaluation and could bias both labeling and defense wins toward the same model family’s preferences; more judge diversity would strengthen claims. (2) Generalization beyond open models/non-reasoning models: Claims are strongest for open models (Llama/Gemma/Qwen/Mistral/OLMo). It remains unclear how large frontier/proprietary models be

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. The paper tackles an important question — how “style patterns” influence safety — and studies it across two realistic threat settings (jailbreak attacks and instruction-finetuning attacks), providing a focused experimental pipeline for observing safety changes. 2. The authors provide a detailed analysis of how safety-oriented data can mitigate the risks that arise from style patterns in instruction fine-tuning. 3. The paper is well written and easy to follow.

Weaknesses

1. The paper offers examples of style patterns (e.g., “create a list of xx”). In the experiments, the authors extract the core harmful intent of the original malicious instructions and remove other content, treating all removed content as style patterns. I find this operation overly broad and unconvincing: there is no justification for treating everything that was removed as a style pattern. A stricter, better-controlled design is needed — for example, explicitly appending the hypothesized style

Code & Models

Repositories

xiaoyuxin1002/SafeStyle
PyTorch · Official

Taxonomy

Methods: Softmax · Attention Is All You Need