Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
Chak Tou Leong, Qingyu Yin, Jian Wang, Wenjie Li

TL;DR
This paper investigates how the reliance on fixed templates in large language models' safety mechanisms makes them vulnerable to jailbreak attacks, and proposes detaching safety features from templates to improve robustness.
Contribution
It identifies template-anchored safety alignment as a widespread vulnerability in aligned LLMs and demonstrates that detaching safety mechanisms from templates can mitigate jailbreak susceptibility.
Findings
Template-anchored safety alignment is common across various LLMs.
Models are vulnerable to inference-time jailbreak attacks due to template reliance.
Detaching safety mechanisms from templates reduces jailbreak vulnerabilities.
Abstract
The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMaritime Navigation and Safety
