Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries
Ki Sen Hung, Xi Yang, Chang Liu, Haoran Li, Kejiang Chen, Changxuan Fan, Tsun On Kwok, Weiming Zhang, Xiaomeng Li, Yangqiu Song

TL;DR
This paper shows that domain-specific contexts can weaken LLM safety measures, introduces Jargon, an attack framework that exploits this weakness, and proposes a safeguard policy that improves model safety without sacrificing helpfulness.
Contribution
It introduces Jargon, a framework that combines safety-research contexts with multi-turn adversarial interactions, and develops a safeguard policy to improve LLM safety in context-sensitive scenarios.
Findings
Jargon achieves over 93% attack success rate across seven frontier models.
Activation space analysis shows that Jargon queries occupy a gray zone: an intermediate region between benign and harmful inputs.
The safeguard policy reduces attack success rates while maintaining helpfulness.
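One common way to make the "gray zone" idea concrete is to project activation vectors onto the direction from the mean benign activation to the mean harmful activation. The sketch below is only an illustration of that general technique under stated assumptions: this summary does not describe the paper's actual analysis, the function names (`mean_vector`, `projection_scores`) are hypothetical, and the 2-D vectors are made-up toy data rather than real model activations.

```python
from statistics import mean


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(column) for column in zip(*vectors)]


def projection_scores(benign, harmful, queries):
    """Place each query along the benign-to-harmful mean-difference axis.

    A score of 0 corresponds to the benign mean and 1 to the harmful mean,
    so values strictly between 0 and 1 fall in the intermediate region.
    """
    mu_benign = mean_vector(benign)
    mu_harmful = mean_vector(harmful)
    direction = [h - b for h, b in zip(mu_harmful, mu_benign)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("benign and harmful means coincide")
    return [
        sum((x - b) * d for x, b, d in zip(query, mu_benign, direction)) / norm_sq
        for query in queries
    ]


# Toy data (assumed for illustration only, not taken from the paper).
benign = [[0.0, 0.1], [0.2, -0.1], [-0.2, 0.0]]
harmful = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.0]]
gray = [[1.0, 0.3], [1.1, -0.2]]

scores = projection_scores(benign, harmful, gray)
print(scores)
```

With real models, the vectors would be hidden-state activations from a chosen layer, and the choice of layer and projection direction would itself be an assumption worth validating.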
Abstract
A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign…
