From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma

TL;DR
This paper introduces TSC-GRPO, a novel framework that enhances language model safety by disentangling and pinning semantic intent, effectively defending against adversarial prefix attacks while maintaining utility.
Contribution
It proposes a causal intent disentanglement method and a two-stage training process to improve language model robustness against adversarial prompts.
Findings
TSC-GRPO outperforms baselines in jailbreak attack resistance.
The method maintains model utility while enhancing safety.
Causal intent probing effectively isolates malicious intent signals.
Abstract
Large Language Models remain vulnerable to adversarial prefix attacks (e.g., ``Sure, here is'') despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within ``fork-in-the-road'' training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
