Towards Context-Invariant Safety Alignment for Large Language Models
Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

TL;DR
This paper introduces Anchor Invariance Regularization (AIR), a method for context-invariant safety alignment in large language models that anchors behavior on open-ended prompt variants to behavior on verifiable variants of the same prompt.
Contribution
The paper proposes AIR, a plug-in auxiliary loss that enhances safety robustness by enforcing invariance using verifiable prompts as anchors, combined with group-based preference optimization.
Findings
AIR improves in-distribution group accuracy by 12.71%.
AIR boosts out-of-distribution consistency by 33.49%.
AIR enhances safety robustness against adversarial framings.
Abstract
Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance…
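The failure mode described in the abstract, where a symmetric invariance penalty closes the gap between variants by degrading the reliable one, can be illustrated with a toy example. The sketch below is not the AIR loss (this summary does not give its formula). It is a hypothetical two-scalar setup in which `anchor` stands for quality on a verifiable variant and `open_ended` for quality on an open-ended variant, and it assumes that one way to make the penalty asymmetric is to treat the anchor as a constant (a stop-gradient). All function names, the learning rate, and the starting scores are made up for illustration.

```python
# Hypothetical toy illustration; NOT the AIR loss from the paper.
# Each prompt variant's quality is reduced to one scalar score that we
# update directly with gradient descent on an invariance penalty.


def symmetric_step(anchor, open_ended, lr=0.1):
    """One gradient step on 0.5 * (anchor - open_ended)**2 w.r.t. both scores."""
    gap = anchor - open_ended
    # d/d(anchor) = gap, d/d(open_ended) = -gap: both scores move.
    return anchor - lr * gap, open_ended + lr * gap


def anchored_step(anchor, open_ended, lr=0.1):
    """Same penalty, but the anchor is held constant (assumed stop-gradient)."""
    gap = anchor - open_ended
    # Only the open-ended score receives a gradient.
    return anchor, open_ended + lr * gap


def run(step, anchor=0.9, open_ended=0.3, steps=200):
    """Apply `step` repeatedly from illustrative starting scores."""
    for _ in range(steps):
        anchor, open_ended = step(anchor, open_ended)
    return anchor, open_ended


sym_anchor, sym_open = run(symmetric_step)
anc_anchor, anc_open = run(anchored_step)
print(f"symmetric: anchor={sym_anchor:.3f}, open_ended={sym_open:.3f}")
print(f"anchored:  anchor={anc_anchor:.3f}, open_ended={anc_open:.3f}")
```

In this toy setup the symmetric update drives both scores toward their midpoint, so the gap closes partly by lowering the anchor. The anchored update leaves the anchor where it started and moves only the open-ended score toward it, which is the direction of improvement the abstract argues for.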
