Towards Context-Invariant Safety Alignment for Large Language Models

Yixu Wang; Yang Yao; Xin Wang; Yifeng Gao; Yan Teng; Xingjun Ma; Yingchun Wang

arXiv:2605.20994·cs.CL·May 21, 2026

TL;DR

This paper introduces Anchor Invariance Regularization (AIR), a method that improves context-invariant safety alignment in large language models by anchoring responses to open-ended prompts to responses on verifiable prompts.

Contribution

The paper proposes AIR, a plug-in auxiliary loss that enhances safety robustness by enforcing invariance using verifiable prompts as anchors, combined with group-based preference optimization.

Findings

01 AIR improves in-distribution group accuracy by 12.71%.

02 AIR boosts out-of-distribution consistency by 33.49%.

03 AIR enhances safety robustness against adversarial framings.

Abstract

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance…
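
The failure mode of symmetric invariance regularizers described above can be illustrated with a toy sketch. This is not the paper's loss: the abstract is truncated before AIR is defined, so everything here is an assumption made for illustration. Two scalar scores stand in for the model's quality on a verifiable (anchor) variant and an open-ended variant of the same prompt, and the function names `symmetric_step`, `anchored_step`, and `run` are hypothetical. The symmetric penalty updates both scores; the anchored version treats the verifiable score as a constant (a stop-gradient), so only the open-ended score moves.

```python
# Toy illustration only: scalar "scores" stand in for model behaviour on
# two variants of the same prompt. Not the loss used in the paper.

def symmetric_step(anchor, open_ended, lr=0.1):
    """One gradient step on (anchor - open_ended)**2 w.r.t. BOTH scores."""
    gap = anchor - open_ended
    # d/d(anchor) = 2 * gap, d/d(open_ended) = -2 * gap
    return anchor - lr * 2 * gap, open_ended + lr * 2 * gap


def anchored_step(anchor, open_ended, lr=0.1):
    """Same penalty, but the anchor is held fixed (stop-gradient assumption)."""
    gap = anchor - open_ended
    # Only the open-ended score receives a gradient update.
    return anchor, open_ended + lr * 2 * gap


def run(step, anchor=0.9, open_ended=0.3, n_steps=50):
    """Apply `step` repeatedly and return the final (anchor, open_ended)."""
    for _ in range(n_steps):
        anchor, open_ended = step(anchor, open_ended)
    return anchor, open_ended


if __name__ == "__main__":
    print("symmetric:", run(symmetric_step))
    print("anchored: ", run(anchored_step))
```

In this toy setting both penalties close the gap, but the symmetric one does so by meeting in the middle, which lowers the score on the reliable variant, while the anchored one leaves the reliable score untouched and pulls the open-ended score up toward it.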
