Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin, Van Durme

TL;DR
Controllable Safety Alignment (CoSA) enables large language models to adapt their safety behaviors at inference time based on user-defined safety configurations, enhancing flexibility across diverse social norms without retraining.
Contribution
We introduce CoSAlign, a data-centric method for dynamic safety alignment, and develop CoSA-Score and CoSApien benchmark to evaluate and facilitate adaptable safety in LLMs.
Findings
CoSAlign significantly improves controllability over baseline methods.
The framework allows real-time safety behavior adjustments via safety configs.
Our evaluation shows increased alignment with diverse safety requirements.
Abstract
The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs -- free-form natural language descriptions of the desired safety behaviors -- that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsSafety Systems Engineering in Autonomy · Risk and Safety Analysis · Software Reliability and Analysis Research
MethodsALIGN
