Controllable Safety Alignment: Inference-Time Adaptation to Diverse   Safety Requirements

Jingyu Zhang; Ahmed Elgohary; Ahmed Magooda; Daniel Khashabi; Benjamin; Van Durme

arXiv:2410.08968·cs.CL·March 5, 2025·2 cites

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin, Van Durme

PDF

Open Access 5 Datasets 1 Video

TL;DR

Controllable Safety Alignment (CoSA) enables large language models to adapt their safety behaviors at inference time based on user-defined safety configurations, enhancing flexibility across diverse social norms without retraining.

Contribution

We introduce CoSAlign, a data-centric method for dynamic safety alignment, and develop CoSA-Score and CoSApien benchmark to evaluate and facilitate adaptable safety in LLMs.

Findings

01

CoSAlign significantly improves controllability over baseline methods.

02

The framework allows real-time safety behavior adjustments via safety configs.

03

Our evaluation shows increased alignment with diverse safety requirements.

Abstract

The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs -- free-form natural language descriptions of the desired safety behaviors -- that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

Videos

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements· slideslive

Taxonomy

TopicsSafety Systems Engineering in Autonomy · Risk and Safety Analysis · Software Reliability and Analysis Research

MethodsALIGN