One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Xinmeng Huang; Shuo Li; Edgar Dobriban; Osbert Bastani; Hamed Hassani; Dongsheng Ding

arXiv:2405.19544·cs.AI·November 25, 2024

TL;DR

This paper reduces constrained safety alignment of large language models to an equivalent unconstrained alignment problem by pre-optimizing a closed-form dual function. This avoids primal-dual policy iterations, which lowers computational cost and improves training stability.

Contribution

The paper proposes a dualization-based method that converts constrained alignment into an equivalent unconstrained problem, enabling efficient and stable training of safe language models.

Findings

01 Reduces the computational cost of safety alignment by eliminating primal-dual policy iterations

02 Improves training stability for large language models

03 Demonstrates the effectiveness of the approach through extensive experiments

Abstract

The growing safety concerns surrounding large language models raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, typical Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based…
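The dualization idea in the abstract can be illustrated with a small numerical sketch. It assumes the standard KL-regularized form of constrained alignment: maximize expected reward r minus β times the KL divergence to a reference model, subject to an expected safety score g of at least b. Under that assumption the dual function is D(λ) = β · E_x[log E_{y ~ π_ref} exp((r + λg)/β)] − λb, which is smooth and convex in λ. The exact formulation in the paper may differ; the names `r`, `g`, `b`, `beta`, `dual_value_and_grad`, and `solve_dual`, and the synthetic scores, are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def dual_value_and_grad(lam, r, g, b, beta):
    """Monte Carlo estimate of the dual function D(lam) and its derivative.

    r, g: arrays of shape (n_prompts, n_samples) holding reward and safety
    scores of responses sampled from the reference model (assumed setup).
    """
    z = (r + lam * g) / beta
    z_max = z.max(axis=1, keepdims=True)           # stabilize the log-mean-exp
    w = np.exp(z - z_max)
    log_mean_exp = z_max[:, 0] + np.log(w.mean(axis=1))
    value = beta * log_mean_exp.mean() - lam * b
    # Normalized weights approximate the tilted policy pi_lam over the samples;
    # the derivative of D is the expected safety score under pi_lam minus b.
    p = w / w.sum(axis=1, keepdims=True)
    grad = (p * g).sum(axis=1).mean() - b
    return value, grad


def solve_dual(r, g, b, beta, lr=0.5, steps=500):
    """Projected gradient descent on the convex dual over lam >= 0."""
    lam = 0.0
    for _ in range(steps):
        _, grad = dual_value_and_grad(lam, r, g, b, beta)
        lam = max(0.0, lam - lr * grad)
    return lam


# Synthetic scores: safety is made anti-correlated with reward on purpose,
# so the constraint is violated at lam = 0 and the multiplier must be positive.
rng = np.random.default_rng(0)
r = rng.normal(size=(64, 32))
g = -0.5 * r + rng.normal(size=(64, 32))
b, beta = 0.2, 1.0

lam_star = solve_dual(r, g, b, beta)
value_star, grad_star = dual_value_and_grad(lam_star, r, g, b, beta)
value_zero, grad_zero = dual_value_and_grad(0.0, r, g, b, beta)
print(f"lam*={lam_star:.3f}  dD/dlam at lam*={grad_star:.2e}")
```

Once a multiplier such as `lam_star` is fixed, the combined score r + λ*·g can serve as the reward of a single unconstrained alignment run, which is the sense in which the alternating primal-dual updates are avoided in this sketch.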

Code & Models

Repositories

shuoli90/CAN · PyTorch · Official

Videos

One-Shot Safety Alignment for Large Language Models via Optimal Dualization · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN