Lifelong Safety Alignment for Language Models

Haoyu Wang; Zeyu Qin; Yifei Zhao; Chao Du; Min Lin; Xueqian Wang; Tianyu Pang

arXiv:2505.20259 · cs.CR · May 27, 2025

TL;DR

This paper introduces a lifelong safety alignment framework for LLMs that continuously adapts to new jailbreaking strategies by training a Meta-Attacker and Defender in a competitive setup, significantly improving robustness.

Contribution

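It presents a lifelong safety alignment framework that pits a Meta-Attacker against a Defender. The Meta-Attacker is warmed up with insights extracted from jailbreak-related research papers, and the Defender is trained to resist the attacks it discovers, with the goal of hardening LLMs against unseen attacks.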

Findings

01. Meta-Attacker achieves 73% attack success rate initially
02. Defender reduces attack success rate to 7% after training
03. Framework enables safer deployment of LLMs in open environments

Abstract

LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack…
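The alternating attacker/defender setup described in the abstract can be illustrated with a toy simulation. This is only a minimal sketch of the loop's shape, not the authors' implementation: `ToyMetaAttacker`, `ToyDefender`, `attack_success_rate`, and `lifelong_loop` are hypothetical names, strategies are opaque placeholder strings, and "training" is reduced to remembering strategies that previously succeeded. The numbers it produces are unrelated to the results reported in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDefender:
    """Hypothetical stand-in for the Defender: remembers strategies it was trained on."""
    resisted: set = field(default_factory=set)

    def blocks(self, strategy: str) -> bool:
        return strategy in self.resisted

    def train_on(self, strategies) -> None:
        self.resisted.update(strategies)


@dataclass
class ToyMetaAttacker:
    """Hypothetical stand-in for the Meta-Attacker: holds a pool of placeholder strategies."""
    pool: list

    def propose(self, round_idx: int) -> list:
        # Toy "discovery" (an assumption): derive one new variant from each
        # of the first two strategies in the pool on every round.
        self.pool = self.pool + [f"{s}-v{round_idx}" for s in self.pool[:2]]
        return list(self.pool)


def attack_success_rate(strategies, defender) -> float:
    """Fraction of proposed strategies the defender does not block."""
    successes = [s for s in strategies if not defender.blocks(s)]
    return len(successes) / len(strategies)


def lifelong_loop(attacker, defender, rounds: int) -> list:
    """Alternate attacker proposals and defender updates; record the rate per round."""
    history = []
    for r in range(1, rounds + 1):
        strategies = attacker.propose(r)
        before = attack_success_rate(strategies, defender)
        successful = [s for s in strategies if not defender.blocks(s)]
        defender.train_on(successful)  # defender adapts to what got through
        after = attack_success_rate(strategies, defender)
        history.append((r, before, after))
    return history


if __name__ == "__main__":
    log = lifelong_loop(ToyMetaAttacker(pool=["s1", "s2"]), ToyDefender(), rounds=3)
    for r, before, after in log:
        print(f"round {r}: success rate before={before:.2f}, after={after:.2f}")
```

In this toy version the success rate before each defender update shrinks across rounds because only newly derived variants get through, which mirrors the intended dynamic only qualitatively.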

Code & Models

Repositories

sail-sg/lifelongsafetyalignment (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Focus