PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

Jiaming Ji; Donghai Hong; Borong Zhang; Boyuan Chen; Juntao Dai; Boren Zheng; Tianyi Qiu; Jiayi Zhou; Kaile Wang; Boxuan Li; Sirui Han; Yike Guo; Yaodong Yang

arXiv:2406.15513 · cs.AI · June 17, 2025 · 2 cites

TL;DR

This paper introduces PKU-SafeRLHF, a large dataset for safety alignment of LLMs in which helpfulness and harmlessness are annotated separately, supporting safety-focused training and moderation of language models.

Contribution

The paper presents a safety preference dataset with decoupled helpfulness and harmlessness annotations and per-answer severity levels, supporting severity-sensitive moderation and safety-centric RLHF for LLMs.

Findings

01 Developed 44.6k prompts and 265k QA pairs with safety labels

02 Collected 166.8k preference annotations for safety training

03 Enabled severity-sensitive moderation and safety RLHF algorithms

Abstract

In this study, we introduce PKU-SafeRLHF, a human preference dataset for safety, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we annotate helpfulness and harmlessness separately for question-answer pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data, comprising dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off from scratch). Using the large-scale annotation data, we further train severity-sensitive moderation for the risk…
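To make the dual-preference idea concrete, the following is a minimal Python sketch of how one such record might be represented and split into separate training pairs. It is an illustration only: the class names, field names, the `to_training_pairs` helper, and the middle severity label `MODERATE` are all hypothetical and are not taken from the released dataset schema or the paper's code.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding: 0 = safe, 1..3 = three severity levels.

    The abstract names only "minor" and "severe"; MODERATE is an assumed label.
    """
    SAFE = 0
    MINOR = 1
    MODERATE = 2
    SEVERE = 3


@dataclass(frozen=True)
class Answer:
    text: str
    severity: Severity
    harm_categories: tuple[str, ...] = ()  # placeholder names, not the real 19 categories


@dataclass(frozen=True)
class DualPreference:
    """One prompt with two answers and two independent preference labels."""
    prompt: str
    answers: tuple[Answer, Answer]
    more_helpful: int  # index (0 or 1) of the answer judged more helpful
    safer: int         # index (0 or 1) of the answer judged more harmless

    def is_conflicting(self) -> bool:
        # True when the more helpful answer is not also the safer one.
        return self.more_helpful != self.safer


def to_training_pairs(rec: DualPreference) -> dict[str, tuple[str, str]]:
    """Return (chosen, rejected) answer texts for each preference dimension."""
    a = rec.answers
    return {
        "helpfulness": (a[rec.more_helpful].text, a[1 - rec.more_helpful].text),
        "harmlessness": (a[rec.safer].text, a[1 - rec.safer].text),
    }
```

Keeping the two labels as separate fields means a record where the more helpful answer is also the less safe one stays visible as a conflict. A single-preference record, by contrast, would carry only one combined label.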

Taxonomy

Topics: Natural Language Processing Techniques · Software Engineering Research