BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang

TL;DR
The paper introduces the BeaverTails dataset, which separates helpfulness and harmlessness annotations for question-answer pairs, aiming to improve safety alignment in large language models through practical applications like content moderation and RLHF.
Contribution
It presents a large, annotated dataset for safety alignment in LLMs, enabling better safety measures and supporting research in content moderation and reinforcement learning with human feedback.
Findings
Dataset contains safety meta-labels for 333,963 QA pairs and 361,903 pairs of expert comparison data.
Application examples include content moderation and RLHF.
These applications demonstrate the dataset's potential to support practical safety measures in LLMs.
Abstract
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL:…
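The separation of helpfulness and harmlessness annotations described above can be illustrated with a small sketch. The record layout and field names below (`ComparisonRecord`, `more_helpful`, `more_harmless`, `is_safe`) are hypothetical and chosen for illustration only; they are not the dataset's actual schema. The idea shown is that one comparison record can yield two independent preference pairs, one per metric, which could then feed separate reward-style and cost-style models.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ComparisonRecord:
    """Hypothetical record: two answers to one question, judged on two metrics."""
    question: str
    answers: Tuple[str, str]
    is_safe: Tuple[bool, bool]  # per-answer safety meta-label (assumed field)
    more_helpful: int           # index (0 or 1) of the answer judged more helpful
    more_harmless: int          # index (0 or 1) of the answer judged more harmless


Preference = Tuple[str, str, str]  # (question, chosen answer, rejected answer)


def split_preferences(
    records: List[ComparisonRecord],
) -> Tuple[List[Preference], List[Preference]]:
    """Build one preference list per metric from the same comparison records."""
    helpful: List[Preference] = []
    harmless: List[Preference] = []
    for r in records:
        # The two judgments are independent, so the winner may differ per metric.
        for winner, out in ((r.more_helpful, helpful), (r.more_harmless, harmless)):
            out.append((r.question, r.answers[winner], r.answers[1 - winner]))
    return helpful, harmless


# Made-up example in which the two metrics disagree about the better answer.
records = [
    ComparisonRecord(
        question="How do I pick a lock?",
        answers=("Detailed step-by-step instructions.", "I can't help with that."),
        is_safe=(False, True),
        more_helpful=0,
        more_harmless=1,
    )
]
helpful_pairs, harmless_pairs = split_preferences(records)
```

Keeping the two judgments in separate fields lets the same pair of answers be ranked differently on each metric, which a single combined preference label could not express.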
Code & Models
- 🤗 PKU-Alignment/beaver-7b-v1.0 · model · 52 dl · ♡ 13
- 🤗 PKU-Alignment/beaver-7b-v1.0-reward · model · 378 dl · ♡ 17
- 🤗 PKU-Alignment/beaver-7b-v1.0-cost · model · 520 dl · ♡ 10
- 🤗 PKU-Alignment/beaver-7b-v2.0 · model · 17 dl
- 🤗 PKU-Alignment/beaver-7b-v2.0-reward · model · 21 dl
- 🤗 PKU-Alignment/beaver-7b-v2.0-cost · model · 10 dl
- 🤗 PKU-Alignment/beaver-7b-v3.0 · model · 16 dl
- 🤗 PKU-Alignment/beaver-7b-v3.0-reward · model · 31 dl
- 🤗 PKU-Alignment/beaver-7b-v3.0-cost · model · 277 dl
- 🤗 PKU-Alignment/beaver-7b-unified-reward · model · 175 dl
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling
