BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Jiaming Ji; Mickel Liu; Juntao Dai; Xuehai Pan; Chi Zhang; Ce Bian; Chi Zhang; Ruiyang Sun; Yizhou Wang; Yaodong Yang

arXiv:2307.04657·cs.CL·November 8, 2023·34 cites

TL;DR

The paper introduces the BeaverTails dataset, which separates helpfulness and harmlessness annotations for question-answer pairs, aiming to improve safety alignment in large language models through practical applications like content moderation and RLHF.

Contribution

It presents a large, annotated dataset for safety alignment in LLMs, enabling better safety measures and supporting research in content moderation and reinforcement learning with human feedback.

Findings

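01 The dataset contains safety meta-labels for 333,963 QA pairs and 361,903 pairs of expert comparison data covering both helpfulness and harmlessness.

02 Demonstrated applications include content moderation and RLHF.

03 These applications indicate the dataset's potential for practical safety measures in LLMs.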

Abstract

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL:…
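
To make the separation of helpfulness and harmlessness concrete, the following is a minimal sketch of how the two kinds of records described in the abstract might be represented. The class names, field names, and the `metrics_disagree` helper are all hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """One question-answer pair with a safety meta-label (hypothetical schema)."""
    question: str
    answer: str
    is_safe: bool  # hypothetical field name for the safety meta-label


@dataclass(frozen=True)
class Comparison:
    """Two answers to one question, ranked separately on each metric.

    Hypothetical schema: each preference field holds "a" or "b".
    """
    question: str
    answer_a: str
    answer_b: str
    more_helpful: str
    more_harmless: str


def metrics_disagree(c: Comparison) -> bool:
    """Return True when the more helpful answer is not the more harmless one."""
    return c.more_helpful != c.more_harmless


if __name__ == "__main__":
    example = Comparison(
        question="placeholder question",
        answer_a="placeholder answer A",
        answer_b="placeholder answer B",
        more_helpful="a",
        more_harmless="b",
    )
    print(metrics_disagree(example))
```

Keeping the two preferences in separate fields, rather than one combined ranking, lets cases where the metrics conflict be identified and handled explicitly.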

Code & Models

Repositories

pku-alignment/safe-sora
pytorch

Videos

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset · slideslive

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling