Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni, Jonathan Colaço Carr, Yash More, Jackie CK Cheung, Golnoosh Farnadi

TL;DR
This paper audits Anthropic's Helpful and Harmless (HH) dataset, which is used to train large language models, and reveals quality issues and safety disparities that suggest the need for more nuanced safety approaches.
Contribution
It provides a comprehensive evaluation of the HH dataset through manual and automated review, demonstrates the dataset's limitations and its impact on model safety, and analyzes the 100 most influential papers citing it.
Findings
The dataset contains conceptualization failures affecting safety.
Quality issues can lead to safety disparities across demographics.
Auditing reveals the need for more nuanced safety mitigation methods.
Abstract
In an effort to mitigate the harms of large language models (LLMs), learning from human feedback (LHF) has been used to steer LLMs towards outputs that are intended to be both less harmful and more helpful. Despite the widespread adoption of LHF in practice, the quality of this feedback and its effectiveness as a safety mitigation technique remain unclear. This study addresses these issues by auditing the widely-used Helpful and Harmless (HH) dataset by Anthropic. Our work includes: (1) a thorough investigation of the dataset's content through both manual and automated evaluation; (2) experiments demonstrating the dataset's impact on models' safety; and (3) an analysis of the 100 most influential papers citing this dataset. Through our audit, we showcase how conceptualization failures and quality issues identified in the HH dataset can create additional harms by leading to disparate…
Taxonomy
Topics: Data Quality and Management
