Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Duanyu Feng; Bowen Qin; Chen Huang; Youcheng Huang; Zheng Zhang; Wenqiang Lei

arXiv:2406.08124·cs.CL·December 19, 2024

TL;DR

This paper introduces Legend, a cost-effective framework that uses representation engineering to automatically annotate safety margins in preference datasets, enhancing reward modeling and harmless alignment for large language models.

Contribution

Legend is the first framework to leverage a safety direction in embedding space for automatic margin annotation, improving dataset quality without additional training.

Findings

01 Effective in reward modeling and harmless alignment.

02 Requires only inference, with no additional training.

03 Scalable and easy to implement.

Abstract

The success of the reward model in distinguishing between responses with subtle safety differences depends critically on the high-quality preference dataset, which should capture the fine-grained nuances of harmful and harmless responses. This motivates the need to develop a dataset involving preference margins, which accurately quantify how harmless one response is compared to another. In this paper, we take the first step to propose an effective and cost-efficient framework to promote the margin-enhanced preference dataset development. Our framework, Legend, Leverages representation engineering to annotate preference datasets. It constructs the specific direction within the LLM's embedding space that represents safety. By leveraging this safety direction, Legend can then leverage the semantic distances of paired responses along this direction to annotate margins automatically. We…
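The margin idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the safety direction is constructed, so the sketch assumes a difference-of-means direction between harmless and harmful embeddings, a common representation-engineering choice. The function names (`safety_direction`, `project`, `annotate_margin`) and the toy two-dimensional embeddings are hypothetical; in practice the embeddings would come from an LLM's hidden states.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def safety_direction(harmless: List[Vector], harmful: List[Vector]) -> List[float]:
    """Unit vector from the harmful mean to the harmless mean.

    ASSUMPTION: difference-of-means is used here for illustration only;
    the paper may construct its safety direction differently.
    """
    mu_safe = mean_vector(harmless)
    mu_harm = mean_vector(harmful)
    diff = [s - h for s, h in zip(mu_safe, mu_harm)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("harmless and harmful means coincide; no direction defined")
    return [d / norm for d in diff]


def project(embedding: Vector, direction: Vector) -> float:
    """Scalar position of an embedding along the (unit) safety direction."""
    return sum(e * d for e, d in zip(embedding, direction))


def annotate_margin(chosen: Vector, rejected: Vector, direction: Vector) -> float:
    """Margin = how much further along the safety direction `chosen` lies."""
    return project(chosen, direction) - project(rejected, direction)


# Toy 2-D embeddings (hypothetical data): safety varies along the first axis.
harmless = [[1.0, 0.0], [3.0, 0.0]]
harmful = [[-1.0, 0.0], [-3.0, 0.0]]

direction = safety_direction(harmless, harmful)
margin = annotate_margin([2.0, 5.0], [-1.0, 5.0], direction)
print(direction, margin)
```

Because the margin is a difference of projections, components of the embeddings orthogonal to the safety direction (the second axis in the toy data) do not affect it, and the whole procedure needs only forward passes to obtain embeddings, with no additional training.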

Code & Models

Repositories

colfeng/legend
PyTorch · Official

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Data Quality and Management · Natural Language Processing Techniques