Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng Zhang, Wenqiang Lei

TL;DR
This paper introduces Legend, a cost-effective framework that uses representation engineering to automatically annotate safety margins in preference datasets, enhancing reward modeling and harmless alignment for large language models.
Contribution
Legend is the first framework to leverage a safety direction in embedding space for automatic margin annotation, improving dataset quality without additional training.
Findings
Legend's margin annotations are effective for both reward modeling and harmless alignment.
Annotation needs only inference-time computation, with no extra training.
Scalable and easy to implement.
Abstract
The success of the reward model in distinguishing between responses with subtle safety differences depends critically on the high-quality preference dataset, which should capture the fine-grained nuances of harmful and harmless responses. This motivates the need to develop a dataset involving preference margins, which accurately quantify how harmless one response is compared to another. In this paper, we take the first step to propose an effective and cost-efficient framework to promote the margin-enhanced preference dataset development. Our framework, Legend, Leverages representation engineering to annotate preference datasets. It constructs the specific direction within the LLM's embedding space that represents safety. By leveraging this safety direction, Legend can then leverage the semantic distances of paired responses along this direction to annotate margins automatically. We…
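The margin idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the safety direction is the normalized difference between the mean embeddings of harmless and harmful examples, which is one common representation-engineering choice, and that embeddings have already been extracted from an LLM. The function names `safety_direction` and `annotate_margin` and the toy 2-D vectors are hypothetical.

```python
import numpy as np


def safety_direction(harmless_embs: np.ndarray, harmful_embs: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean harmful embedding to the mean harmless one.

    Assumption: a difference-of-means direction; the paper may construct it differently.
    """
    diff = harmless_embs.mean(axis=0) - harmful_embs.mean(axis=0)
    return diff / np.linalg.norm(diff)


def annotate_margin(chosen_emb: np.ndarray, rejected_emb: np.ndarray,
                    direction: np.ndarray) -> float:
    """Signed distance between two responses, measured along the safety direction."""
    return float((chosen_emb - rejected_emb) @ direction)


# Toy 2-D "embeddings" standing in for LLM hidden states (illustrative values only).
harmless = np.array([[2.0, 0.1], [1.8, -0.1]])
harmful = np.array([[-2.0, 0.2], [-1.8, -0.2]])
direction = safety_direction(harmless, harmful)

# One preference pair: the chosen response lies further along the safety direction.
chosen = np.array([1.5, 0.3])
rejected = np.array([-0.5, 0.9])
margin = annotate_margin(chosen, rejected, direction)
print(f"direction={direction}, margin={margin:.2f}")
```

A larger positive margin means the chosen response sits further toward the harmless side than the rejected one. Computing it takes only forward passes to obtain embeddings, which matches the inference-only cost noted in the findings.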
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Data Quality and Management · Natural Language Processing Techniques
