Course-Correction: Safety Alignment Using Synthetic Preferences

Rongwu Xu; Yishuo Cai; Zhenhong Zhou; Renjie Gu; Haiqin Weng; Yan Liu; Tianwei Zhang; Wei Xu; Han Qiu

arXiv:2407.16637 · cs.CL · October 29, 2024

TL;DR

This paper introduces a benchmark to evaluate, and a synthetic preference dataset to improve, large language models' ability to steer away from generating harmful content. Preference-based fine-tuning on this dataset improves safety without significantly affecting general performance.

Contribution

It presents the C2-Eval benchmark and the C2-Syn synthetic dataset, and demonstrates that preference learning improves safety and course-correction in LLMs.

Findings

01 Enhanced safety and course-correction skills in LLMs.

02 Improved resistance to jailbreak attacks.

03 No significant impact on general performance.

Abstract

The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of course-correction, i.e., the model can steer away from generating harmful content autonomously. To start with, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, Llama2-Chat 7B and…

Code & Models

Repositories

pillowsofwind/course-correction (PyTorch, official)

Videos

Course-Correction: Safety Alignment Using Synthetic Preferences (Underline)