TL;DR
This paper introduces a benchmark and a synthetic dataset to evaluate and improve large language models' ability to steer away from harmful content through preference-based fine-tuning, enhancing safety without sacrificing performance.
Contribution
It presents the C2-Eval benchmark and the synthetic dataset C2-Syn, and demonstrates that preference learning improves safety and course-correction in LLMs.
Findings
Enhanced safety and course-correction skills in LLMs.
Improved resistance to jailbreak attacks.
No significant impact on general performance.
Abstract
The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of course-correction, i.e., the model can steer away from generating harmful content autonomously. To start with, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, Llama2-Chat 7B and…
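To make the idea of "pairwise preferences for timely course-correction" concrete, here is a minimal sketch of how such pairs could be assembled. It is not the paper's pipeline: the names `PreferencePair`, `correct_at`, and `build_pairs`, the word-level cut points, and the fixed corrective phrase are all hypothetical, and the ordering rule (an earlier correction is preferred over a later one, and any correction is preferred over none) is an assumption drawn only from the abstract's emphasis on timeliness.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One pairwise preference record (hypothetical schema)."""
    prompt: str
    chosen: str    # response that course-corrects earlier
    rejected: str  # response that corrects later, or never


def correct_at(draft_words, k, correction):
    """Keep the first k words of the draft, then switch to the correction."""
    return " ".join(draft_words[:k] + [correction])


def build_pairs(prompt, draft_response, correction, cut_points):
    """Build pairs in which the earlier correction is always the chosen side.

    Assumption: cut points are word offsets into the draft response.
    """
    words = draft_response.split()
    # Smaller cut point means the correction happens earlier.
    cuts = sorted({k for k in cut_points if 0 <= k <= len(words)})
    candidates = [correct_at(words, k, correction) for k in cuts]
    # The uncorrected draft is ranked last (least preferred).
    candidates.append(draft_response)
    # combinations() preserves list order, so the first element of each
    # pair is always the earlier-correcting response.
    return [
        PreferencePair(prompt, earlier, later)
        for earlier, later in combinations(candidates, 2)
    ]


# Placeholder strings stand in for real prompts and responses.
pairs = build_pairs(
    prompt="<prompt>",
    draft_response="w1 w2 w3 w4",
    correction="<correction>",
    cut_points=[1, 3],
)
for pair in pairs:
    print(repr(pair.chosen), "preferred over", repr(pair.rejected))
```

Records of this shape (prompt, chosen, rejected) are the usual input to pairwise preference-learning objectives; which objective the authors use is not stated in the portion of the abstract shown here.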
