Safety-Aware Fine-Tuning of Large Language Models

Hyeong Kyu Choi; Xuefeng Du; Yixuan Li

arXiv:2410.10014 · cs.CL · October 15, 2024

TL;DR

This paper introduces SAFT, a framework that automatically detects and filters harmful samples from the data used to fine-tune large language models, reducing harmfulness by up to 27.8% without manual filtering.

Contribution

SAFT is a novel automatic safety-aware fine-tuning method that leverages subspace information to detect and remove harmful data samples.

Findings

01. Reduces harmfulness by up to 27.8% across models and contamination levels.

02. Demonstrates effectiveness and versatility in practical safety scenarios.

03. Provides insights into the mechanism of harmful data detection.

Abstract

Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences. The choice of datasets for fine-tuning can be diverse, introducing safety concerns regarding the potential inclusion of harmful data samples. Manually filtering or avoiding such samples, however, can be labor-intensive and subjective. To address these difficulties, we propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data, by leveraging a scoring function that exploits the subspace information of harmful and benign samples. Experimental results demonstrate the efficacy of SAFT across different LLMs and varying contamination rates, achieving reductions in harmfulness of up to 27.8%. Going beyond, we delve into the mechanism of our approach and validate its versatility in addressing…
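
The abstract describes a scoring function that uses subspace information to separate harmful from benign samples, but does not spell out its form here. The following is a minimal sketch of one way such a filter could look, not the authors' implementation. It assumes each sample already has an embedding vector, that harmful samples project more strongly onto the dominant direction of the embedding matrix, and that a fixed threshold is available. The names `top_direction`, `harm_scores`, `filter_dataset`, and `threshold` are all hypothetical.

```python
import math
import random


def top_direction(embeddings, iters=100, seed=0):
    """Approximate the top right singular vector of the embedding
    matrix Z (one row per sample) by power iteration on Z^T Z."""
    dim = len(embeddings[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        # proj = Z v, one scalar per sample
        proj = [sum(a * b for a, b in zip(z, v)) for z in embeddings]
        # w = Z^T proj, back in embedding space
        w = [sum(p * z[k] for p, z in zip(proj, embeddings)) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def harm_scores(embeddings, direction):
    """Score each sample by its squared projection onto the direction.
    Assumption: a larger score suggests a more likely harmful sample."""
    return [sum(a * b for a, b in zip(z, direction)) ** 2 for z in embeddings]


def filter_dataset(samples, embeddings, threshold):
    """Keep only samples whose score does not exceed the (hypothetical) threshold."""
    direction = top_direction(embeddings)
    scores = harm_scores(embeddings, direction)
    return [s for s, score in zip(samples, scores) if score <= threshold]


# Toy data: three low-magnitude "benign" embeddings and two "harmful"
# ones that lie far out along the first axis.
samples = ["b1", "b2", "b3", "h1", "h2"]
embeddings = [
    [0.1, 0.0],
    [0.0, 0.1],
    [-0.1, 0.05],
    [2.0, 0.1],
    [-1.9, 0.0],
]
kept = filter_dataset(samples, embeddings, threshold=1.0)
```

In this toy setup the dominant direction is close to the first axis, so the two outlying samples receive high scores and are dropped before fine-tuning. In practice the embeddings would come from the model being fine-tuned, and the threshold would have to be chosen from data; both choices are outside what the abstract specifies.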

Taxonomy

Topics: Topic Modeling