Semantic Loss Guided Data Efficient Supervised Fine Tuning for Safe Responses in LLMs

Yuxiao Lu; Arunesh Sinha; Pradeep Varakantham

arXiv:2412.06843 · cs.CL · December 12, 2024

Open Access

TL;DR

This paper introduces a semantic loss-based method for fine-tuning LLMs to generate safer responses, requiring minimal unsafe data and improving safety without extensive human annotation or reliance on other LLMs.

Contribution

The authors propose a novel semantic loss combined with a negative EMD loss and a lower bound for efficient safety fine-tuning of LLMs using limited unsafe data.

Findings

01. Outperforms baseline methods in safety and data efficiency

02. Requires only a small set of unsafe responses for effective fine-tuning

03. Analyzes effects of over-alignment and language capability degradation

Abstract

Large Language Models (LLMs) that generate unsafe responses to toxic prompts pose a significant problem in applications. While various efforts aim to address this safety concern, previous approaches often demand substantial human data collection or rely on the less dependable option of using another LLM to generate corrective data. In this paper, we address this problem while overcoming the limitation of requiring a large amount of high-quality human data. Our method requires only a small set of unsafe responses to toxic prompts, easily obtained from the unsafe LLM itself. By employing a semantic cost combined with a negative Earth Mover Distance (EMD) loss, we guide the LLM away from generating unsafe responses. Additionally, we propose a novel lower bound for EMD loss, enabling more efficient optimization. Our results demonstrate superior performance and data efficiency compared to baselines,…
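The abstract names the ingredients (a semantic cost, a negative EMD loss) without giving formulas, so the following is only a minimal illustrative sketch and not the authors' formulation. It assumes the semantic cost is cosine distance between token embeddings and that the unsafe target is a single token (a point mass); in that special case all probability mass must move to the target, so the EMD reduces to the expected cost under the model's distribution. The toy embeddings and all function names are hypothetical, and the paper's lower bound is not shown.

```python
import math

def cosine_distance(u, v):
    """Assumed semantic cost: 1 - cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def emd_to_point_mass(probs, embeddings, target_idx):
    """EMD between a token distribution and a point mass on one token.

    With a point-mass target the transport plan is forced (every token's
    mass moves to the target), so the EMD is the expected semantic cost.
    """
    target = embeddings[target_idx]
    return sum(p * cosine_distance(e, target) for p, e in zip(probs, embeddings))

def negative_emd_loss(probs, embeddings, unsafe_idx):
    """Minimizing this value pushes the distribution away from the unsafe token."""
    return -emd_to_point_mass(probs, embeddings, unsafe_idx)

# Hypothetical 3-token vocabulary with 2-d embeddings; token 0 is "unsafe",
# token 1 is semantically close to it, token 2 is semantically far from it.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

near_unsafe = [0.7, 0.3, 0.0]   # mass on the unsafe token and its neighbor
far_from_unsafe = [0.0, 0.1, 0.9]

print(negative_emd_loss(near_unsafe, embeddings, 0))
print(negative_emd_loss(far_from_unsafe, embeddings, 0))
```

Under these assumptions, a distribution that keeps its mass semantically far from the unsafe token receives the lower (better) loss, and shifting mass to a close paraphrase of the unsafe token gains little, which is the intuition behind using a semantic cost rather than an exact-match penalty.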

Taxonomy

Topics: Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications · Fault Detection and Control Systems

Methods: Sparse Evolutionary Training