Semantic Loss Guided Data Efficient Supervised Fine Tuning for Safe Responses in LLMs
Yuxiao Lu, Arunesh Sinha, Pradeep Varakantham

TL;DR
This paper introduces a semantic loss-based method for fine-tuning LLMs to generate safer responses, requiring minimal unsafe data and improving safety without extensive human annotation or reliance on other LLMs.
Contribution
The authors propose a semantic cost combined with a negative Earth Mover Distance (EMD) loss, together with a novel lower bound on the EMD loss that enables more efficient optimization, for safety fine-tuning of LLMs using limited unsafe data.
Findings
Outperforms baseline methods in safety and data efficiency
Requires only a small set of unsafe responses for effective fine-tuning
Analyzes effects of over-alignment and language capability degradation
Abstract
Large Language Models (LLMs) generating unsafe responses to toxic prompts is a significant issue in their applications. While various efforts aim to address this safety concern, previous approaches often demand substantial human data collection or rely on the less dependable option of using another LLM to generate corrective data. In this paper, we take on this problem and overcome the limitation of requiring large amounts of high-quality human data. Our method requires only a small set of unsafe responses to toxic prompts, easily obtained from the unsafe LLM itself. By employing a semantic cost combined with a negative Earth Mover Distance (EMD) loss, we guide the LLM away from generating unsafe responses. Additionally, we propose a novel lower bound for EMD loss, enabling more efficient optimization. Our results demonstrate superior performance and data efficiency compared to baselines,…
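To make the idea of a negative EMD loss with a semantic cost concrete, here is a minimal sketch. It is not the paper's formulation. It assumes cosine distance as the semantic cost, uniform weights over two equal-sized sets of toy embedding vectors, and a brute-force search over permutations (which is exact for uniform, equal-sized sets but only practical for tiny inputs). The function names and the 2-D vectors are hypothetical, and the paper's lower bound is not reproduced here.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """Assumed semantic cost: 1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def emd_uniform(xs, ys):
    """Exact EMD between uniform distributions on two equal-sized vector sets.

    With uniform weights and equal sizes, an optimal transport plan is a
    permutation, so a brute-force search over permutations is exact
    (and only feasible for very small n).
    """
    assert len(xs) == len(ys), "this sketch assumes equal-sized sets"
    n = len(xs)
    cost = [[cosine_distance(x, y) for y in ys] for x in xs]
    best = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n


def negative_emd_loss(generated, unsafe):
    """Minimizing this value maximizes the EMD from the unsafe set."""
    return -emd_uniform(generated, unsafe)


# Hypothetical toy embeddings, for illustration only.
unsafe = [[1.0, 0.0], [0.0, 1.0]]
near = [[0.0, 1.0], [1.0, 0.0]]    # same directions as the unsafe set
far = [[-1.0, 0.0], [0.0, -1.0]]   # opposite directions

loss_near = negative_emd_loss(near, unsafe)
loss_far = negative_emd_loss(far, unsafe)
```

In this toy setup, the set that is semantically farther from the unsafe examples receives the lower loss, which is the direction a gradient-based optimizer would push the model.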
Taxonomy
Topics: Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications · Fault Detection and Control Systems
Methods: Sparse Evolutionary Training
