Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness

Avinash Amballa; Durga Sandeep Saluru; Gayathri Akkinapalli; Abhishek Sureddy; Akshay Kumar Sureddy

arXiv:2412.00074·cs.CL·December 3, 2024

Open Access

TL;DR

This paper presents a method to improve the safety of instruction-tuned language models by incorporating safety instructions, significantly reducing unsafe responses while maintaining helpfulness.

Contribution

The study introduces a safety-focused instruction-tuning approach using Direct Preference Optimization, outperforming previous methods in safety without sacrificing helpfulness.

Findings

01. Safe responses increased from 40% to over 90%

02. DPO outperforms SIT and RAFT on safety tasks

03. The proposed evaluation framework assesses both safety and helpfulness comprehensively

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation. However, these models can inadvertently generate unsafe or biased responses when prompted with problematic inputs, raising significant ethical and practical concerns for real-world deployment. This research addresses the critical challenge of developing language models that generate both helpful and harmless content, navigating the delicate balance between model performance and safety. We demonstrate that incorporating safety-related instructions during the instruction-tuning of pre-trained models significantly reduces toxic responses to unsafe prompts without compromising performance on helpfulness datasets. We found Direct Preference Optimization (DPO) to be particularly effective, outperforming both SIT and RAFT by leveraging both chosen and rejected responses for…
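The abstract's point that DPO uses both chosen and rejected responses can be illustrated with the standard per-pair DPO objective. The following is a minimal sketch, not the authors' implementation: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of each response under the policy and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state a setting
) -> float:
    """Standard DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss rewards widening the gap between chosen and rejected margins.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy prefers the chosen (safe) response more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this formulation the rejected response contributes a term of its own, so a preference pair carries signal that a chosen response alone would not provide.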

Taxonomy

Topics: Human-Automation Interaction and Safety