Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness
Avinash Amballa, Durga Sandeep Saluru, Gayathri Akkinapalli, Abhishek Sureddy, Akshay Kumar Sureddy

TL;DR
This paper presents a method to improve the safety of instruction-tuned language models by incorporating safety instructions, significantly reducing unsafe responses while maintaining helpfulness.
Contribution
The study introduces a safety-focused instruction-tuning approach using Direct Preference Optimization, outperforming previous methods in safety without sacrificing helpfulness.
Findings
Share of safe responses increased from 40% to over 90%
DPO outperforms SIT and RAFT in safety tasks
Proposed evaluation framework assesses safety and helpfulness comprehensively
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation. However, these models can inadvertently generate unsafe or biased responses when prompted with problematic inputs, raising significant ethical and practical concerns for real-world deployment. This research addresses the critical challenge of developing language models that generate both helpful and harmless content, navigating the delicate balance between model performance and safety. We demonstrate that incorporating safety-related instructions during the instruction-tuning of pre-trained models significantly reduces toxic responses to unsafe prompts without compromising performance on helpfulness datasets. We found Direct Preference Optimization (DPO) to be particularly effective, outperforming both SIT and RAFT by leveraging both chosen and rejected responses for…
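To make concrete what "leveraging both chosen and rejected responses" means, here is a minimal sketch of the standard per-example DPO objective in plain Python. It is not the authors' implementation: the function names, the argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of each response under the trained policy and under a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the summary does not state one
) -> float:
    """Standard DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the log-probability of a full response given the prompt.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy prefers the chosen (safe) response
    # over the rejected (unsafe) one by a wider margin than the reference.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy and the reference agree on both responses the margin is zero and the loss equals log 2. Raising the chosen response's likelihood or lowering the rejected response's likelihood both reduce the loss, so the rejected response contributes a training signal of its own, which a method that fine-tunes only on preferred responses does not use.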
Taxonomy
Topics: Human-Automation Interaction and Safety
