Efficient Safety Retrofitting Against Jailbreaking for LLMs
Dario Garcia-Gasulla, Adrian Tormos, Anna Arias-Duart, Daniel Hinjos, Oscar Molina-Sedano, Ashwin Kumar Gururajan, Maria Eugenia Cardello

TL;DR
This paper demonstrates that applying Direct Preference Optimization (DPO) with a new safety dataset, Egida, effectively enhances large language model safety against jailbreaking with minimal data and cost, while maintaining performance on general tasks.
Contribution
It introduces Egida, a comprehensive safety dataset, and shows how DPO can significantly improve LLM safety with low training effort and cost, highlighting the influence of model size and pre-training.
Findings
Models trained with DPO reduce attack success rates by 10-30%.
Safety improvements generalize to unseen topics and attack styles.
Low-cost training (around $20 for large models) effectively enhances safety.
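As a rough illustration of the metric behind the first finding, the sketch below computes an attack success rate (ASR) as the fraction of adversarial prompts whose responses were judged unsafe, then compares two runs. The function names and the toy judgements are hypothetical, not taken from the paper. The summary does not say whether the 10-30% figure is an absolute or a relative reduction, so the sketch reports both.

```python
def attack_success_rate(judgements):
    """Fraction of attack prompts whose response was judged unsafe.

    `judgements` is a list of booleans (True = the attack succeeded).
    Hypothetical helper for illustration only.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(judgements) / len(judgements)


def asr_reduction(before, after):
    """Return (absolute, relative) reduction between two ASR values."""
    absolute = before - after
    relative = absolute / before if before > 0 else 0.0
    return absolute, relative


# Toy, made-up judgements; these are NOT results from the paper.
baseline = [True, True, False, False]    # ASR = 0.5
aligned = [True, False, False, False]    # ASR = 0.25

print(asr_reduction(attack_success_rate(baseline),
                    attack_success_rate(aligned)))
```

In practice the judgements would come from a safety classifier or human labels, which this sketch leaves out.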
Abstract
Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general purpose tasks, and their tendency to over…
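To make concrete how DPO trains on preference data without an explicit reward model, here is a minimal sketch of the standard per-example DPO loss. It takes the summed log-probabilities of a chosen and a rejected response under the policy being trained and under a frozen reference model. The function name, argument names, and the default `beta=0.1` are assumptions for illustration; the summary does not state the hyperparameters or training code used in the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    Each argument is the summed token log-probability of a full response.
    `beta` (assumed default) scales how strongly deviations from the
    reference model are weighted.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy already favors the chosen (safe) response more than the reference.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, which is what lets preference pairs stand in for a separately trained reward model.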
Taxonomy
Topics: Security and Verification in Computing · Cryptography and Data Security · Digital and Cyber Forensics
Methods: Direct Preference Optimization
