Efficient Safety Retrofitting Against Jailbreaking for LLMs

Dario Garcia-Gasulla; Adrian Tormos; Anna Arias-Duart; Daniel Hinjos; Oscar Molina-Sedano; Ashwin Kumar Gururajan; Maria Eugenia Cardello

arXiv:2502.13603·cs.CL·February 26, 2025

Open Access · 4 models · 1 dataset

TL;DR

This paper demonstrates that applying Direct Preference Optimization (DPO) with a new safety dataset, Egida, improves large language model safety against jailbreaking with little data and low cost, while maintaining performance on general tasks.

Contribution

The paper introduces Egida, a comprehensive safety dataset, and shows that DPO can significantly improve LLM safety with low training effort and cost. It also highlights how model size and pre-training influence the result.

Findings

01. Models trained with DPO reduce attack success rates by 10-30%.

02. Safety improvements generalize to unseen topics and attack styles.

03. Low-cost training (around $20 for large models) effectively enhances safety.
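
To make the first finding concrete, here is a minimal sketch of how an attack success rate (ASR) is commonly computed: the fraction of adversarial prompts whose response is judged unsafe. The function name, the boolean judgment lists, and the choice to report the drop in percentage points are illustrative assumptions; this page does not say whether the 10-30% figure is an absolute or a relative reduction, nor which judge produces the labels.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Fraction of adversarial prompts whose response was judged unsafe."""
    if not unsafe_flags:
        raise ValueError("need at least one judged response")
    return sum(unsafe_flags) / len(unsafe_flags)


# Hypothetical judgments for the same 10 jailbreak prompts,
# before and after safety alignment (True = attack succeeded).
before = [True, True, False, True, False, True, False, False, True, False]
after = [True, False, False, False, False, True, False, False, False, False]

asr_before = attack_success_rate(before)
asr_after = attack_success_rate(after)

# Absolute reduction, expressed in percentage points (an assumed convention).
reduction_points = (asr_before - asr_after) * 100
print(f"ASR before: {asr_before:.0%}, after: {asr_after:.0%}, "
      f"reduction: {reduction_points:.0f} points")
```

Comparing the two rates on the same prompt set keeps the before/after numbers directly comparable.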

Abstract

Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general purpose tasks, and their tendency to over…
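
As an illustration of how DPO trains on preference data without an explicit reward model, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities under the trained policy and a frozen reference model. The function name, the example log-probabilities, and `beta = 0.1` are assumptions made for illustration, not values taken from the paper; in a safety setting, the "chosen" response would typically be the safe answer and the "rejected" one the unsafe answer.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Log-ratios of policy to reference act as implicit rewards.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities for one (safe, unsafe) response pair.
loss_at_init = dpo_loss(-12.0, -9.0, -12.0, -9.0)  # policy identical to reference
loss_after = dpo_loss(-10.0, -14.0, -12.0, -9.0)   # policy now favors the safe answer

print(f"loss at init: {loss_at_init:.4f}, after preference shift: {loss_after:.4f}")
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one, which is the signal that replaces a separately trained reward model.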

Code & Models

Models

Datasets

HPAI-BSC/Egida
dataset · 285 dl

Taxonomy

Topics: Security and Verification in Computing · Cryptography and Data Security · Digital and Cyber Forensics

Methods: Direct Preference Optimization