Alignment with Preference Optimization Is All You Need for LLM Safety

Reda Alami; Ali Khalifa Almansoori; Ahmed Alzubaidi; Mohamed El Amine Seddik; Mugariya Farooq; Hakim Hacid

arXiv:2409.07772 · cs.LG · September 13, 2024

Open Access

TL;DR

This paper shows that preference optimization techniques can substantially improve the safety of large language models, reaching near-perfect safety scores at the cost of reduced general capabilities, particularly in math.

Contribution

It demonstrates that preference optimization alone can effectively enhance LLM safety and identifies noise contrastive alignment (Safe-NCA) as the method that best balances safety and performance.

Findings

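1. The global safety score of Falcon 11B, as measured by LlamaGuard 3 8B, rose from 57.64% to 99.90%.

2. Average toxicity benchmark scores in adversarial settings dropped from over 0.6 to less than 0.07.

3. The safety gains came with reduced general capabilities, particularly in math, suggesting a trade-off.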

Abstract

We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in global safety score (from $57.64\%$ to $99.90\%$) as measured by LlamaGuard 3 8B, competing with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over $0.6$ to less than $0.07$. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.
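To make "preference optimization" concrete, here is a minimal sketch of one widely used objective of this family, the Direct Preference Optimization (DPO) loss, for a single preference pair. It is an illustration only: the abstract does not say which objectives were applied beyond naming Safe-NCA, and this is not the Safe-NCA loss. The function name `dpo_loss`, its parameter names, and the default `beta` are assumptions chosen for this example. In a safety setting, the "chosen" response would typically be the safe completion and the "rejected" one the unsafe completion.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed token log-probabilities.

    All names here are illustrative assumptions, not taken from the paper.
    """
    # How much more likely the policy makes each response than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference logit: positive when the policy favors the chosen response
    # more strongly than the reference does.
    logit = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logit)) == softplus(-logit), computed in a numerically stable way.
    x = -logit
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raises the safe (chosen) response and lowers the unsafe (rejected) one:
# the loss falls below the baseline.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Minimizing this quantity pushes the policy toward the preferred response while `beta` controls how far it may drift from the reference model, which is one plausible place for the safety versus capability trade-off noted in the abstract to arise.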

Taxonomy

Topics: Safety Systems Engineering in Autonomy