Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, Atabak Ashfaq, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang, Hiteshi Sharma, Blake Bullwinkel, Martin Pouliot, Amanda Minnich, Shiven Chawla, Solianna Herrera, Shahed Warreth

TL;DR
This paper introduces a safety alignment methodology for Phi-3 language models using an iterative 'break-fix' cycle involving dataset curation, safety post-training, benchmarking, and red teaming to enhance responsible AI performance.
Contribution
It presents a novel iterative safety alignment process for small, deployable language models, improving their safety and alignment through multiple rounds of targeted interventions.
Findings
Iterative safety training improved model alignment across benchmarks.
Red teaming identified vulnerabilities and guided safety improvements.
Models demonstrated enhanced safety behavior in multilingual scenarios.
Abstract
Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks. Finally, we include additional red teaming strategies and evaluations that were used to…
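The "break-fix" cycle described in the abstract can be pictured as a simple loop. The following is a minimal toy sketch, not the authors' implementation: `ToyModel`, `red_team`, `curate_dataset`, `safety_post_train`, `benchmark`, the `budget` parameter, and the harm-area names are all hypothetical stand-ins, and "training" here only records which harm areas have been addressed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    # Harm areas this toy model already handles safely (hypothetical stand-in
    # for real model weights).
    mitigated: set = field(default_factory=set)


def red_team(model, harm_areas):
    """'Break' step: list the harm areas where the toy model is still vulnerable."""
    return [area for area in harm_areas if area not in model.mitigated]


def curate_dataset(vulnerabilities, budget):
    """Curate safety data for at most `budget` identified vulnerabilities per round."""
    return vulnerabilities[:budget]


def safety_post_train(model, dataset):
    """'Fix' step: the toy model becomes robust in every curated harm area."""
    return ToyModel(mitigated=model.mitigated | set(dataset))


def benchmark(model, harm_areas):
    """Toy metric: fraction of harm areas handled safely."""
    return len(model.mitigated & set(harm_areas)) / len(harm_areas)


def break_fix(model, harm_areas, max_rounds=5, budget=2):
    """Repeat red teaming, curation, post-training, and benchmarking."""
    history = [benchmark(model, harm_areas)]
    for _ in range(max_rounds):
        vulnerabilities = red_team(model, harm_areas)
        if not vulnerabilities:
            break  # nothing left to fix in this toy setting
        dataset = curate_dataset(vulnerabilities, budget)
        model = safety_post_train(model, dataset)
        history.append(benchmark(model, harm_areas))
    return model, history


# Placeholder harm-area names, assumed for illustration only.
HARM_AREAS = ["area_a", "area_b", "area_c", "area_d", "area_e"]
final_model, scores = break_fix(ToyModel(), HARM_AREAS)
print(scores)
```

In this sketch the benchmark score can only rise between rounds because the toy "fix" never regresses. A real cycle would have to re-run the full benchmark suite each round precisely because post-training can introduce regressions.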
Code & Models
- 🤗 microsoft/Phi-4-multimodal-instruct · model · 300k dl · ♡ 1576
- 🤗 microsoft/Phi-3.5-mini-instruct · model · 919k dl · ♡ 966
- 🤗 microsoft/MediPhi · model · 4.2k dl · ♡ 19
- 🤗 microsoft/MediPhi-PubMed · model · 155 dl · ♡ 9
- 🤗 microsoft/MediPhi-MedWiki · model · 35 dl · ♡ 3
- 🤗 microsoft/MediPhi-Instruct · model · 4.8k dl · ♡ 61
- 🤗 askalgore/Phi-3.5-mini-instruct-heretic · model · 10 dl · ♡ 1
- 🤗 microsoft/Phi-3.5-MoE-instruct · model · 94k dl · ♡ 571
- 🤗 unsloth/Phi-3.5-mini-instruct · model · 3.9k dl · ♡ 46
- 🤗 unsloth/Phi-3.5-mini-instruct-bnb-4bit · model · 15k dl · ♡ 13
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
