Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning

Guanlin Li; Kangjie Chen; Shangwei Guo; Jie Zhang; Han Qiu; Chao Zhang; Guoyin Wang; Tianwei Zhang; Jiwei Li

arXiv:2502.01116·cs.AI·February 4, 2025

TL;DR

This study investigates how fine-tuning large language models on small datasets can unintentionally harm safety alignment, and it evaluates the reliability of reward models in guiding safe responses, revealing significant limitations.

Contribution

The paper systematically analyzes the factors that affect safety in fine-tuned LLMs and assesses how reliable reward models are, providing insights for safer model development.

Findings

01

Fine-tuning can degrade safety alignment due to answer structure, identity calibration, and role-play.

02

Reward models often fail to accurately reflect human safety preferences.

03

The study offers guidance for balancing utility and safety in LLM fine-tuning.

Abstract

Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets, critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences…
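One common way to quantify whether a reward model reflects human preferences is pairwise agreement: the fraction of human-labeled pairs in which the model scores the preferred response above the rejected one. The sketch below illustrates that general idea only; it is not the evaluation protocol of this paper, which the truncated abstract does not describe. The names `pairwise_agreement`, `PreferencePair`, and `toy_length_scorer` are hypothetical, and the toy scorer merely stands in for a real reward model.

```python
from typing import Callable, Iterable, Tuple

# (prompt, response humans preferred, response humans rejected)
PreferencePair = Tuple[str, str, str]


def pairwise_agreement(
    score: Callable[[str, str], float],
    pairs: Iterable[PreferencePair],
) -> float:
    """Fraction of pairs where the scorer ranks the human-preferred response higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    hits = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return hits / len(pairs)


def toy_length_scorer(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: longer responses score higher."""
    return float(len(response))


if __name__ == "__main__":
    # Illustrative, made-up pairs; not data from the paper.
    demo_pairs = [
        ("q1", "I can't help with that.", "Sure"),
        ("q2", "No.", "Here is how to do it"),
    ]
    print(pairwise_agreement(toy_length_scorer, demo_pairs))
```

A scorer that rewards length agrees with the labels only when the safe answer happens to be longer, which shows how a scoring rule can look reasonable on some pairs while disagreeing with human safety judgments on others.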

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Infrastructure Maintenance and Monitoring · Pharmacy and Medical Practices