Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models

Francisco Eiras; Aleksandar Petrov; Philip H.S. Torr; M. Pawan Kumar; Adel Bibi

arXiv:2406.10288·cs.CL·March 3, 2025

TL;DR

This paper investigates safety risks in task-specific fine-tuning of large language models. It shows how malicious manipulation of a task-specific dataset can induce dangerous model behaviors, and it proposes a mitigation strategy that mixes safety data into the fine-tuning set to restore safety without sacrificing task performance.

Contribution

It introduces a mitigation method that mixes in safety data formatted to mimic the task data, reducing safety risks while preserving task-specific performance.
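As a rough illustration of what "safety data that mimics the task format" could look like in practice, the sketch below renders safety prompts with the same prompt template as the task data and mixes a small fraction of them into the fine-tuning set. This is not the authors' implementation: the function name `mix_safety_data`, the `prompt`/`completion` record layout, the `{question}` template placeholder, and the default 10% mixing ratio are all assumptions made for illustration.

```python
import random


def mix_safety_data(task_examples, safety_examples, task_template,
                    safety_fraction=0.1, seed=0):
    """Return the task examples plus safety examples rendered in the task format.

    Hypothetical helper: records are dicts with "prompt" and "completion" keys,
    and task_template contains a "{question}" placeholder.
    """
    rng = random.Random(seed)

    # Number of safety examples to add, relative to the task set size
    # (the ratio is an assumed hyperparameter, not a value from the paper).
    n_safety = min(len(safety_examples),
                   round(safety_fraction * len(task_examples)))
    sampled = rng.sample(safety_examples, n_safety)

    # Wrap each safety prompt in the task's own template so that the safety
    # examples share the structure of the task prompts.
    reformatted = [
        {
            "prompt": task_template.format(question=ex["prompt"]),
            "completion": ex["completion"],
        }
        for ex in sampled
    ]

    mixed = list(task_examples) + reformatted
    rng.shuffle(mixed)
    return mixed
```

The resulting list can be passed to whatever fine-tuning pipeline is in use. How many safety examples to add, and how closely they should imitate the task format, are choices the paper itself would need to be consulted for.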

Findings

01. Maliciously manipulated datasets can significantly increase dangerous model behaviors.

02. Mixing safety data in task-specific fine-tuning effectively mitigates safety risks.

03. The proposed method outperforms existing baselines in safety and efficiency.

Abstract

Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning - where models are trained on datasets with clear ground truth answers (e.g., multiple choice questions) - can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques