Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt   Templates

Kaifeng Lyu; Haoyu Zhao; Xinran Gu; Dingli Yu; Anirudh Goyal; Sanjeev; Arora

arXiv:2402.18540·cs.LG·January 20, 2025·3 cites

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev, Arora

PDF

Open Access 1 Repo 2 Datasets

TL;DR

This paper investigates how prompt templates influence the safety alignment of LLMs after fine-tuning and introduces the PTST strategy, which improves safety preservation during model adaptation for specific tasks.

Contribution

It reveals the critical role of prompt templates in maintaining alignment and proposes the PTST method, a novel fine-tuning approach that enhances safety in LLMs.

Findings

01

Prompt templates significantly impact safety alignment post-fine-tuning.

02

PTST reduces unsafe behaviors in models across multiple benchmarks.

03

Fine-tuning without safety prompts at training, but including them at testing, improves safety.

Abstract

Public LLMs such as the Llama 2-Chat underwent alignment training and were considered safe. Recently Qi et al. [2024] reported that even benign fine-tuning on seemingly safe datasets can give rise to unsafe behaviors in the models. The current paper is about methods and best practices to mitigate such loss of alignment. We focus on the setting where a public model is fine-tuned before serving users for specific usage, where the model should improve on the downstream task while maintaining alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the ``Pure Tuning, Safe Testing'' (PTST) strategy -- fine-tune models without a safety prompt, but…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

vfleaking/ptst
pytorchOfficial

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTaxation and Legal Issues

MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Focus · Linear Layer · {Dispute@FaQ-s}How to file a dispute with Expedia? · Dropout · Layer Normalization · Cosine Annealing · 15 Ways to Contact How can i speak to someone at Delta Airlines · Attention Dropout