Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

Weixiang Zhao; Yulin Hu; Yang Deng; Jiahe Guo; Xingyu Sui; Xinyang Han; An Zhang; Yanyan Zhao; Bing Qin; Tat-Seng Chua; Ting Liu

arXiv:2502.20968·cs.CL·May 28, 2025

TL;DR

This paper assesses the safety risks of role-play fine-tuning for LLMs, finds that such fine-tuning degrades safety performance, and proposes SaRFT, a method that balances role-playing capability with safety, validated on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct.

Contribution

The paper provides the first comprehensive evaluation of the safety risks of role-play fine-tuning and introduces SaRFT, a novel method that mitigates these risks while maintaining role-play performance.

Findings

01. Role-play fine-tuning reduces safety performance.

02. Safety risks vary with character traits.

03. SaRFT outperforms existing baselines in safety and role ability.

Abstract

Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms…
