The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

Yihong Tang; Kehai Chen; Xuefeng Bai; Zhengyu Niu; Bo Wang; Jie Liu; Min Zhang

arXiv:2502.20757·cs.CL·March 3, 2025

TL;DR

This paper investigates the safety-utility trade-off in LLM-powered role-playing dialogue agents and proposes adaptive methods that balance the two in risky scenarios, improving safety metrics while maintaining utility.

Contribution

It introduces the Adaptive Dynamic Multi-Preference (ADMP) method and Coupling Margin Sampling (CMS) to dynamically balance safety and utility in role-playing dialogue agents.

Findings

01. Improved safety metrics in risky scenarios.

02. Maintained utility in character simulations.

03. Effective handling of high-risk interactions.

Abstract

Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce…
