Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing

Wenhao Liu; Siyu An; Junru Lu; Muling Wu; Tianlong Li; Xiaohua Wang; Changze lv; Xiaoqing Zheng; Di Yin; Xing Sun; Xuanjing Huang

arXiv:2409.16913·cs.AI·June 16, 2025

Open Access · 3 Reviews

TL;DR

This paper develops an evaluation benchmark and a representation editing method to improve role-playing agents' ability to recognize and refuse conflicting requests, enhancing their refusal accuracy without losing role-playing performance.

Contribution

It introduces a novel benchmark for conflict detection in RPAs and proposes a lightweight representation editing technique to improve refusal capabilities.

Findings

01

RPAs show significant performance gaps on conflicting requests.

02

Representation analysis reveals rejection and response regions affecting behavior.

03

The editing method effectively increases refusal accuracy while preserving role-playing abilities.
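The idea of nudging a representation toward a "rejection region" can be illustrated with a generic difference-of-means shift. This is only a minimal sketch under stated assumptions: the summary above does not specify the paper's actual editing procedure, so the function names, the `alpha` strength parameter, and the toy two-dimensional vectors are all hypothetical stand-ins for real hidden states.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steering_direction(reject_reps, respond_reps):
    """Unit vector from the mean 'respond' representation to the mean 'reject' one."""
    diff = [a - b for a, b in zip(mean_vector(reject_reps), mean_vector(respond_reps))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def edit_representation(hidden, direction, alpha):
    """Shift a hidden state by alpha along the steering direction (assumed edit rule)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy stand-ins for hidden states; real ones would be read from a model layer.
respond_reps = [[0.0, 0.1], [0.2, -0.1]]
reject_reps = [[1.0, 0.1], [1.2, -0.1]]

direction = steering_direction(reject_reps, respond_reps)
hidden = respond_reps[0]
edited = edit_representation(hidden, direction, alpha=0.5)

# The edited state lies further along the reject direction than the original.
print(dot(hidden, direction), dot(edited, direction))
```

In this sketch, a larger `alpha` pushes the state further toward the refusing cluster, which is where a trade-off against over-refusal on non-conflicting requests would show up.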

Abstract

Role-Playing Agents (RPAs) have shown remarkable performance in various applications, yet they often struggle to recognize and appropriately respond to hard queries that conflict with their role-play knowledge. To investigate RPAs' performance when faced with different types of conflicting requests, we develop an evaluation benchmark that includes requests conflicting with contextual knowledge, requests conflicting with parametric knowledge, and non-conflicting requests, to assess RPAs' ability to identify conflicts and refuse to answer appropriately without over-refusing. Through extensive evaluation, we find that most RPAs exhibit significant performance gaps across different types of conflicting requests. To elucidate the reasons, we conduct an in-depth representation-level analysis of RPAs under various conflict scenarios. Our findings reveal the existence of rejection regions and direct response regions…
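The three request types named in the abstract suggest a per-category refusal tally: a high refusal rate is desirable on the two conflicting categories, while any refusal on non-conflicting requests counts as over-refusal. The following is a rough sketch, not the benchmark's actual scoring: the category labels, the record format, and the assumption that each response has already been judged as refused or not are all hypothetical.

```python
from collections import defaultdict

# Hypothetical labels for the three request types described in the abstract.
CONFLICT_CATEGORIES = {"contextual_conflict", "parametric_conflict"}

def refusal_rates(records):
    """Return the fraction of refused requests per category.

    records: iterable of (category, refused) pairs, where refused is a bool
    produced by some upstream judge (not modeled here).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [refused, total]
    for category, refused in records:
        counts[category][0] += int(refused)
        counts[category][1] += 1
    return {c: refused / total for c, (refused, total) in counts.items()}

records = [
    ("contextual_conflict", True),
    ("contextual_conflict", False),
    ("parametric_conflict", True),
    ("non_conflicting", False),
    ("non_conflicting", True),   # an over-refusal
    ("non_conflicting", False),
    ("non_conflicting", False),
]

rates = refusal_rates(records)
for category, rate in sorted(rates.items()):
    wanted = "refuse" if category in CONFLICT_CATEGORIES else "answer"
    print(f"{category}: refusal rate {rate:.2f} (desired behavior: {wanted})")
```

Reporting the categories separately, rather than as one aggregate, keeps a model that refuses everything from looking good.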

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Refusal capability does appear to be an important issue, and it is interesting to see work in this direction to improve RPAs' performance. The paper is well written: the study follows a step-by-step procedure, from evaluation design to comparison and, finally, to methods for improving the RPAs. It is well organized and easy to follow, with many examples to ease understanding. Given the examples provided in the paper, it is convincing how the work introduced could help on the RPA

Weaknesses

The research methodology is fairly straightforward; what the authors intend to do is simple, and they achieve it through a sound process. For the same reason, it is not obvious what the challenges of this study are. The representation editing method intervenes on the representations generated by the model to enhance refusal ability in conflicting cases. It is compared with several fine-tuning methods designed for LLMs. But essentially, it may not be of the same nature as the compared method

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper studies an interesting and important problem. Enhancing RPAs' ability to refuse questions they cannot answer could have important implications for various applications, such as virtual assistants and game design. 2. I like the representation analysis part. I believe it is a novel finding to identify "rejection regions" and "direct response regions". The analysis provides adequate motivation for the proposed representation editing method. 3. The authors provide extensive experiments to demo

Weaknesses

1. The paper could benefit from user-centric studies that evaluate the real-world impact of enhanced refusal capabilities. 2. While the empirical findings are strong, the theoretical underpinning of the rejection and response regions may require further exploration.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Proposes a well-motivated benchmark. 2. The data construction pipeline is plausible. 3. Conducts interpretability experiments to analyze the results. 4. Develops a model editing method based on the representation discoveries. In general, this paper raises interesting research questions and also conducts in-depth analysis.

Weaknesses

1. The paper does not analyze how much the editing method affects unrelated questions, such as questions independent of the role-play. The general accuracy of the thresholding method therefore needs a comprehensive analysis.

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Topic Modeling · Reinforcement Learning in Robotics