OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas

Xiaoyang Wang; Hongming Zhang; Tao Ge; Wenhao Yu; Dian Yu; Dong Yu

arXiv:2501.15427·cs.CL·February 19, 2025

TL;DR

This paper introduces a large-scale synthetic data approach for training customizable role-playing LLMs. The resulting model generalizes to new characters and achieves performance comparable to GPT-4o models on role-playing dialogue.

Contribution

The paper proposes a data synthesis method that builds character profiles from Persona Hub personas and applies two strategies, response rewriting and response generation, to produce character-aligned instructional responses. The resources are publicly released.

Findings

01

The best-performing fine-tuned model outperforms the original LLaMA-3 8B Instruct model on role-playing tasks.

02

Synthetic data improves character consistency and dialogue quality.

03

Role-playing dialogue performance is comparable to GPT-4o models.

Abstract

Customizable role-playing in large language models (LLMs), also known as character generalization, is gaining increasing attention for its versatility and cost-efficiency in developing and deploying role-playing dialogue agents. This study explores a large-scale data synthesis approach to equip LLMs with character generalization capabilities. We begin by synthesizing large-scale character profiles using personas from Persona Hub and then explore two strategies: response rewriting and response generation, to create character-aligned instructional responses. To validate the effectiveness of our synthetic instruction tuning data for character generalization, we perform supervised fine-tuning (SFT) using the LLaMA-3 8B model. Our best-performing model strengthens the original LLaMA-3 8B Instruct model and achieves performance comparable to GPT-4o models on role-playing dialogue. We release…
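The two strategies named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the prompt wording, the `SFTExample` record, the `synthesize_example` helper, and the choice to store the profile as a system prompt are hypothetical and are not taken from the paper. The `llm` argument stands in for any text-in, text-out model call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt templates; the paper's actual prompts are not shown in this listing.
PROFILE_PROMPT = (
    "Write a character profile (name, background, personality, speaking style) "
    "for this persona:\n{persona}"
)
REWRITE_PROMPT = (
    "Character profile:\n{profile}\n\nInstruction:\n{instruction}\n\n"
    "Original response:\n{response}\n\n"
    "Rewrite the response in this character's voice, keeping its content."
)
GENERATE_PROMPT = (
    "Character profile:\n{profile}\n\nInstruction:\n{instruction}\n\n"
    "Answer the instruction as this character."
)


@dataclass
class SFTExample:
    """One synthetic training record (assumed layout: profile as system prompt)."""
    system: str
    instruction: str
    response: str


def synthesize_example(
    persona: str,
    instruction: str,
    original_response: str,
    llm: Callable[[str], str],
    strategy: str = "generate",
) -> SFTExample:
    """Build one character-aligned example with either strategy."""
    # Step 1: expand a short persona into a fuller character profile.
    profile = llm(PROFILE_PROMPT.format(persona=persona))

    # Step 2: produce a character-aligned response.
    if strategy == "rewrite":
        # Response rewriting: restyle an existing answer in the character's voice.
        prompt = REWRITE_PROMPT.format(
            profile=profile, instruction=instruction, response=original_response
        )
    elif strategy == "generate":
        # Response generation: answer from scratch as the character.
        prompt = GENERATE_PROMPT.format(profile=profile, instruction=instruction)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return SFTExample(system=profile, instruction=instruction, response=llm(prompt))


def stub_llm(prompt: str) -> str:
    """Placeholder model so the sketch runs without any external service."""
    return f"[model output for a {len(prompt)}-character prompt]"
```

Calling `synthesize_example("a retired sailor", "Explain tides.", "Tides are caused by the Moon.", stub_llm, strategy="rewrite")` returns one record. The only difference between the two strategies in this sketch is whether the original response is included in the prompt.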

Code & Models

Datasets

xywang1/OpenCharacter
dataset · 153 downloads

Taxonomy

Topics: Persona Design and Applications · Innovative Human-Technology Interaction · AI in Service Interactions

Methods: Softmax · Attention Is All You Need