CharacterBench: Benchmarking Character Customization of Large Language Models

Jinfeng Zhou; Yongkang Huang; Bosi Wen; Guanqun Bi; Yuxuan Chen; Pei Ke; Zhuang Chen; Xiyao Xiao; Libiao Peng; Kuntian Tang; Rongsheng Zhang; Le Zhang; Tangjie Lv; Zhipeng Hu; Hongning Wang; Minlie Huang

arXiv:2412.11912·cs.CL·December 17, 2024

TL;DR

CharacterBench is a bilingual benchmark of 22,859 human-annotated samples for evaluating and improving how well large language models customize characters in dialogue. It addresses a limitation of earlier benchmarks, which often cover only a single character category or evaluate a limited set of dimensions.

Contribution

We introduce CharacterBench, the largest bilingual generative benchmark for character customization in LLMs, together with CharacterJudge, a model for efficient evaluation of character-based dialogue.

Findings

01. CharacterJudge outperforms GPT-4 in evaluation accuracy.
02. CharacterBench covers 3,956 characters across 25 categories.
03. Our methods improve LLMs' character customization performance.

Abstract

Character-based dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs' character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response. We enable effective and efficient…

Code & Models

Repositories

thu-coai/characterbench (official)

Models

thu-coai/CharacterJudge (Hugging Face model; 22 downloads, 4 likes)

Videos

CharacterBench: Benchmarking Character Customization of Large Language Models · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods