A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Li Li; Peilin Cai; Ryan A. Rossi; Franck Dernoncourt; Branislav Kveton; Junda Wu; Tong Yu; Linxin Song; Tiankai Yang; Yuehan Qin; Nesreen K. Ahmed; Samyadeep Basu; Subhojyoti Mukherjee; Ruiyi Zhang; Zhengmian Hu; Bo Ni; Yuxiao Zhou; Zichao Wang; Yue Huang; Yu Wang; Xiangliang Zhang; Philip S. Yu; Xiyang Hu; Yue Zhao

arXiv:2505.14106·cs.CL·May 27, 2025

TL;DR

This paper introduces PersonaConvBench, a comprehensive benchmark for evaluating personalized multi-turn conversations in large language models, combining personalization and conversational structure across diverse domains.

Contribution

It presents a new benchmark integrating personalization and conversation structure, with tasks and datasets for systematic evaluation of LLMs in personalized dialogue scenarios.

Findings

01. Personalized history significantly improves LLM performance.

02. Personalization yields a 198% relative gain in sentiment classification over the best non-conversational baseline.

03. The benchmark facilitates research on adaptive, context-aware LLMs.

Abstract

We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentiment classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing…
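For readers checking the headline number, a relative gain is conventionally computed as (new - baseline) / baseline. The sketch below assumes that convention. The function name and the two scores are hypothetical and chosen only to illustrate the arithmetic; they are not values reported in the paper.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, expressed in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores, NOT taken from the paper: they only show what a
# gain of about 198 percent looks like numerically.
gain = relative_gain_percent(baseline=0.25, improved=0.745)
print(f"{gain:.1f}% relative gain")
```

Under this convention, a 198 percent relative gain means the personalized score is roughly 2.98 times the baseline score, not 1.98 times.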

Peer Reviews

Decision·ICLR 2026 Conference·Withdrawn Submission

Reviewer 01·Rating 2·Confidence 3

Strengths

The paper constructs the first benchmark that jointly models personalization and multi-turn dialogue, enabling systematic evaluation of LLMs’ ability to adapt to user-specific styles and evolving conversational context. By representing multi-user conversations as directed temporal graphs, the benchmark captures realistic branching, temporal ordering, and inter-user dependencies—allowing for fine-grained personalization and contextual reasoning that go beyond flat dialogue datasets.
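As a rough illustration of the structure this reviewer describes, the sketch below stores a thread as a set of timestamped comments with reply edges, then derives one branch and one user's earlier history from it. Every name here (`Comment`, `thread_to_root`, `user_history_before`) and the toy data are assumptions made for illustration; the benchmark's actual schema is not shown in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Comment:
    comment_id: str
    author: str
    timestamp: int             # e.g. seconds since epoch (assumed unit)
    parent_id: Optional[str]   # None for the root post
    text: str


def thread_to_root(comments: Dict[str, Comment], leaf_id: str) -> List[Comment]:
    """Follow reply edges from a comment up to the root; return root-first."""
    path: List[Comment] = []
    current = comments.get(leaf_id)
    while current is not None:
        path.append(current)
        current = comments.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


def user_history_before(comments: Dict[str, Comment], author: str, cutoff: int) -> List[Comment]:
    """Comments by `author` strictly earlier than `cutoff`, oldest first."""
    history = [c for c in comments.values()
               if c.author == author and c.timestamp < cutoff]
    return sorted(history, key=lambda c: c.timestamp)


# Toy thread with two branches under one post (hypothetical data).
toy = {c.comment_id: c for c in [
    Comment("p1", "alice", 1, None, "Original post"),
    Comment("c1", "bob", 2, "p1", "First reply"),
    Comment("c2", "carol", 3, "p1", "Another branch"),
    Comment("c3", "alice", 4, "c1", "Reply to bob"),
]}

branch = thread_to_root(toy, "c3")             # p1 -> c1 -> c3
history = user_history_before(toy, "alice", 4)  # only p1 precedes time 4
```

The strict `timestamp < cutoff` filter is one way to honor temporal ordering, so that a user's history never includes text written after the turn being predicted.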

Weaknesses

Problem formulation lacks clarity: The notation is underspecified; in particular, while Cu (the user trajectory set) is later defined, the meaning of f is not clearly introduced where it first appears. This makes it difficult to precisely understand what constitutes the model input.

Ambiguity in task setup and visibility scope: It is unclear whether the model has access to all users’ conversational trajectories or only those of the participants in the current dialogue. In real conversations,

Reviewer 02·Rating 6·Confidence 4

Strengths

+ Extensive, real-world dataset spans 10 Reddit domains (19,215 posts, ~111,239 conversations, 3,878 users), providing scale and diversity for robust evaluation of personalized conversational models.
+ Novel formulation combines graph-structured multi-user, multi-turn conversations with three tasks (sentiment classification, impact regression, and user-centric next-text generation), plus standardized in-context prompting and evaluation protocols.
+ Comprehensive LLM benchmarks reveal large personal

Weaknesses

- The paper measures personalization mostly via performance deltas (P-Conv vs P-NonConv) and paired t-tests, rather than a direct “degree of personalization” metric or richer human judgments.
- Heavy Reddit preprocessing (Nu, Nr, Np thresholds) and class-imbalance filtering (initial ~11:1 skew reduced to ~5:1) retained only ~6k sentiment posts. Removal of deleted/short posts creates selection bias toward highly active users, reducing representativeness and real-world robustness.
- Experiments run
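The paired t-test mentioned in the first weakness compares two settings on the same items by testing whether the mean per-item difference is zero. A minimal standard-library sketch of the test statistic follows. The function name and the per-item scores are hypothetical, and this is not presented as the authors' exact procedure.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-item scores for the two settings named in the review.
p_conv = [0.6, 0.7, 0.8, 0.9]
p_nonconv = [0.5, 0.5, 0.5, 0.5]
t_stat = paired_t_statistic(p_conv, p_nonconv)  # n - 1 = 3 degrees of freedom
```

A large t statistic indicates that the two settings differ consistently across items, but it does not say how personalized the outputs are, which is the gap the reviewer points to.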

Reviewer 03·Rating 2·Confidence 5

Strengths

1. The paper presents a multi-turn personalized dialogue benchmark derived from Reddit posts.

Weaknesses

1. The paper provides limited ablation studies to support its experimental findings.
2. The dataset curation process based on Reddit data is not particularly novel.
3. Although the paper emphasizes conversational personalization, there is little evidence of incorporating personalization signals beyond dialogue history in the response generation process.

Reviewer 04·Rating 4·Confidence 3

Strengths

- The benchmark’s focus on evaluating LLM personalization using users’ past interaction histories is both well-motivated and highly realistic.
- The proposed dataset is large-scale and diverse, spanning 10 domains and encompassing varied conversation styles. Its thoughtful construction incorporates temporal constraints and a graph-based representation of conversations.
- Extensive experiments yield strong empirical results: leveraging users’ past interaction histories and dialog context consiste

Weaknesses

- Evaluation Metrics: The evaluation of Personalized Text Generation primarily relies on n-gram overlap metrics and SBERT scores, using only a single reference response. Given the open-ended nature of dialog, there may be multiple valid responses for a given context, making these metrics potentially insufficient for capturing the full range of appropriate outputs. Additionally, the absence of human evaluation limits the assessment of response quality and relevance.
- Research Findings: The resul
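The single-reference concern in the first weakness can be illustrated with a toy overlap score. The sketch below uses a simplified unigram F1 as a stand-in for n-gram overlap metrics. It is not the paper's metric implementation, and the function name and example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over whitespace-separated, lowercased unigram overlap."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "thanks that fixed my problem"
paraphrase = "great this solved the issue"   # plausible reply, no shared words
near_copy = "thanks that fixed my issue"     # shares 4 of 5 words

score_paraphrase = unigram_f1(paraphrase, reference)
score_near_copy = unigram_f1(near_copy, reference)
```

In this toy case the paraphrase scores zero while the near copy scores high, even though both are reasonable replies. That is the failure mode the reviewer attributes to overlap metrics computed against a single reference.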

Code & Models

Repositories

persona-bench/persona·PyTorch·Official

Taxonomy

Topics·Speech and dialogue systems