A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations
Li Li, Peilin Cai, Ryan A. Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, Nesreen K. Ahmed, Samyadeep Basu, Subhojyoti Mukherjee, Ruiyi Zhang, Zhengmian Hu, Bo Ni, Yuxiao Zhou, Zichao Wang, Yue Huang, Yu Wang

TL;DR
This paper introduces PersonaConvBench, a comprehensive benchmark for evaluating personalized multi-turn conversations in large language models, combining personalization and conversational structure across diverse domains.
Contribution
It presents a new benchmark integrating personalization and conversation structure, with tasks and datasets for systematic evaluation of LLMs in personalized dialogue scenarios.
Findings
Personalized history improves LLM performance significantly.
198% relative gain over the best non-conversational baseline in sentiment classification when personalized history is included.
Benchmark facilitates research on adaptive, context-aware LLMs.
Abstract
We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentiment classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing…
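The "198 percent relative gain" can be read with the usual relative-improvement formula. The sketch below assumes that standard definition; the function name and the two scores are hypothetical, chosen only to illustrate the arithmetic, and are not figures from the paper.

```python
def relative_gain(personalized: float, baseline: float) -> float:
    """Relative gain of `personalized` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (personalized - baseline) / abs(baseline) * 100.0


# Hypothetical scores (NOT taken from the paper): a 198% relative gain
# means the personalized score is roughly 2.98x the baseline score.
baseline_score = 0.25
personalized_score = 0.745
gain = relative_gain(personalized_score, baseline_score)
print(f"relative gain: {gain:.1f}%")
```

Under this reading, a relative gain above 100% means the score more than doubled, so the metric is sensitive to how low the baseline is.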
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper constructs the first benchmark that jointly models personalization and multi-turn dialogue, enabling systematic evaluation of LLMs’ ability to adapt to user-specific styles and evolving conversational context. By representing multi-user conversations as directed temporal graphs, the benchmark captures realistic branching, temporal ordering, and inter-user dependencies—allowing for fine-grained personalization and contextual reasoning that go beyond flat dialogue datasets.
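A directed temporal graph of a multi-user conversation could be sketched as follows. This is an illustrative data structure, not the benchmark's actual schema: the class names, fields, and helper methods are all assumptions. Each reply points to its parent, timestamps enforce temporal ordering, and a user's history is restricted to turns that precede a given time.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Turn:
    turn_id: str
    user: str
    timestamp: int  # e.g. seconds since the original post (assumed unit)
    text: str


@dataclass
class ConversationGraph:
    """Hypothetical reply graph: edges run from a parent turn to its replies."""
    turns: dict = field(default_factory=dict)    # turn_id -> Turn
    parent: dict = field(default_factory=dict)   # turn_id -> parent id or None
    children: dict = field(default_factory=lambda: defaultdict(list))

    def add_turn(self, turn, parent_id=None):
        if parent_id is not None:
            if parent_id not in self.turns:
                raise KeyError(f"unknown parent: {parent_id}")
            if turn.timestamp < self.turns[parent_id].timestamp:
                raise ValueError("a reply cannot precede its parent")
            self.children[parent_id].append(turn.turn_id)
        self.turns[turn.turn_id] = turn
        self.parent[turn.turn_id] = parent_id

    def thread_to(self, turn_id):
        """Return the root-to-turn path, i.e. the context of one branch."""
        path = []
        current = turn_id
        while current is not None:
            path.append(self.turns[current])
            current = self.parent[current]
        return path[::-1]

    def user_history(self, user, before):
        """Turns written by `user` strictly earlier than `before`."""
        past = [t for t in self.turns.values()
                if t.user == user and t.timestamp < before]
        return sorted(past, key=lambda t: t.timestamp)


g = ConversationGraph()
g.add_turn(Turn("p1", "alice", 0, "Original post"))
g.add_turn(Turn("c1", "bob", 5, "First reply"), parent_id="p1")
g.add_turn(Turn("c2", "carol", 7, "Second reply"), parent_id="p1")  # a branch
g.add_turn(Turn("c3", "alice", 9, "Reply to bob"), parent_id="c1")

thread = [t.turn_id for t in g.thread_to("c3")]
history = [t.turn_id for t in g.user_history("alice", before=9)]
```

Filtering history by timestamp keeps later turns out of the context for an earlier prediction, which is one way to respect the temporal ordering the review describes.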
Problem formulation lacks clarity: The notation is underspecified. In particular, while Cu (the user trajectory set) is later defined, the meaning of f is not clearly introduced where it first appears. This makes it difficult to precisely understand what constitutes the model input.
Ambiguity in task setup and visibility scope: It is unclear whether the model has access to all users’ conversational trajectories or only those of the participants in the current dialogue. In real conversations,
+ Extensive, real-world dataset spans 10 Reddit domains (19,215 posts, ~111,239 conversations, 3,878 users), providing scale and diversity for robust evaluation of personalized conversational models.
+ Novel formulation combines graph-structured multi-user, multi-turn conversations with three tasks (sentiment classification, impact regression, and user-centric next-text generation), plus standardized in-context prompting and evaluation protocols.
+ Comprehensive LLM benchmarks reveal large personal
- The paper measures personalization mostly via performance deltas (P-Conv vs P-NonConv) and paired t-tests, rather than a direct “degree of personalization” metric or richer human judgments.
- Heavy Reddit preprocessing (Nu, Nr, Np thresholds) and class-imbalance filtering (initial ~11:1 skew reduced to ~5:1) retained only ~6k sentiment posts. Removal of deleted/short posts creates selection bias toward highly active users, reducing representativeness and real-world robustness.
- Experiments run
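For context, the paired t-test mentioned in this review compares two settings on the same evaluation units. The sketch below computes only the t statistic and degrees of freedom with the standard library; the per-user scores are hypothetical placeholders, not results from the paper, and the choice of users as the pairing unit is an assumption.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-user scores for the two settings (illustrative only).
p_conv = [0.62, 0.58, 0.71, 0.66, 0.60]
p_nonconv = [0.55, 0.57, 0.60, 0.61, 0.52]
t_stat, dof = paired_t_statistic(p_conv, p_nonconv)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

A p-value would then come from a t-distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_rel`. As the reviewer notes, a significant delta shows that one setting scores higher, not how personalized the outputs are.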
1. The paper presents a multi-turn personalized dialogue benchmark derived from Reddit posts.
1. The paper provides limited ablation studies to support its experimental findings.
2. The dataset curation process based on Reddit data is not particularly novel.
3. Although the paper emphasizes conversational personalization, there is little evidence of incorporating personalization signals beyond dialogue history in the response generation process.
- The benchmark’s focus on evaluating LLM personalization using users’ past interaction histories is both well-motivated and highly realistic.
- The proposed dataset is large-scale and diverse, spanning 10 domains and encompassing varied conversation styles. Its thoughtful construction incorporates temporal constraints and a graph-based representation of conversations.
- Extensive experiments yield strong empirical results: leveraging users’ past interaction histories and dialog context consiste
- Evaluation Metrics: The evaluation of Personalized Text Generation primarily relies on n-gram overlap metrics and SBERT scores, using only a single reference response. Given the open-ended nature of dialog, there may be multiple valid responses for a given context, making these metrics potentially insufficient for capturing the full range of appropriate outputs. Additionally, the absence of human evaluation limits the assessment of response quality and relevance.
- Research Findings: The resul
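The single-reference concern can be illustrated with a simplified n-gram overlap F1. This is a toy stand-in, not the ROUGE or BLEU implementation used in the paper, and the example sentences are invented. A valid reply scores zero against one reference but scores highly once a second acceptable reference is allowed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(candidate, reference, n=1):
    """Simplified overlap F1 on whitespace tokens (illustrative only)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_over_references(candidate, references, n=1):
    """Score against several references and keep the best match."""
    return max(ngram_f1(candidate, ref, n) for ref in references)


candidate = "that worked thanks"
single_reference = "great it is fixed now"
extra_reference = "thanks that worked"

single_score = ngram_f1(candidate, single_reference)
multi_score = best_over_references(candidate,
                                   [single_reference, extra_reference])
```

Taking the maximum over references is one common way to soften the problem; it does not replace the human evaluation the reviewer asks for.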
Taxonomy
Topics: Speech and dialogue systems
