TL;DR
TwinVoice is a comprehensive benchmark designed to evaluate LLMs' ability to simulate diverse human personas across social, private, and narrative contexts, highlighting current limitations in syntactic style and memory recall.
Contribution
The paper introduces TwinVoice, a multi-dimensional benchmark for systematic assessment of LLM-based persona simulation in real-world scenarios.
Findings
Advanced LLMs achieve moderate accuracy in persona simulation
Models struggle with syntactic style and memory recall
Performance remains below human baseline
Abstract
Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper could provide a framework for detailed analysis of digital twins (or human persona replication).
- The paper provides some discussion of the strengths and weaknesses of present LLMs.
- The paper tested multiple models, which strengthens its generalizability.
- The experimental method has issues. The selection of language models is not systematic (and no reason is given in the paper), and the judge model overlaps with the generation model. The agreement measurement is not proper. See Question A.
- The paper's review of previous benchmarks and papers is somewhat shallow. Other researchers have reported similar findings, especially on memory recall and persona consistency. Though the paper aims to be a unified framework, the paper shoul
Please see the weakness section.
**Significant ICLR formatting violation** This submission appears to use 1.0 inch left / right margins, significantly below the regulation of 1.5 inch (“Formatting instructions for ICLR 2026 conference submissions,” Line 30, Line 50). This expands the text width from the mandated 5.5 inch to 5.5 + 0.5 * 2 = 6.5 inch, so 9 pages × (6.5 / 5.5) = 10.64 pages of effective content, exceeding the strict 9-page limit (“At the time of submission, the main text should be 9 pages or fewer… This limit wil
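The reviewer's page-count estimate can be reproduced with a short calculation. This is a minimal sketch that assumes, as the review does, that content scales linearly with text width and that font size and line spacing are unchanged; the function name `effective_pages` is illustrative and not taken from the review or the ICLR instructions.

```python
def effective_pages(pages: float, actual_width_in: float, mandated_width_in: float) -> float:
    """Scale a page count by the ratio of actual to mandated text width."""
    return pages * actual_width_in / mandated_width_in


# Figures quoted in the review: 5.5 in mandated text width, 0.5 in gained per side.
mandated_width = 5.5
actual_width = mandated_width + 0.5 * 2  # 6.5 in
result = effective_pages(9, actual_width, mandated_width)
print(f"{result:.2f}")  # 10.64
```

The linear-width assumption ignores figures, tables, and paragraph breaks, so the result is best read as a rough upper-level estimate rather than an exact page count.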
* The proposed TwinVoice benchmark is more comprehensive than previous works:
  * TwinVoice has over 4,500 personas for evaluation, exceeding the size of prior works. The persona dataset is divided into 3 categories: social persona, interpersonal persona, and narrative persona. This improves the diversity and robustness of persona fidelity evaluation.
  * TwinVoice consists of both real-world and synthetic data, addressing the dominance of synthetic data in existing works.
* The
* While I appreciate the authors for conducting a human study on the proposed LLM-as-a-judge evaluation framework, I still have minor concerns about the pipeline's robustness:
  * First, for the human verification scale, 50 items per judging mode (100 total) for such a big evaluation benchmark might not be enough.
  * Additionally, previous works [1] [2] have revealed robustness issues with LLM-as-a-judge frameworks, and similar biases could lead to robustness issues of the proposed evaluation f
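One way to make the sample-size concern concrete is to look at how wide a confidence interval is at 50 items. The sketch below computes a Wilson score interval for an agreement rate; the counts (45 of 50, and 450 of 500) are hypothetical values chosen for illustration and are not figures from the paper or the review.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical: 45 of 50 human labels agree with the judge (not a figure from the paper).
low_50, high_50 = wilson_interval(45, 50)
# Same hypothetical rate with ten times as many items, for comparison.
low_500, high_500 = wilson_interval(450, 500)
print(f"n=50:  [{low_50:.3f}, {high_50:.3f}]")
print(f"n=500: [{low_500:.3f}, {high_500:.3f}]")
```

Under these assumed counts, the interval at 50 items is noticeably wider than at 500, which is the kind of uncertainty the reviewer appears to be pointing at.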
1. This paper addresses an important problem of evaluating LLMs at human simulation.
2. The scaled-up evaluation protocol of TwinVoice provides more comprehensive information on the human-simulation performance of LLMs.
1. The paper does not provide evidence of the value this new benchmark adds over the existing benchmarks. It is unclear whether the findings in this paper could have been obtained using existing benchmarks and, if not, which component of this benchmark enabled them.
2. There is no clear rationale behind the choice of the six capabilities. They seem to overlap, and not all of them will be relevant at every turn. For example, persona tone will correlate with both lexical and syntactic choices. And not
