TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du; Minghao Guo; Songming He; Ziyi Ye; Xi Zhu; Weihang Su; Shuqi Zhu; Yujia Zhou; Yongfeng Zhang; Qingyao Ai; Yiqun Liu

arXiv:2510.25536 · cs.CL · October 31, 2025

4 Reviews

TL;DR

TwinVoice is a comprehensive benchmark designed to evaluate LLMs' ability to simulate diverse human personas across social, private, and narrative contexts, highlighting current limitations in syntactic style and memory recall.

Contribution

The paper introduces TwinVoice, a multi-dimensional benchmark for systematic assessment of LLM-based persona simulation in real-world scenarios.

Findings

01. Advanced LLMs achieve moderate accuracy in persona simulation

02. Models struggle with syntactic style and memory recall

03. Performance remains below human baseline

Abstract

Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of capability requirements. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper could provide a framework for detailed analysis of digital twins (or human persona replication).
- The paper provides some discussion of the strengths and weaknesses of present LLMs.
- The paper tested multiple models, which strengthens its generalizability.

Weaknesses

- The experimental method has issues. The selection of language models is not systematic (and no reason is specified in the paper), and the judge model overlaps with the generation model. The agreement measurement is not proper. See Question A.
- The paper's review of previous benchmarks and papers is somewhat shallow. Other researchers have reported similar findings, especially on memory recall and persona consistency. Though the paper aims to be a unified framework, the paper shoul

Reviewer 02 · Rating 2 · Confidence 5

Strengths

Please see the weakness section.

Weaknesses

**Significant ICLR formatting violation**

This submission appears to use 1.0 inch left / right margins, significantly below the regulation of 1.5 inch (“Formatting instructions for ICLR 2026 conference submissions,” Line 30, Line 50). This expands the text width from the mandated 5.5 inch to 5.5 + 0.5 * 2 = 6.5 inch, so 9 pages × (6.5 / 5.5) = 10.64 pages of effective content, exceeding the strict 9-page limit (“At the time of submission, the main text should be 9 pages or fewer… This limit wil
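The page-count estimate in this review can be reproduced with a few lines of arithmetic. The sketch below only checks the numbers quoted above; the function name `effective_pages` and its parameters are illustrative assumptions, and it scales page count linearly with text width, as the review does, which ignores line-breaking and figure-layout effects.

```python
def effective_pages(pages: float, actual_width_in: float, mandated_width_in: float) -> float:
    """Scale a page count by the ratio of actual to mandated text width.

    Assumption: content scales linearly with text width (the review's own model).
    """
    return pages * (actual_width_in / mandated_width_in)


# Figures quoted in the review: 5.5 in mandated width, 0.5 in gained on each side.
mandated_width = 5.5
actual_width = mandated_width + 0.5 * 2  # 6.5 in

estimate = effective_pages(9, actual_width, mandated_width)
print(f"{estimate:.2f}")  # 10.64
```

Under that linear assumption the estimate comes to roughly 1.6 pages over the 9-page limit the review cites.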

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The proposed TwinVoice benchmark is more comprehensive than previous works:
  * TwinVoice has over 4,500 personas for evaluation, exceeding the size of prior works. The persona dataset is divided into 3 categories: social persona, interpersonal persona, and narrative persona. This improves the diversity and robustness of persona fidelity evaluation.
  * TwinVoice consists of both real-world and synthetic data, addressing the dominance of synthetic data in existing works.
  * The

Weaknesses

* While I appreciate that the authors conducted a human study on the proposed LLM-as-a-judge evaluation framework, I still have minor concerns about the pipeline's robustness:
  * First, for the human verification scale, 50 items per judging mode (100 total) might not be enough for such a big evaluation benchmark.
  * Additionally, previous works [1] [2] have revealed robustness issues with LLM-as-a-judge frameworks, and similar biases could lead to robustness issues of the proposed evaluation f
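One way to make the sample-size concern concrete is to look at how wide a confidence interval is at 50 items. The sketch below computes a Wilson score interval for a binomial proportion. The agreement counts (40 of 50, and 400 of 500) are hypothetical values chosen for illustration and are not taken from the paper, and the helper name `wilson_interval` is an assumption of this sketch.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# HYPOTHETICAL: human and LLM judge agree on 40 of 50 items (not a figure from the paper).
low, high = wilson_interval(40, 50)
print(f"n=50:  [{low:.3f}, {high:.3f}]")

# Same hypothetical agreement rate with ten times as many items, for comparison.
low_big, high_big = wilson_interval(400, 500)
print(f"n=500: [{low_big:.3f}, {high_big:.3f}]")
```

With these hypothetical counts, the interval at 50 items spans more than 20 percentage points, while the interval at 500 items is much narrower. This is the kind of uncertainty the reviewer's comment points at.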

Reviewer 04 · Rating 2 · Confidence 3

Strengths

1. This paper addresses an important problem of evaluating LLMs at human simulation.
2. The scaled-up evaluation protocol of TwinVoice provides more comprehensive information on the human-simulation performance of LLMs.

Weaknesses

1. The paper does not provide evidence of the value this new benchmark adds over existing benchmarks. It is unclear whether the findings in this paper could have been obtained using existing benchmarks and, if not, which component of this benchmark enabled them.
2. There is no clear rationale behind the choice of the six capabilities. They seem to overlap, and not all of them will be relevant at every turn. For example, persona tone will correlate with both lexical and syntactic choices. And not
