PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

Meiling Tao; Chenghao Zhu; Dongyi Ding; Tiannan Wang; Yuchen Eleanor Jiang; Wangchunshu Zhou

arXiv:2506.12915·cs.CL·June 17, 2025

TL;DR

PersonaFeedback is a large-scale benchmark designed to evaluate LLMs' ability to generate personalized responses based on explicit user personas, addressing a critical gap in personalization evaluation tools.

Contribution

The paper introduces PersonaFeedback, a new benchmark with human-annotated test cases that decouples persona inference from response generation for more focused evaluation.

Findings

1. State-of-the-art LLMs struggle on complex personalization tasks.
2. Even humans find subtle distinctions challenging in hard cases.
3. Retrieval-augmented methods are not a definitive solution for personalization.

Abstract

With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. This paper decomposes the personalization problem into persona inference and persona conditioned generation, which the reviewer thought was an interesting proposal.

Weaknesses

1. The main concern is over the design of the benchmark, i.e., binary choice evaluation. In essence, the benchmark is measuring the model's ability to recognize personalization (which one of the two responses better reflects a persona), instead of measuring how good the model is at independently generating a high-quality personalized response. Binary discrimination is cognitively and computationally much easier than fluent personalization in open-ended dialogue. This seems a critical distinction

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- 8,298 test instances across 200 high-quality personas, with 9 annotators and agreement-based difficulty tiers (easy/medium/hard). This is better than many LLM-only synthetic personalization benchmarks.
- They show that just retrieving user info and stuffing it in the prompt doesn’t automatically yield better personalization than giving the model a structured persona. This is an important message for practitioners.
- The related-work section actually explains what earlier persona datasets don’

Weaknesses

The task is discriminative (“which answer is more persona-consistent?”), not generative (“write a persona-consistent answer”). That means you can do well with a good reranker without proving you can produce personalized outputs. This narrows what “success” means here. The result “RAG doesn’t help” will be controversial unless they show stronger memory structuring or persona-summarization RAG. Right now it mostly rules out naive RAG.
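
The discriminative setup described in this review can be sketched in a few lines. The following is a minimal illustration, not the benchmark's actual evaluation code: the `Pair` record, the field names, the toy examples, and the word-overlap scorer are all hypothetical stand-ins. It shows how any scoring function (for example a reranker) turns into a binary-choice accuracy number without ever generating a response.

```python
from typing import Callable, NamedTuple, Sequence


class Pair(NamedTuple):
    """One binary-choice test case (hypothetical schema, not the dataset's)."""
    persona: str
    query: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", the human-preferred response


Scorer = Callable[[str, str, str], float]


def overlap_score(persona: str, query: str, response: str) -> float:
    """Toy scorer: count whitespace-separated words shared by persona and response."""
    persona_words = set(persona.lower().split())
    response_words = set(response.lower().split())
    return float(len(persona_words & response_words))


def choose(pair: Pair, score: Scorer) -> str:
    """Pick the response with the higher score (ties go to 'a')."""
    score_a = score(pair.persona, pair.query, pair.response_a)
    score_b = score(pair.persona, pair.query, pair.response_b)
    return "a" if score_a >= score_b else "b"


def binary_choice_accuracy(pairs: Sequence[Pair], score: Scorer) -> float:
    """Fraction of pairs where the scorer's choice matches the human label."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(choose(p, score) == p.preferred for p in pairs)
    return correct / len(pairs)


# Made-up examples for illustration only.
toy_pairs = [
    Pair("vegetarian marathon runner", "What should I eat before a race?",
         "Try a vegetarian pasta bowl the night before.",
         "Grab a steak and a beer.", "a"),
    Pair("retired teacher who loves gardening", "Any weekend ideas?",
         "Go clubbing downtown.",
         "Plant tomatoes; gardening season just started.", "b"),
    Pair("software engineer new to cooking", "How do I make rice?",
         "As an engineer you know cooking is easy.",
         "Treat it like a build script: rinse, measure, simmer.", "b"),
]

accuracy = binary_choice_accuracy(toy_pairs, overlap_score)
```

On these toy pairs the overlap scorer gets the third case wrong because it rewards surface mentions of persona words. A stronger scorer could be swapped in through the same `Scorer` signature, and a high score would still say nothing about generation quality, which is the reviewer's point.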

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. **Important and Practical Issue:** Personalization is an important direction for LLMs but lacks high-quality benchmarks; the angle of decoupling persona inference from personalized generation is quite novel and addresses an overlooked problem.
2. **Solid Data Collection:** All 8,298 samples were manually labeled by majority vote among 9 labelers, and consistency was quantified using Fleiss's Kappa; difficulty stratification was statistically based; multiple rounds of filtering ensured qualit
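
For readers unfamiliar with the agreement statistic mentioned here, the sketch below computes Fleiss's Kappa and a majority-vote label from per-item vote counts. It follows the standard textbook formula; the function names and the toy vote counts are assumptions made for illustration and are not taken from the paper or its dataset.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss's Kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must be rated
    by the same number of raters."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    if p_e == 1.0:
        # All votes fall in one category; kappa is undefined, and this
        # sketch chooses to report full agreement.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


def majority_label(row: Sequence[int]) -> int:
    """Index of the category with the most votes (first one wins ties)."""
    return max(range(len(row)), key=lambda j: row[j])


# Made-up vote counts: 4 items, 9 raters, 2 categories (response A vs. B).
toy_counts = [[9, 0], [0, 9], [7, 2], [4, 5]]
kappa = fleiss_kappa(toy_counts)
labels = [majority_label(row) for row in toy_counts]
```

With an odd number of raters and two categories, as in the 9-labeler setup the reviewer describes, a majority vote can never tie, so the tie-breaking rule in `majority_label` only matters for other configurations.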

Weaknesses

1. **Limited Methodological Innovation** The theoretical contributions are insufficient, mainly relying on empirical studies. Binary choice is not a new method, and the benchmark itself lacks methodological innovation.
2. **Limitations of the Evaluation Method** Binary choices are oversimplified. Personalization is often not black and white; the two answers may simply differ in degree or perspective. Forcing a choice between them is unreasonable. Moreover, it only uses accuracy as an indicator,

Code & Models

Datasets

PersonalAILab/PersonaFeedback
dataset · 42 dl

Taxonomy

Topics: Persona Design and Applications · Innovative Human-Technology Interaction · Technology Use by Older Adults