Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization

Yash Ganpat Sawant

arXiv:2604.26460 · cs.CL · April 30, 2026

TL;DR

This paper introduces a theory-grounded evaluation method for LLM personalization based on authorship verification, revealing an authorship gap and inconsistencies in existing metrics.

Contribution

It applies authorship verification theory to evaluate LLM personalization, providing calibrated, meaningful scores and exposing limitations of ad hoc metrics.

Findings

01

LUAR provides calibrated baselines: a human ceiling of 0.756 and a cross-author floor of 0.626.

02

All four personalization methods score between 0.484 and 0.508, below the cross-author floor of 0.626.

03

The metrics show near-zero correlation, highlighting inconsistencies in how personalization is evaluated.
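To make the calibration concrete, a raw score can be placed on a scale where the cross-author floor maps to 0 and the human ceiling maps to 1. The sketch below uses the figures reported above; the linear rescaling and the function name `calibrated_position` are illustrative assumptions, not a formula taken from the paper.

```python
# Baselines reported for LUAR in the paper summary.
HUMAN_CEILING = 0.756       # same-author (human) similarity
CROSS_AUTHOR_FLOOR = 0.626  # similarity between different authors


def calibrated_position(score: float,
                        floor: float = CROSS_AUTHOR_FLOOR,
                        ceiling: float = HUMAN_CEILING) -> float:
    """Linearly rescale a raw score so floor -> 0.0 and ceiling -> 1.0.

    Hypothetical helper: the linear rescaling is an assumption made
    for illustration. Negative values mean the score is below the floor.
    """
    return (score - floor) / (ceiling - floor)


# Reported range of scores across the personalization methods.
lowest = calibrated_position(0.484)
highest = calibrated_position(0.508)
print(f"lowest method:  {lowest:.2f}")
print(f"highest method: {highest:.2f}")
```

Under this assumed rescaling both ends of the reported range come out negative, which is another way of saying that the generated text resembles the target author less than text by a different human author does.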

Abstract

Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling of 0.756 and a cross-author floor of 0.626, that give scores absolute meaning. All methods score below this floor, from 0.484 to 0.508, exposing an authorship gap invisible to uncalibrated metrics. The three metrics…
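Of the three measurement traditions named in the abstract, classical function-word stylometrics is the simplest to sketch. The example below is a minimal illustration that assumes a small hand-picked word list and cosine similarity over relative frequencies; the paper's actual feature set and distance measure are not given in this summary, and the names `FUNCTION_WORDS`, `function_word_profile`, and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative word list only (an assumption); real stylometric
# studies typically use a much larger set of function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that",
                  "it", "is", "was", "for", "with", "but", "not", "on")


def function_word_profile(text: str) -> list[float]:
    """Return the relative frequency of each function word in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Usage: compare an author's reference text with a generated text.
reference = "It was not the plan, but the plan was all that we had."
generated = "The results of the study and the method are in the report."
similarity = cosine_similarity(function_word_profile(reference),
                               function_word_profile(generated))
print(f"function-word similarity: {similarity:.3f}")
```

A score like this has no built-in floor or ceiling, which is the contrast the abstract draws with the calibrated LUAR baselines.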
