Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
Yash Ganpat Sawant

TL;DR
This paper introduces a theory-grounded evaluation method for LLM personalization based on authorship verification, revealing an authorship gap and inconsistencies in existing metrics.
Contribution
It applies authorship verification theory to evaluate LLM personalization, providing calibrated, meaningful scores and exposing limitations of ad hoc metrics.
Findings
LUAR provides calibrated baselines: a human ceiling of 0.756 and a cross-author floor of 0.626.
All four personalization methods scored between 0.484 and 0.508, below the cross-author floor of 0.626.
The three metrics showed near-zero correlation with one another, highlighting evaluation inconsistencies.
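One way to read these numbers is to place each score on a scale where the cross-author floor is 0 and the human ceiling is 1. The rescaling below is an illustrative assumption, not a metric defined by the paper, and `relative_position` is a hypothetical helper name; only the four constants come from the summary above.

```python
# Values reported in the summary above.
HUMAN_CEILING = 0.756       # same-author (human) baseline
CROSS_AUTHOR_FLOOR = 0.626  # different-author baseline

def relative_position(score, floor=CROSS_AUTHOR_FLOOR, ceiling=HUMAN_CEILING):
    """Rescale a score so that 0.0 is the floor and 1.0 is the ceiling.

    Hypothetical helper for illustration; negative values mean the score
    falls below the cross-author floor.
    """
    return (score - floor) / (ceiling - floor)

# Lowest and highest method scores reported in the abstract.
for score in (0.484, 0.508):
    print(f"{score:.3f} -> {relative_position(score):+.2f}")
```

Under this rescaling both ends of the reported range come out negative, which restates the finding: the generated text sits further from the target author than text by a different human author does.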
Abstract
Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling of 0.756 and a cross-author floor of 0.626, that give scores absolute meaning. All methods score below this floor, from 0.484 to 0.508, exposing an authorship gap invisible to uncalibrated metrics. The three metrics…
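The abstract does not spell out how the ceiling and floor are computed. A minimal sketch, assuming they are mean cosine similarities over same-author and cross-author pairs of text embeddings, might look like the following. The function names and the toy vectors are hypothetical stand-ins; real inputs would be embeddings from an authorship verification model such as LUAR.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def calibrated_baselines(embeddings_by_author):
    """Return (ceiling, floor) as mean same-author and cross-author similarity.

    Assumption: both baselines are plain averages over all pairs of texts.
    """
    items = [(author, vec)
             for author, vecs in embeddings_by_author.items()
             for vec in vecs]
    same, cross = [], []
    for (a1, v1), (a2, v2) in combinations(items, 2):
        (same if a1 == a2 else cross).append(cosine(v1, v2))
    return sum(same) / len(same), sum(cross) / len(cross)

# Toy 3-dimensional vectors standing in for model embeddings.
toy = {
    "author_a": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "author_b": [[0.1, 1.0, 0.2], [0.0, 0.9, 0.3]],
}
ceiling, floor = calibrated_baselines(toy)
print(f"ceiling={ceiling:.3f} floor={floor:.3f}")
```

A personalization method would then be scored by comparing its generations with the target author's real texts in the same way, and the result read against these two reference points.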
