LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency

Jiachun Li; David Simchi-Levi; Will Wei Sun

arXiv:2604.05460·stat.ME·April 8, 2026

TL;DR

This paper casts LLM evaluation from pairwise comparisons as a low-rank tensor completion problem and derives efficient estimators with uncertainty quantification for structured, noisy, and non-uniformly sampled observations.

Contribution

The paper introduces a semiparametric inference framework, together with a score-whitening method, for stable and optimal uncertainty quantification in low-rank tensor models observed through pairwise comparisons.

Findings

01

Derived the semiparametric efficiency bound for LLM evaluation tensors.

02

Constructed a one-step debiased estimator with asymptotic normality.

03

Introduced a score-whitening technique to stabilize inference in anisotropic models.

Abstract

Large language model (LLM) evaluation platforms increasingly rely on pairwise human judgments. These data are noisy, sparse, and non-uniform, yet leaderboards are reported with limited uncertainty quantification. We study this as semiparametric inference for a low-rank latent score tensor observed through pairwise comparisons under Bradley-Terry-Luce-type models. This places LLM evaluation in a new tensor completion setting with structured observations, non-uniform sampling, and pairwise contrasts. Our target is a smooth functional $\psi(T^{\star})$, including linear estimands such as ability gaps and nonlinear ones such as win probabilities. We derive the information operator on the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound, then construct a one-step debiased estimator with asymptotic normality. A central challenge is that the…
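To make the setup in the abstract concrete, the following is a minimal sketch of a low-rank score tensor observed through Bradley-Terry-Luce comparisons, with one linear functional (an ability gap) and one nonlinear functional (a win probability). It is not the authors' code. The three tensor modes (model, task, judge group), the CP factorization, the logistic link, and all function names are assumptions chosen for illustration; the paper's estimator, efficiency bound, and score-whitening step are not implemented here.

```python
import math
import random


def cp_tensor(A, B, C):
    """Build T[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r] (CP form, rank <= R)."""
    R = len(A[0])
    return [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(R))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]


def ability_gap(T, i, i2, j, k):
    """Linear functional: score of model i minus score of model i2 in cell (j, k)."""
    return T[i][j][k] - T[i2][j][k]


def win_probability(T, i, i2, j, k):
    """Nonlinear functional: BTL probability (logistic link) that i beats i2."""
    return 1.0 / (1.0 + math.exp(-ability_gap(T, i, i2, j, k)))


def sample_comparison(T, i, i2, j, k, rng):
    """Draw one noisy pairwise outcome: 1 if model i wins, else 0."""
    return 1 if rng.random() < win_probability(T, i, i2, j, k) else 0


# Hypothetical sizes: 4 models, 3 tasks, 2 judge groups, rank 2.
rng = random.Random(0)
n_models, n_tasks, n_judges, rank = 4, 3, 2, 2
A = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_models)]
B = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_tasks)]
C = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_judges)]
T = cp_tensor(A, B, C)

gap = ability_gap(T, 0, 1, 2, 1)
p = win_probability(T, 0, 1, 2, 1)
wins = sum(sample_comparison(T, 0, 1, 2, 1, rng) for _ in range(200))
print(f"gap={gap:.3f}  win_prob={p:.3f}  empirical={wins / 200:.3f}")
```

Only the score differences enter the comparison model, so the observations identify contrasts between models within a cell rather than the raw scores themselves; this is the "pairwise contrasts" feature the abstract mentions.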
