Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals

Zihan Dong, Zhixian Zhang, Yang Zhou, Can Jin, Ruijia Wu, and Linjun Zhang

arXiv:2602.03061·cs.LG·February 4, 2026

Open Access

TL;DR

This paper introduces a statistically efficient evaluation framework for mathematical reasoning in large language models, leveraging pairwise comparison signals to improve accuracy and stability of model rankings, especially with limited data.

Contribution

It develops a semiparametric estimator using control variates from pairwise comparisons, achieving variance reduction and more reliable model evaluation in small-sample settings.
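To make the control-variate idea concrete, here is a minimal sketch in Python. It is not the paper's EIF-based one-step estimator; it is the textbook linear control-variate correction applied to a toy simulation. Everything in it is an assumption made for illustration: the function names `control_variate_mean` and `simulate`, the latent-difficulty data model, and the premise that the mean of the comparison signal is known (here 0.5 by symmetry).

```python
import numpy as np


def control_variate_mean(y, c, c_known_mean):
    """Estimate E[Y] using C as a control variate whose mean is assumed known.

    y: 0/1 correctness labels on the evaluated problems (hypothetical data).
    c: 0/1 pairwise-comparison outcomes on the same problems (hypothetical).
    c_known_mean: assumed-known value of E[C].
    """
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    var_c = c.var(ddof=1)
    if var_c == 0.0:
        # A constant control variate carries no information; fall back to the mean.
        return y.mean()
    # Estimated variance-minimizing coefficient: Cov(Y, C) / Var(C).
    beta = np.cov(y, c, ddof=1)[0, 1] / var_c
    return y.mean() - beta * (c.mean() - c_known_mean)


def simulate(n_trials=2000, n=50, seed=0):
    """Compare the sampling variance of the naive and control-variate estimates."""
    rng = np.random.default_rng(seed)
    naive, adjusted = [], []
    for _ in range(n_trials):
        # Toy assumption: a shared latent factor drives both signals.
        z = rng.normal(size=n)
        y = (z + rng.normal(scale=0.5, size=n) > 0).astype(float)
        c = (z + rng.normal(scale=0.5, size=n) > 0).astype(float)
        naive.append(y.mean())
        # E[C] = 0.5 exactly in this toy model, by symmetry of the noise.
        adjusted.append(control_variate_mean(y, c, 0.5))
    return np.var(naive), np.var(adjusted)


if __name__ == "__main__":
    var_naive, var_adjusted = simulate()
    print(f"naive variance:           {var_naive:.5f}")
    print(f"control-variate variance: {var_adjusted:.5f}")
```

In this toy model the comparison signal is correlated with correctness, so the adjusted estimate fluctuates less across repeated small samples than the plain average. How much variance is removed depends on the strength of that correlation.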

Findings

01. Significant variance reduction over naive averaging.

02. Improved ranking accuracy with noisy outputs.

03. Enhanced evaluation stability in small samples.

Abstract

Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound,…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Mathematics, Computing, and Information Processing · Constraint Satisfaction and Optimization