Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
Yi Liu

TL;DR
This paper analyzes how repeated calls to a large language model improve accuracy. It builds a theoretical framework on the first two moments of the per-example success probability and on majority voting, and supports it with experiments on benchmark datasets.
Contribution
It introduces a moment-based approach to quantifying the benefit of repeated LLM inference and derives sharp, distribution-free bounds on majority-vote accuracy at any fixed vote budget, using only two labeled calls per example.
Findings
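Two labeled calls per example identify the mean and second moment of the latent success probability, and hence the same-example correctness correlation.
For three votes, the majority-vote accuracy interval has a closed form, and its width is at most 1/8.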
Empirical results show multi-vote accuracies align with theoretical bounds and are affected by temperature and model mixtures.
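The two-call identification in the findings above can be sketched as a plug-in estimator. This is a minimal illustration, not the authors' code: the function name `two_call_moments` and its inputs are hypothetical, and it assumes each example has two 0/1 correctness labels drawn conditionally i.i.d. given that example's latent success probability q.

```python
import numpy as np

def two_call_moments(first, second):
    """Plug-in estimates of E[q], E[q^2], and the same-example correctness correlation.

    first, second: 0/1 correctness labels of two calls on the same examples.
    Assumes calls are conditionally i.i.d. given the latent success probability q.
    """
    a = np.asarray(first, dtype=float)
    b = np.asarray(second, dtype=float)
    if a.ndim != 1 or a.shape != b.shape or a.size == 0:
        raise ValueError("expected two non-empty 1-D arrays of equal length")

    # One-call accuracy estimates E[q]; averaging both calls uses all labels.
    m = 0.5 * (a.mean() + b.mean())
    # Both calls correct has probability q^2 per example, so this estimates E[q^2].
    s = (a * b).mean()
    # Corr(X1, X2) = (E[q^2] - E[q]^2) / (E[q] (1 - E[q])); undefined if m is 0 or 1.
    denom = m * (1.0 - m)
    rho = (s - m * m) / denom if denom > 0 else float("nan")
    return float(m), float(s), float(rho)
```

With labels `[1, 1, 0, 0]` and `[1, 0, 1, 0]`, the estimates are m = 0.5, s = 0.25, and correlation 0, which is the pattern expected when errors are pure call-level randomness. With identical label vectors `[1, 1, 0, 0]`, the correlation is 1, the pattern of stable errors that resampling cannot fix. On finite samples the plug-in s can fall slightly outside the feasible range [m², m].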
Abstract
Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call identifies the mean latent success probability; two labeled calls identify its second moment and hence the same-example correctness correlation that separates stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a sharp distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates for every finite budget, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form,…
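The abstract's closed form for three votes is cut off in this excerpt, so the sketch below is not the paper's expression. It is one way to obtain a sharp three-vote interval from the two moments, assuming conditional-i.i.d. calls: three-vote majority accuracy equals E[3q² − 2q³] = 3s − 2E[q³], and Cauchy-Schwarz bounds the third moment by s²/m ≤ E[q³] ≤ s − (m − s)²/(1 − m). The name `three_vote_interval` and the symbols m = E[q], s = E[q²] are illustrative choices.

```python
def three_vote_interval(m, s, tol=1e-12):
    """Interval for three-vote majority accuracy given m = E[q] and s = E[q^2].

    Assumes conditional-i.i.d. calls, so accuracy = 3*s - 2*E[q^3].
    The bounds on E[q^3] come from Cauchy-Schwarz and are attained by
    two-point distributions on {0, s/m} and {(m - s)/(1 - m), 1}.
    """
    if not 0.0 < m < 1.0:
        raise ValueError("m must lie strictly between 0 and 1")
    if not (m * m - tol <= s <= m + tol):
        raise ValueError("s must satisfy m^2 <= s <= m")

    # Largest feasible E[q^3] gives the lowest accuracy.
    lower = s + 2.0 * (m - s) ** 2 / (1.0 - m)
    # Smallest feasible E[q^3] gives the highest accuracy.
    upper = 3.0 * s - 2.0 * s * s / m
    return lower, upper
```

Under these assumptions the width simplifies to 2v(m(1 − m) − v)/(m(1 − m)) with v = s − m², which peaks at m(1 − m)/2 ≤ 1/8, consistent with the width stated in the findings. For instance, m = 0.5 and s = 0.375 give the interval [0.4375, 0.5625], and the interval collapses to a point when s = m² (no heterogeneity) or s = m (fully stable outcomes).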
