Sample Complexity and Representation Ability of Test-time Scaling Paradigms

Baihe Huang; Shanda Li; Tianhao Wu; Yiming Yang; Ameet Talwalkar; Kannan Ramchandran; Michael I. Jordan; Jiantao Jiao

arXiv:2506.05295·cs.LG·June 13, 2025


Open Access · 3 Reviews

TL;DR

This paper provides a theoretical analysis of test-time scaling methods for large language models, revealing their sample efficiency differences and demonstrating how self-correction enhances multi-task capabilities.

Contribution

It establishes sample complexity bounds for self-consistency and best-of-$n$ strategies, and shows how self-correction enables Transformers to perform multi-task learning at test time.

Findings

01

Self-consistency requires $\Theta(1/\Delta^2)$ samples, while best-of-$n$ needs only $\Theta(1/\Delta)$.

02

Self-correction with verifier feedback allows Transformers to simulate online learning.

03

Empirical validation confirms the effectiveness of self-correction methods.
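The gap between the two repeated-sampling strategies in Finding 01 can be illustrated with a small Monte Carlo sketch. This is not the paper's experiment: the answer distribution, the function names (`majority_vote`, `best_of_n`, `success_rates`), and the convention that answer `0` is the correct one are all assumptions made for illustration, and the verifier is assumed to be perfect. Both strategies are scored on the same draws, so best-of-$n$ can never do worse here.

```python
import random
from collections import Counter

def majority_vote(samples):
    """Self-consistency: return the most frequent answer (ties go to the first seen)."""
    return Counter(samples).most_common(1)[0][0]

def best_of_n(samples, is_correct):
    """Best-of-n with an assumed perfect verifier: return the first accepted sample."""
    for s in samples:
        if is_correct(s):
            return s
    return samples[0]  # nothing was accepted; fall back to an arbitrary sample

def success_rates(probs, n, trials=2000, seed=0):
    """Estimate how often each strategy returns answer 0 (assumed correct)."""
    rng = random.Random(seed)
    answers = list(range(len(probs)))
    sc_hits = bon_hits = 0
    for _ in range(trials):
        # Draw n answers once and score both strategies on the same draws.
        samples = rng.choices(answers, weights=probs, k=n)
        sc_hits += majority_vote(samples) == 0
        bon_hits += best_of_n(samples, lambda a: a == 0) == 0
    return sc_hits / trials, bon_hits / trials

if __name__ == "__main__":
    # Hypothetical distribution: correct answer at 0.2, eight rivals at 0.1 (gap 0.1).
    probs = [0.2] + [0.1] * 8
    for n in (5, 10, 20, 40):
        sc, bon = success_rates(probs, n)
        print(f"n={n:3d}  self-consistency={sc:.3f}  best-of-n={bon:.3f}")
```

With a small gap, the majority vote needs many more samples before the correct answer reliably outnumbers its rivals, whereas best-of-$n$ only needs the correct answer to appear once.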

Abstract

Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies (such as self-consistency, best-of-$n$, and self-correction) remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^{2})$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably…
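The idea of self-correction as online learning over a pool of experts can be sketched in a few lines. This is a minimal illustration and not the paper's Transformer construction: it swaps in a simple multiplicative-weights-style rule that down-weights any expert whose answer the verifier rejects. The names `self_correct`, `experts`, `verifier`, and the step size `eta` are hypothetical.

```python
import math
import random

def self_correct(experts, verifier, query, rounds, eta=1.0, seed=0):
    """Pick experts at random by weight; penalise each one the verifier rejects.

    Returns (answer, expert_index, rounds_used), or (None, None, rounds)
    if no answer was accepted within the round budget.
    """
    rng = random.Random(seed)
    weights = [1.0] * len(experts)
    indices = list(range(len(experts)))
    for t in range(rounds):
        k = rng.choices(indices, weights=weights, k=1)[0]
        answer = experts[k](query)
        if verifier(query, answer):
            return answer, k, t + 1
        # Verifier feedback: shrink the weight of the expert that just failed.
        weights[k] *= math.exp(-eta)
    return None, None, rounds

if __name__ == "__main__":
    # Hypothetical pool: three "experts", only one of which solves the task (squaring).
    experts = [lambda q: q + 1, lambda q: q * 2, lambda q: q ** 2]
    verifier = lambda q, a: a == q ** 2  # assumed to be accurate
    print(self_correct(experts, verifier, query=3, rounds=50))
```

The sketch relies on an accurate verifier, the same assumption the reviewers flag below; with noisy feedback the weight update would need a more careful treatment.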

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- This work provides a solid theoretical basis for two widely-used but poorly-understood practical heuristics (BofN vs. Self-consistency). The $\Theta(1/\Delta)$ vs. $\Theta(1/\Delta^2)$ separation result is clear, important, and appears fundamental.
- The framework of a "General-Purpose Transformer" and "test-time online learning" is a novel perspective. The proof that a Transformer architecture can (by construction) implement regret-minimizing online learning is a significant extension of Tra

Weaknesses

- The theoretical construction of the "General-Purpose Transformer" (Propositions 4.2, 4.4) appears highly complex and relies on a specific "Generalized Position Encoder" (Definition 2.2) and attention sink techniques. This feels more like an existence proof (i.e., "we can construct a Transformer that does this") rather than an explanation of how existing LLMs might learn this behavior through standard pre-training.
- The proof of self-correction's representation ability relies on a non-standar

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper connects strands across CoT scaling and verification and makes a clear theoretical contribution on sampling and self-correction.
- The paper has good technical depth, and the mathematical statements and proofs are rigorous, with matching upper and lower bounds and complementary experiments.
- It is also interesting that the general-purpose Transformer constructions manage to route to the correct expert in far fewer than $K$ trials, where $K$ trials would amount to brute-force search.

Weaknesses

- The separation results assume a perfect reward for best-of-n; the theory does not capture settings with noisy or imperfect verification.
- The unified construction of the Transformer using experts is already engineered to convey the claim that the Transformer does online learning over a pool of experts with verification, so the conclusion feels built-in. If it were the other way around (i.e., an inductive bias of a trained Transformer toward forming experts), the story would be more convincing.
- It would be g

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper provides a novel theoretical framework to analyze the sample complexity of Test-Time Scaling (TTS) methods, establishing a clear separation result between self-consistency ($\Theta(1/\Delta^2)$) and best-of-n ($\Theta(1/\Delta)$).
2. It offers a new perspective on self-correction, proving its representational power to enable a single Transformer to simulate online learning over a pool of experts (Bandit problem) at test time, thus extending the theory of Transformers from sin

Weaknesses

1. The paper lacks a unified motivation, splitting into two seemingly disconnected parts: the sample complexity analysis of repeated sampling methods (self-consistency and best-of-n) and the theoretical analysis of Self-Correction with Verifier Feedback. The connection between the two main results is not clearly established.
2. The analysis of Self-Correction with Verifier Feedback relies on the existence of an _accurate_ verifier, which is generally not a realistic assumption for a test-

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing

Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer