Sample Complexity and Representation Ability of Test-time Scaling Paradigms
Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao

TL;DR
This paper provides a theoretical analysis of test-time scaling methods for large language models, revealing their sample efficiency differences and demonstrating how self-correction enhances multi-task capabilities.
Contribution
It establishes sample complexity bounds for self-consistency and best-of-$n$ strategies, and shows how self-correction enables Transformers to perform multi-task learning at test time.
Findings
Self-consistency requires $\Theta(1/\Delta^2)$ samples, while best-of-$n$ needs $\Theta(1/\Delta)$.
Self-correction with verifier feedback allows Transformers to simulate online learning.
Empirical validation confirms the effectiveness of self-correction methods.
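The gap between the two repeated-sampling strategies can be illustrated with a small Monte Carlo simulation. This is a minimal sketch under stated assumptions, not the paper's analysis: the two-answer distribution, the uniform tie-breaking rule, the perfect verifier for best-of-$n$, and all function names (`self_consistency_success`, `best_of_n_success`, `compare`) are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def self_consistency_success(probs, n, trials, rng):
    """Estimate how often a majority vote over n samples returns answer 0."""
    answers = list(range(len(probs)))
    wins = 0
    for _ in range(trials):
        counts = Counter(rng.choices(answers, weights=probs, k=n))
        top = max(counts.values())
        tied = [a for a, c in counts.items() if c == top]
        # Assumption of this sketch: ties are broken uniformly at random.
        if rng.choice(tied) == 0:
            wins += 1
    return wins / trials


def best_of_n_success(probs, n, trials, rng):
    """Estimate how often answer 0 appears at least once among n samples.

    Assumption: a perfect verifier picks the correct answer whenever it
    is present in the sample.
    """
    answers = list(range(len(probs)))
    wins = 0
    for _ in range(trials):
        if 0 in rng.choices(answers, weights=probs, k=n):
            wins += 1
    return wins / trials


def compare(delta, sample_sizes, trials=1000, seed=0):
    """Return (n, self-consistency rate, best-of-n rate) for each n."""
    rng = random.Random(seed)
    # Toy two-answer distribution: the correct answer (index 0) leads by delta.
    probs = [0.5 + delta / 2, 0.5 - delta / 2]
    return [
        (n,
         self_consistency_success(probs, n, trials, rng),
         best_of_n_success(probs, n, trials, rng))
        for n in sample_sizes
    ]


if __name__ == "__main__":
    for n, sc, bon in compare(delta=0.1, sample_sizes=[5, 25, 100, 400]):
        print(f"n={n:4d}  self-consistency={sc:.3f}  best-of-n={bon:.3f}")
```

In this toy setting the correct answer has probability close to 0.5, so best-of-$n$ succeeds after a handful of samples, while the majority vote needs many more samples to resolve a small gap $\Delta$. Under the sketch's perfect-verifier assumption, best-of-$n$ becomes hard only when the correct answer is itself rare, which is one way to read the $1/\Delta$ scaling.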
Abstract
Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably…
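To make "online learning over a pool of experts" with verifier feedback concrete, the following sketch uses the standard Exp3 bandit algorithm as a stand-in. It is not the paper's Transformer construction: the function name `exp3_route`, the expert accuracies, the binary verifier signal, and the exploration rate `gamma` are all assumptions chosen for illustration.

```python
import math
import random


def exp3_route(accuracies, rounds, gamma=0.1, seed=0):
    """Exp3-style routing over experts using binary verifier feedback.

    accuracies[i] is the (hypothetical) probability that expert i produces
    an answer the verifier accepts. Only the chosen expert is verified in
    each round, i.e. the learner sees bandit feedback.
    """
    rng = random.Random(seed)
    num_experts = len(accuracies)
    log_w = [0.0] * num_experts  # log-weights avoid floating-point overflow
    picks = [0] * num_experts
    for _ in range(rounds):
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)
        # Mix the weight distribution with uniform exploration.
        probs = [(1 - gamma) * wi / total + gamma / num_experts for wi in w]
        i = rng.choices(range(num_experts), weights=probs, k=1)[0]
        picks[i] += 1
        # Simulated verifier: accepts the answer with probability accuracies[i].
        reward = 1.0 if rng.random() < accuracies[i] else 0.0
        # Importance-weighted reward estimate for the chosen expert only.
        log_w[i] += gamma * (reward / probs[i]) / num_experts
    best = max(range(num_experts), key=lambda j: log_w[j])
    return best, picks


if __name__ == "__main__":
    best, picks = exp3_route([0.1, 0.1, 0.9, 0.1], rounds=1000)
    print("highest-weight expert:", best)
    print("times each expert was chosen:", picks)
```

The point of the sketch is only that repeated attempts plus accept/reject feedback are enough to concentrate on the expert suited to the task at hand; how a Transformer could implement such an update internally is the subject of the paper's expressiveness result.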
Peer Reviews
Decision·ICLR 2026 Poster
- This work provides a solid theoretical basis for two widely-used but poorly-understood practical heuristics (BofN vs. Self-consistency). The $\Theta(1/\Delta)$ vs. $\Theta(1/\Delta^2)$ separation result is clear, important, and appears fundamental. - The framework of a "General-Purpose Transformer" and "test-time online learning" is a novel perspective. The proof that a Transformer architecture can (by construction) implement regret-minimizing online learning is a significant extension of Tra
- The theoretical construction of the "General-Purpose Transformer" (Propositions 4.2, 4.4) appears highly complex and relies on a specific "Generalized Position Encoder" (Definition 2.2) and attention sink techniques. This feels more like an existence proof (i.e., "we can construct a Transformer that does this") rather than an explanation of how existing LLMs might learn this behavior through standard pre-training. - The proof of self-correction's representation ability relies on a non-standar
- The paper connects strands across CoT scaling and verification and makes a clear theoretical contribution on sampling and self‑correction. - The paper has good technical depth, and the mathematical statements/proofs are rigorous with matching upper/lower bounds, together with complementary experiments. - It is also interesting that the general‑purpose Transformer constructions manage to route to the correct expert in far fewer than $K$ trials, where $K$ trials would correspond to brute‑force search.
- The separation results assume a perfect reward for best‑of‑n; the theory does not capture settings with noisy/imperfect verification. - The unified construction of the Transformer using experts is already engineered to convey the claim that the Transformer does online learning over a pool of experts with verification, so the conclusion feels built‑in. If it were the other way around (i.e., an inductive bias of a trained Transformer toward forming experts), the story would be more convincing. - It would be g
1. The paper provides a novel theoretical framework to analyze the sample complexity of Test-Time Scaling (TTS) methods, establishing a clear separation result between self-consistency ($\Theta(1/\Delta^2)$) and best-of-n ($\Theta(1/\Delta)$). 2. It offers a new perspective on self-correction, proving its representational power to enable a single Transformer to simulate online learning over a pool of experts (Bandit problem) at test time, thus extending the theory of Transformers from sin
1. The paper lacks a unified motivation, splitting into two seemingly disconnected parts: the sample complexity analysis of repeated sampling methods (self-consistency and best-of-n) and the theoretical analysis of Self-Correction with Verifier Feedback. The connection between the two main results is not clearly established. 2. The analysis of Self-Correction with Verifier Feedback relies on the existence of an _accurate_ verifier, which is generally not a realistic assumption for a test-
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing
Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer
