Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End
Steve Hanneke, Idan Mehalel, Shay Moran

TL;DR
This paper analyzes how the sample complexity of autoregressive models depends on generation length and supervision type, revealing that Chain-of-Thought supervision can eliminate length dependence.
Contribution
It provides a comprehensive taxonomy of sample complexity scaling with generation length and demonstrates the benefits of Chain-of-Thought supervision.
Findings
Under End-to-End supervision, sample complexity can range from constant to linear in the generation length, depending on the hypothesis class.
Chain-of-Thought supervision makes sample complexity independent of generation length.
The analysis introduces new combinatorial tools and resolves open questions about learnability and supervision.
Abstract
Modern large language models generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for T steps, producing a chain of tokens whose final token constitutes the model's output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length T, and how much Chain-of-Thought…
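To make the setup in the abstract concrete, the following is a minimal Python sketch of the two supervision types. It is an illustration under stated assumptions, not code from the paper: the names `generate_chain`, `end_to_end_example`, `chain_of_thought_example`, and the toy generator `xor_of_last_two` are hypothetical, tokens are modeled as integers, and T denotes the number of generation steps.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# A next-token generator maps the sequence so far to the next token.
NextTokenGenerator = Callable[[Sequence[Token]], Token]


def generate_chain(f: NextTokenGenerator, prompt: Sequence[Token], T: int) -> List[Token]:
    """Apply f iteratively for T steps and return the T generated tokens."""
    if T < 1:
        raise ValueError("T must be at least 1")
    seq = list(prompt)
    chain: List[Token] = []
    for _ in range(T):
        nxt = f(seq)      # predict the next token from everything so far
        seq.append(nxt)   # feed it back in (autoregression)
        chain.append(nxt)
    return chain


def end_to_end_example(
    f: NextTokenGenerator, prompt: Sequence[Token], T: int
) -> Tuple[List[Token], Token]:
    """End-to-End supervision: the learner sees only the final token."""
    return list(prompt), generate_chain(f, prompt, T)[-1]


def chain_of_thought_example(
    f: NextTokenGenerator, prompt: Sequence[Token], T: int
) -> Tuple[List[Token], List[Token]]:
    """Chain-of-Thought supervision: the learner sees the whole generated chain."""
    return list(prompt), generate_chain(f, prompt, T)


def xor_of_last_two(seq: Sequence[Token]) -> Token:
    """Hypothetical toy generator over bits: XOR of the last two tokens."""
    return (seq[-1] + seq[-2]) % 2


if __name__ == "__main__":
    prompt = [1, 0]
    print(end_to_end_example(xor_of_last_two, prompt, T=3))        # ([1, 0], 0)
    print(chain_of_thought_example(xor_of_last_two, prompt, T=3))  # ([1, 0], [1, 1, 0])
```

Both training examples come from the same hidden generator; the only difference is how much of the generated chain is revealed to the learner, which is the distinction the sample complexity results are about.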
