A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel

TL;DR
This paper develops an online learning theory for autoregressive chain-of-thought reasoning in large language models, analyzing mistake bounds and the impact of feedback types on learning efficiency.
Contribution
It introduces an online framework for autoregressive learning, characterizes mistake bounds under different feedback models, and resolves open questions from prior work.
Findings
Under End-to-End feedback, the optimal mistake bound grows at a rate between constant and logarithmic in the horizon M.
In the Chain-of-Thought model, access to full trajectories removes the dependence on M.
Optimal mistake bounds are established for linear threshold classes.
Abstract
Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for M steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after M autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire M-step trajectory. Our goal is to understand how the optimal mistake bound…
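The difference between the two feedback models can be illustrated with a small sketch. It is not taken from the paper: the function names, the convention that each generated token is appended to the growing string, and the toy XOR generator are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical type: a next-token generator maps the string so far to one token.
NextTokenGenerator = Callable[[Tuple[Token, ...]], Token]


def rollout(f: NextTokenGenerator, prompt: Sequence[Token], M: int) -> List[Token]:
    """Apply f autoregressively for M steps and return the M generated tokens.

    Assumption: each new token is appended to the string before the next call.
    """
    if M < 1:
        raise ValueError("M must be at least 1")
    seq = tuple(prompt)
    generated: List[Token] = []
    for _ in range(M):
        tok = f(seq)
        generated.append(tok)
        seq = seq + (tok,)
    return generated


def end_to_end_feedback(f: NextTokenGenerator, prompt: Sequence[Token], M: int) -> Token:
    """End-to-End: the learner sees only the final token after M steps."""
    return rollout(f, prompt, M)[-1]


def chain_of_thought_feedback(f: NextTokenGenerator, prompt: Sequence[Token], M: int) -> List[Token]:
    """Chain-of-Thought: the learner sees the entire M-step trajectory."""
    return rollout(f, prompt, M)


def xor_last_two(seq: Tuple[Token, ...]) -> Token:
    """Toy generator (illustrative only): XOR of the last two tokens."""
    return seq[-1] ^ seq[-2]


trajectory = chain_of_thought_feedback(xor_last_two, (1, 0), M=3)
final_token = end_to_end_feedback(xor_last_two, (1, 0), M=3)
```

In this sketch the End-to-End learner receives a single token per round, while the Chain-of-Thought learner receives all M intermediate tokens, of which the last coincides with the End-to-End signal.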
