Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

Yu Huang; Zixin Wen; Aarti Singh; Yuejie Chi; Yuxin Chen

arXiv:2511.07378·cs.LG·November 11, 2025

Open Access

TL;DR

This paper gives a theoretical analysis showing that transformers can learn chain-of-thought reasoning with length generalization, supports the analysis with experiments, and introduces a recursive self-training scheme to extend reasoning length.

Contribution

The work offers the first optimization guarantee that constant-depth transformers can learn complex problems with chain-of-thought reasoning and characterizes the role of attention in length generalization.

Findings

01. Transformers can extrapolate learned reasoning patterns to longer tasks.

02. Recursive self-training extends the reasoning length of transformers.

03. Experiments confirm the theoretical predictions about attention concentration and length generalization.

Abstract

The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a…
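
To make the phrase "state-tracking task" concrete, the following is a minimal sketch of what such a synthetic problem could look like. It is an illustration only, not the paper's construction: the choice of the symmetric group S_3, the helper names `compose` and `make_example`, and the data layout are all assumptions. The input is a sequence of group elements, and the chain of thought is the sequence of running states obtained by composing them one at a time.

```python
from itertools import permutations
import random

# Hypothetical illustration (not the paper's setup): state tracking over S_3.
# A permutation is a tuple p where p[i] is the image of i.
S3 = list(permutations(range(3)))


def compose(p, q):
    """Return the permutation that applies p first, then q."""
    return tuple(q[p[i]] for i in range(len(p)))


def make_example(length, rng):
    """Sample `length` group elements and the chain of running states.

    The chain plays the role of a chain of thought: step t records the
    state after composing the first t inputs.
    """
    inputs = [rng.choice(S3) for _ in range(length)]
    state = tuple(range(3))  # start from the identity permutation
    chain = []
    for g in inputs:
        state = compose(state, g)
        chain.append(state)
    return inputs, chain


rng = random.Random(0)
inputs, chain = make_example(5, rng)
for step, (g, s) in enumerate(zip(inputs, chain), start=1):
    print(f"step {step}: input={g} state={s}")
```

In this sketch, length generalization would mean training on examples with small `length` and evaluating on examples with larger `length`; the final entry of `chain` is the answer to the task.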

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks