Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen

TL;DR
This paper provides a theoretical analysis demonstrating that transformers can learn chain-of-thought reasoning with length generalization, supported by experiments, and introduces a recursive self-training scheme to extend reasoning length.
Contribution
The work offers the first optimization guarantee that constant-depth transformers can learn to solve complex problems with chain-of-thought reasoning, and it characterizes the role of attention in length generalization.
Findings
Transformers can extrapolate learned reasoning patterns to longer tasks.
Recursive self-training extends the reasoning length of transformers.
Experiments confirm the theoretical predictions about attention concentration and length generalization.
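As a rough illustration of why attention concentration matters for longer contexts, the sketch below computes the softmax weight placed on a single relevant token when its logit exceeds every other token's logit by a fixed margin. The function name `target_attention_weight` and the single-margin setup are assumptions made for illustration; this is not the paper's construction or proof.

```python
import math


def target_attention_weight(gap: float, context_len: int) -> float:
    """Softmax weight on one target token whose logit exceeds all others by `gap`.

    Assumed toy setup: the target has logit `gap`, and the remaining
    context_len - 1 tokens all have logit 0.
    """
    target = math.exp(gap)
    distractors = context_len - 1  # each distractor contributes exp(0) = 1
    return target / (target + distractors)


if __name__ == "__main__":
    # With a fixed margin, the weight on the target shrinks as the context grows.
    for n in (10, 100, 1000):
        print(n, target_attention_weight(5.0, n))
```

In this toy setting a fixed margin loses its grip as the context lengthens, so keeping retrieval reliable requires the margin to grow roughly like the logarithm of the context length.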
Abstract
The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a…
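To make "state tracking with chain-of-thought" concrete, here is a minimal sketch of one common synthetic version of the task: the inputs are permutations, the state is their running composition, and the chain-of-thought is the list of intermediate states. The choice of permutations and the helper names `compose` and `chain_of_thought` are assumptions for illustration; the paper's exact task definition may differ.

```python
from typing import List, Tuple

Perm = Tuple[int, ...]  # a permutation of 0..k-1, written as a tuple of images


def compose(p: Perm, q: Perm) -> Perm:
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


def chain_of_thought(actions: List[Perm]) -> List[Perm]:
    """Return every intermediate state, starting from the identity.

    The final entry is the answer; the earlier entries play the role of
    the reasoning steps. A longer action list gives a longer chain.
    """
    state: Perm = tuple(range(len(actions[0])))  # identity permutation
    trace: List[Perm] = []
    for action in actions:
        state = compose(action, state)
        trace.append(state)
    return trace


if __name__ == "__main__":
    swap = (1, 0, 2)   # exchange elements 0 and 1
    cycle = (1, 2, 0)  # rotate the three elements
    for step, state in enumerate(chain_of_thought([swap, cycle, swap, cycle]), 1):
        print(step, state)
```

Length generalization in this picture means training on short action lists and evaluating on longer ones, where each step still only needs the previous state and the current action.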
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
