Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

Tong Yang; Yu Huang; Yingbin Liang; Yuejie Chi

arXiv:2508.08222·cs.LG·December 9, 2025

TL;DR

This paper provides a theoretical analysis demonstrating that trained one-layer multi-head transformers can provably learn symbolic multi-step reasoning tasks, with guarantees of generalization and insights into the emergence of reasoning abilities.

Contribution

It offers the first provable guarantees for how shallow transformers learn multi-step symbolic reasoning via gradient descent, explaining the emergence of reasoning capabilities.

Findings

01 Trained one-layer transformers can solve chain-of-thought reasoning tasks with generalization guarantees.

02 Attention heads learn to specialize and coordinate to perform complex reasoning steps.

03 Shallow multi-head transformers can implement multi-step reasoning typically associated with deeper models.

Abstract

Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understanding of the underlying mechanisms by which they acquire these abilities through training remains limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization…
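To make the two path-finding tasks concrete, here is a minimal symbolic sketch of what the target outputs look like. It is not the paper's transformer construction or training setup. The child-to-parent dictionary encoding, the function names `backward_path` and `forward_path`, and the toy tree are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical encoding: each non-root node maps to its parent.
Tree = Dict[str, str]


def backward_path(parent: Tree, goal: str, root: str) -> List[str]:
    """Backward task: list the nodes from the goal up to the root."""
    path = [goal]
    # Follow parent pointers one step at a time until the root is reached.
    while path[-1] != root:
        path.append(parent[path[-1]])
    return path


def forward_path(parent: Tree, goal: str, root: str) -> List[str]:
    """Forward task as two stages: find the goal-to-root path, then reverse it."""
    return backward_path(parent, goal, root)[::-1]


# Toy tree (illustrative only): a is the root; b and c are children of a;
# d and e are children of b; f is a child of c.
toy_parent: Tree = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c"}

if __name__ == "__main__":
    print(backward_path(toy_parent, "e", "a"))  # goal-to-root
    print(forward_path(toy_parent, "e", "a"))   # root-to-goal
```

Each loop iteration corresponds to one reasoning step, which is why the backward task fits a chain-of-thought format naturally, while the forward task needs the extra reversal stage described in the abstract.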

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Child and Animal Learning Development