ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Siwei Wang, Yifei Shen, Shi Feng, Haoran Sun, Shang-Hua Teng, Wei Chen

TL;DR
This paper investigates how Transformer-based large language models can develop planning abilities through their next-word prediction mechanism, modeling planning as a path-finding task and analyzing their capacity to learn adjacency and reachability matrices.
Contribution
The paper provides a theoretical framework showing that Transformers can perform path-finding by embedding graph matrices in their weights, and that they can learn these matrices through gradient-based training.
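To make the role of the two matrices concrete, here is a minimal sketch of path-finding driven by an adjacency matrix and a reachability matrix. It is an ordinary greedy procedure written for illustration, not the paper's Transformer construction; the function name `find_path`, the list-of-lists encoding, and the toy graph are all assumptions. At each step the next node must be a neighbor of the current node (adjacency) and must be able to reach the target (reachability).

```python
def find_path(adj, reach, source, target):
    """Greedy path-finding on a directed acyclic graph (illustrative only).

    adj[i][k]   is True if there is an edge i -> k.
    reach[k][t] is True if t can be reached from k in one or more steps.
    """
    n = len(adj)
    path = [source]
    cur = source
    # A simple path visits at most n nodes, so n steps are enough.
    for _ in range(n):
        if cur == target:
            return path
        # Keep only neighbors that are the target or can still reach it.
        candidates = [
            k for k in range(n)
            if adj[cur][k] and (k == target or reach[k][target])
        ]
        if not candidates:
            return None  # no valid continuation from this node
        cur = candidates[0]
        path.append(cur)
    return path if cur == target else None


# Hypothetical toy graph: 0 -> 1 (dead end), 0 -> 2, 2 -> 3.
F, T = False, True
adj = [
    [F, T, T, F],
    [F, F, F, F],
    [F, F, F, T],
    [F, F, F, F],
]
reach = [
    [F, T, T, T],
    [F, F, F, F],
    [F, F, F, T],
    [F, F, F, F],
]

path_0_to_3 = find_path(adj, reach, 0, 3)
path_1_to_3 = find_path(adj, reach, 1, 3)
```

In this toy graph the adjacency matrix alone would allow stepping from node 0 into the dead end at node 1; the reachability check is what rules that move out.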
Findings
Transformers can embed adjacency and reachability matrices within their weights.
Through gradient-based training, they learn the adjacency matrix and a limited form of the reachability matrix.
Current architectures cannot infer reachability through transitivity, limiting path concatenation.
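The gap between a "limited" reachability matrix and full reachability can be sketched as follows. This is a hedged illustration, not the paper's formal definition: it assumes that the limited form records a pair (k, t) only when some training path ending at t passes through k, and compares that against the transitive closure of the learned adjacency matrix. The function names and the four-node chain are hypothetical.

```python
import numpy as np


def adjacency_from_paths(paths, n):
    """Mark an edge u -> v for every consecutive pair seen in a training path."""
    A = np.zeros((n, n), dtype=bool)
    for p in paths:
        for u, v in zip(p, p[1:]):
            A[u, v] = True
    return A


def observed_reachability(paths, n):
    """Assumed 'limited' reachability: k reaches t only if some training
    path ending at t contains k before the final node."""
    R = np.zeros((n, n), dtype=bool)
    for p in paths:
        t = p[-1]
        for k in p[:-1]:
            R[k, t] = True
    return R


def transitive_closure(A):
    """Full reachability via Warshall's algorithm."""
    R = A.copy()
    for k in range(len(A)):
        # i reaches j if i reaches k and k reaches j.
        R |= R[:, k][:, None] & R[k, :][None, :]
    return R


# Hypothetical training set on the chain 0 -> 1 -> 2 -> 3.
paths = [[0, 1], [1, 2], [0, 1, 2], [2, 3]]
n = 4

A = adjacency_from_paths(paths, n)
R_obs = observed_reachability(paths, n)
R_true = transitive_closure(A)

# Pairs that are truly reachable but never observed within a single path.
rows, cols = np.nonzero(R_true & ~R_obs)
missing = sorted((int(i), int(j)) for i, j in zip(rows, cols))
```

Under these assumptions, `missing` holds the pairs (0, 3) and (1, 3): reaching node 3 from node 0 or 1 requires concatenating a path into node 2 with the separately observed path 2 -> 3, which is the kind of transitive inference the findings say current architectures do not make.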
Abstract
Planning is a crucial element of both human intelligence and contemporary large language models (LLMs). In this paper, we initiate a theoretical investigation into the emergence of planning capabilities in Transformer-based LLMs via their next-word prediction mechanisms. We model planning as a network path-finding task, where the objective is to generate a valid path from a specified source node to a designated target node. Our mathematical characterization shows that Transformer architectures can execute path-finding by embedding the adjacency and reachability matrices within their weights. Furthermore, our theoretical analysis of gradient-based learning dynamics reveals that LLMs can learn both the adjacency and a limited form of the reachability matrices. These theoretical insights are then validated through experiments, which demonstrate that Transformer architectures indeed learn…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding
