Transformers Struggle to Learn to Search
Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He

TL;DR
This paper uses a graph connectivity problem to test whether transformers can learn to search. With the right training distribution they can, but they struggle on larger graphs, and neither more parameters nor in-context training resolves this.
Contribution
The study introduces a novel mechanistic interpretability technique for analyzing the algorithm a trained model has learned, and uses it to show that transformers perform search at every vertex in parallel. It also shows that this ability is difficult to scale to larger graphs.
Findings
Transformers can learn to perform search with the right training distribution.
They perform search in parallel at each vertex, expanding reachable sets layer by layer.
Scaling to larger graphs remains difficult, even with increased parameters or in-context training.
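The layer-by-layer expansion described in the findings can be pictured with a minimal sketch. This is an illustrative assumption, not the circuit extracted in the paper: the function name `expand_reachable_sets` and the merge rule (each vertex absorbs the reachable sets of the vertices it already reaches) are hypothetical choices made for this example.

```python
from typing import Dict, List, Set, Tuple


def expand_reachable_sets(
    num_vertices: int,
    edges: List[Tuple[int, int]],
    num_layers: int,
) -> Dict[int, Set[int]]:
    """Illustrative only: update every vertex's reachable set once per layer."""
    # Before any layer: each vertex knows itself and its direct successors.
    reach: Dict[int, Set[int]] = {v: {v} for v in range(num_vertices)}
    for src, dst in edges:
        reach[src].add(dst)

    for _ in range(num_layers):
        # Every vertex reads only the previous layer's sets, so all
        # vertices could be updated at the same time.
        new_reach: Dict[int, Set[int]] = {}
        for v in range(num_vertices):
            merged = set(reach[v])
            for u in reach[v]:
                merged |= reach[u]  # absorb what each reachable vertex reaches
            new_reach[v] = merged
        reach = new_reach
    return reach


# A directed path 0 -> 1 -> 2 -> 3 -> 4 (hypothetical toy input).
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
after_one = expand_reachable_sets(5, path_edges, num_layers=1)
after_two = expand_reachable_sets(5, path_edges, num_layers=2)
```

In this toy, vertex 0 reaches vertex 2 after one update and vertex 4 only after two, so a fixed number of update steps limits how far the sets can grow. That is one way to picture why larger graphs might be harder, under the assumptions stated above.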
Abstract
Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that transformers perform search at every vertex in parallel: For each vertex…
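As a rough picture of the graph connectivity testbed mentioned in the abstract, the sketch below samples a random directed acyclic graph and labels whether a goal vertex is reachable from a start vertex. Everything here is an assumption made for illustration: the helper names (`sample_dag`, `is_reachable`, `make_example`), the uniform edge sampling, and the dictionary format are hypothetical, and this naive sampler is not claimed to be the training distribution the authors found to work.

```python
import random
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def sample_dag(num_vertices: int, num_edges: int, rng: random.Random) -> List[Edge]:
    """Sample distinct edges (u, v) with u < v, which rules out cycles."""
    candidates = [
        (u, v) for u in range(num_vertices) for v in range(u + 1, num_vertices)
    ]
    return rng.sample(candidates, min(num_edges, len(candidates)))


def is_reachable(edges: List[Edge], start: int, goal: int) -> bool:
    """Breadth-first search from start, following edge direction."""
    adjacency: Dict[int, List[int]] = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def make_example(num_vertices: int, num_edges: int, seed: int) -> dict:
    """Build one labeled instance (hypothetical format)."""
    rng = random.Random(seed)
    edges = sample_dag(num_vertices, num_edges, rng)
    start, goal = rng.sample(range(num_vertices), 2)
    return {
        "edges": edges,
        "start": start,
        "goal": goal,
        "connected": is_reachable(edges, start, goal),
    }


example = make_example(num_vertices=6, num_edges=8, seed=0)
```

Because each instance is generated from a seed, fresh examples can be produced on demand, which is the sense in which such a testbed can supply effectively limitless data.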
Peer Reviews
Decision: ICLR 2025 Poster
- The study tackles an intriguing and practical research question: understanding the mechanisms behind search capabilities in LLMs. This is not only scientifically interesting but also has meaningful implications for real-world applications.
- Training a small GPT model on synthetic graph data is a reasonable and well-justified approach to investigate this research question.
The logical flow of the paper is weak in several areas. The authors should clarify the connection between their empirical results and the statements made, as well as provide more intuition behind their hypotheses. For example, in line 51, the authors state, "We demonstrate experimentally that transformers can indeed be taught to search, but only under fairly restrictive conditions on the training distribution." However, Figure 3 does not fully support this claim. While it may indicate that the
The topic is interesting and may have influence in the community.
1. There are potential data leakage issues in the training and testing datasets constructed by the authors: The authors use a generation method to generate training data online and save the first few generated results as test data. While the authors claim they will remove overlapping samples between training and test data, they don't explain how they compare whether two graphs are identical. If only using string matching, it cannot determine whether two graphs are completely equal. For example,
The major strength of the paper lies in its motivation to understand transformer mechanisms in search-based tasks. The authors carry out a carefully designed experimental exploration, training transformers on directed acyclic graphs with a careful discussion of the data distribution, and propose a new mechanistic approach to analyze the learned algorithm. The authors discover a message passing algorithm, where the neighborhood information is shared progressively among the vertices, which leads to an
As such, the paper doesn't have many weaknesses. I have a couple of questions regarding the experimental setup. a) **Sequence length in In-context exploration:** As the experiments require training on longer sequences, how are the samples in the training data distribution decided? How many steps in DFS traces are necessary for the model to learn? If the authors had provided the same 'K' padding tokens to the experiments in section 4, would the models generalize better? b) **Di
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
