Transformers Struggle to Learn to Search

Abulhair Saparov; Srushti Pawar; Shreyas Pimpalgaonkar; Nitish Joshi; Richard Yuanzhe Pang; Vishakh Padmakumar; Seyed Mehran Kazemi; Najoung Kim; He He

arXiv:2412.04703·cs.CL·March 18, 2025

TL;DR

This paper uses the graph connectivity problem to test whether transformers can learn to search. With the right training distribution they can, but they struggle on larger graphs, even with more parameters or in-context training.

Contribution

The study introduces a novel interpretability technique to analyze learned algorithms and demonstrates that transformers perform parallel search at each vertex, highlighting limitations in scaling.

Findings

01. Transformers can learn to perform search with the right training distribution.

02. They perform search in parallel at each vertex, expanding reachable sets layer by layer.

03. Scaling to larger graphs remains difficult, even with increased parameters or in-context training.
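
The layer-by-layer expansion described in finding 02 can be pictured with a small sketch. The code below is an illustration only, not the algorithm extracted from the trained model: the function name `expand_reachable_sets`, the edge-list input, and in particular the merge rule (each vertex unions in the sets of the vertices it already reaches) are assumptions chosen to show how every vertex can update its reachable set simultaneously in each round.

```python
def expand_reachable_sets(edges, num_vertices, num_rounds):
    """Hypothetical sketch: grow a reachable set for every vertex in parallel.

    edges is a list of directed (source, target) pairs over vertices
    0..num_vertices-1. Each round plays the role of one "layer".
    """
    # Start with each vertex reaching itself and its direct successors.
    reach = [{v} for v in range(num_vertices)]
    for source, target in edges:
        reach[source].add(target)

    for _ in range(num_rounds):
        # All vertices update at once, reading only the previous round's sets.
        # Assumed merge rule: absorb the sets of every vertex already reached.
        reach = [
            set().union(*(reach[w] for w in reach[v]))
            for v in range(num_vertices)
        ]
    return reach


# A directed path 0 -> 1 -> 2 -> 3 -> 4.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
sets_after_two_rounds = expand_reachable_sets(path_edges, 5, 2)
print(4 in sets_after_two_rounds[0])
```

On this five-vertex path, vertex 0 reaches vertex 2 after one round and vertex 4 after two, because under the assumed merge rule the reachable distance can double each round. Whether the trained transformer uses exactly this rule is not something the summary above states.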

Abstract

Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that transformers perform search at every vertex in parallel: For each vertex…
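
To make the idea of generating unlimited synthetic connectivity data concrete, here is a minimal sketch. It is not the paper's data pipeline or training distribution: the helper names (`sample_dag`, `is_reachable`, `make_example`), the uniform edge probability, and the yes/no reachability label are all assumptions for illustration. The sketch samples a random directed acyclic graph and labels a start/goal pair with breadth-first search.

```python
import random
from collections import deque


def sample_dag(num_vertices, edge_prob, rng):
    """Sample a random DAG by only allowing edges from lower to higher index."""
    return [
        (u, v)
        for u in range(num_vertices)
        for v in range(u + 1, num_vertices)
        if rng.random() < edge_prob
    ]


def is_reachable(edges, start, goal):
    """Breadth-first search: is there a directed path from start to goal?"""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for successor in adjacency.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


def make_example(num_vertices, edge_prob, rng):
    """Produce one (edges, start, goal, label) example (hypothetical format)."""
    edges = sample_dag(num_vertices, edge_prob, rng)
    start, goal = rng.sample(range(num_vertices), 2)
    return edges, start, goal, is_reachable(edges, start, goal)


rng = random.Random(0)  # fixed seed so runs are repeatable
edges, start, goal, label = make_example(8, 0.3, rng)
print(len(edges), start, goal, label)
```

Because each call draws a fresh graph, a generator like this can produce as many examples as needed. The abstract stresses that the choice of training distribution matters, so a uniform sampler like this one should be read as a starting point rather than the distribution that made training succeed.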

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- The study tackles an intriguing and practical research question: understanding the mechanisms behind search capabilities in LLMs. This is not only scientifically interesting but also has meaningful implications for real-world applications.
- Training a small GPT model on synthetic graph data is a reasonable and well-justified approach to investigate this research question.

Weaknesses

The logical flow of the paper is weak in several areas. The authors should clarify the connection between their empirical results and the statements made, as well as provide more intuition behind their hypotheses. For example, in line 51, the authors state, "We demonstrate experimentally that transformers can indeed be taught to search, but only under fairly restrictive conditions on the training distribution." However, Figure 3 does not fully support this claim. While it may indicate that the

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The topic is interesting and may have influence in the community.

Weaknesses

1. There are potential data leakage issues in the training and testing datasets constructed by the authors: The authors use a generation method to generate training data online and save the first few generated results as test data. While the authors claim they will remove overlapping samples between training and test data, they don't explain how they compare whether two graphs are identical. If only using string matching, it cannot determine whether two graphs are completely equal. For example,

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The major strength of the paper lies in its motivation to understand transformer mechanisms in search-based tasks. The authors carefully design experiments to train transformers on directed acyclic graphs, with a careful discussion of the data distribution, and propose a new mechanistic approach to analyze the learned algorithm. The authors discover a message passing algorithm, where the neighborhood information is shared progressively among the vertices, which leads to an

Weaknesses

As such, the paper doesn't have many weaknesses. I have a couple of questions regarding the experimental setup. a) **Sequence length in In-context exploration:** As the experiments require training on longer sequence lengths, how are the samples in the training data distribution decided? How many steps in DFS traces are necessary for the model to learn? If the authors had provided the same 'K' padding tokens to the experiments in section 4, would the models generalize better? b) **Di

Code & Models

Repositories

asaparov/learning_to_search
PyTorch · Official

Videos

Transformers Struggle to Learn to Search · SlidesLive

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training