Transformers meet Neural Algorithmic Reasoners
Wilfried Bounsi, Borja Ibarz, Andrew Dudzik, Jessica B. Hamrick, Larisa Markeeva, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković

TL;DR
This paper introduces TransNAR, a hybrid model combining Transformers and neural algorithmic reasoners to improve algorithmic reasoning capabilities, achieving significant out-of-distribution performance gains.
Contribution
It presents a novel hybrid architecture with a two-phase training procedure that integrates GNN-based neural algorithmic reasoners into Transformers for enhanced reasoning.
Findings
TransNAR outperforms Transformer-only models on CLRS-Text.
Significant improvements in out-of-distribution reasoning.
Effective integration of GNN-based reasoners with Transformers.
Abstract
Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text datasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings…
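The cross-attention step mentioned in the abstract, where language-model tokens attend to NAR node embeddings, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single head with no learned query/key/value projections and no layer normalization, the residual update is an assumption, and the name `cross_attend` and the toy embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(token_embs, node_embs):
    """Single-head cross-attention sketch (hypothetical helper).

    Tokens act as queries; node embeddings act as both keys and values.
    Learned projections are omitted, so both inputs must share the same
    embedding width.
    """
    dim = len(token_embs[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in token_embs:
        # Scaled dot-product score between this token and every node.
        scores = [
            scale * sum(q * k for q, k in zip(query, node))
            for node in node_embs
        ]
        weights = softmax(scores)
        # Attention-weighted average of the node embeddings.
        context = [
            sum(w * node[j] for w, node in zip(weights, node_embs))
            for j in range(dim)
        ]
        # Assumed residual update: the token keeps its own content and
        # adds what it read from the graph.
        updated.append([q + c for q, c in zip(query, context)])
    return updated


# Toy example: 2 tokens attending to 3 graph nodes, embedding width 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attend(tokens, nodes)
```

The result has one updated embedding per token, so the token sequence length is unchanged while each token now carries information read from the graph side.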
Peer Reviews
Decision·Submitted to ICLR 2025
1. The TransNAR approach is well-motivated, and the core concept is clearly articulated. 2. The method achieves competitive performance compared to a vanilla Transformer. 3. The study explores a distilled version of the TransNAR transformer, demonstrating improved out-of-distribution robustness over a transformer trained from scratch without GNN support.
1. The novelty is somewhat limited, as combining text-based transformers with GNNs via cross-attention is well explored in previous works, such as [1] and [2]. [1] https://aclanthology.org/Q19-1002.pdf. Semantic Neural Machine Translation Using AMR. [2] https://aclanthology.org/2020.tacl-1.2.pdf. AMR-To-Text Generation with Graph Transformer. 2. Figures 4, 5 and 6 are not clear enough. I cannot clearly read them in a print version. Please consider using at least a table to show th
The paper uses graphs and texts so that the new system can use the natural language inputs as well. The combination leads to significant performance improvements over the baseline systems. While the integration uses cross attention very similar to those in multimodal models, the benefits seem significant.
The paper is sketchy in covering important technical details. For example, the abstract states that a two-phase training procedure is used, yet the main paper does not mention the two-phase training procedure even once. Related to this, the paper describes the system-wide issues only in lines 213 to 223, without mentioning some known issues. For example, the particular implementation of TransNAR uses 6 layers, and it is well known that beyond two to three layers oversmoothing becomes an issue in
The paper was clear enough for me to understand the main idea. I appreciate that the paper contained a “Limitations” section.
The paper proposes a very straightforward idea and presents unsurprising results. Of course, a hybrid of a general-purpose model (transformer) and a task-specific model (NAR) will perform better on the specific task, especially when the model requires special graph-structured data, which is the case for TransNAR. The distillation results were hard for me to process and verify. The error bars (constructed on just 3 samples!) are very large. An obvious baseline is missing: distilling a NAR model
Taxonomy
Topics: Neural Networks and Applications
Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Graph Neural Network · Attention Is All You Need · Linear Layer · Multi-Head Attention
