Transformers meet Neural Algorithmic Reasoners

Wilfried Bounsi; Borja Ibarz; Andrew Dudzik; Jessica B. Hamrick; Larisa Markeeva; Alex Vitvitskyi; Razvan Pascanu; Petar Veličković

arXiv:2406.09308·cs.CL·June 14, 2024·3 cites

Open Access·3 Reviews

TL;DR

This paper introduces TransNAR, a hybrid model combining Transformers and neural algorithmic reasoners to improve algorithmic reasoning capabilities, achieving significant out-of-distribution performance gains.

Contribution

It presents a novel hybrid architecture with a two-phase training procedure that integrates GNN-based neural algorithmic reasoners into Transformers for enhanced reasoning.

Findings

01. TransNAR outperforms Transformer-only models on CLRS-Text.

02. Significant improvements in out-of-distribution reasoning.

03. Effective integration of GNN-based reasoners with Transformers.

Abstract

Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text datasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings…
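The cross-attention idea in the abstract can be illustrated with a minimal sketch in which token embeddings act as queries and NAR node embeddings supply keys and values. This is not the authors' implementation: the single attention head, the residual update, the dimensions, and the names `cross_attend`, `w_q`, `w_k`, and `w_v` are all assumptions made for illustration, and the random arrays stand in for real Transformer and GNN outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(tokens, nodes, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative assumption).

    tokens: (T, d_tok) token embeddings, used as queries.
    nodes:  (N, d_node) node embeddings, used as keys and values.
    Returns the updated token embeddings and the attention weights.
    """
    q = tokens @ w_q                          # (T, d)
    k = nodes @ w_k                           # (N, d)
    v = nodes @ w_v                           # (N, d_tok)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, N) scaled dot products
    weights = softmax(scores, axis=-1)        # each token's weights sum to 1
    # Residual update (an assumption): tokens keep their own content and
    # add a weighted mix of node values.
    return tokens + weights @ v, weights


# Hypothetical sizes: 5 tokens, 3 graph nodes.
rng = np.random.default_rng(0)
T, N, d_tok, d_node, d = 5, 3, 8, 6, 8
tokens = rng.normal(size=(T, d_tok))   # stand-in for Transformer activations
nodes = rng.normal(size=(N, d_node))   # stand-in for NAR node embeddings
w_q = rng.normal(size=(d_tok, d)) * 0.1
w_k = rng.normal(size=(d_node, d)) * 0.1
w_v = rng.normal(size=(d_node, d_tok)) * 0.1

out, weights = cross_attend(tokens, nodes, w_q, w_k, w_v)
```

In this sketch the node embeddings are only read, never written, so information flows one way from the graph side to the token side; whether and how the actual model restricts the flow is not stated in the text above.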

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 5·Confidence 4

Strengths

1. The TransNAR approach is well-motivated, and the core concept is clearly articulated. 2. The method achieves competitive performance compared to a vanilla Transformer. 3. The study explores a distilled version of the TransNAR transformer, demonstrating improved out-of-distribution robustness over a transformer trained from scratch without GNN support.

Weaknesses

1. The novelty is somewhat limited, as combining text-based transformers with GNNs via cross-attention is well explored in previous work, such as [1] and [2]. [1] Semantic Neural Machine Translation Using AMR. https://aclanthology.org/Q19-1002.pdf [2] AMR-To-Text Generation with Graph Transformer. https://aclanthology.org/2020.tacl-1.2.pdf 2. Figures 4, 5, and 6 are not clear enough; I cannot read them clearly in a printed version. Please consider using at least a table to show th

Reviewer 02·Rating 6·Confidence 4

Strengths

The paper uses both graphs and text so that the new system can also handle natural language inputs. The combination leads to significant performance improvements over the baseline systems. While the integration uses cross-attention very similar to that in multimodal models, the benefits seem significant.

Weaknesses

The paper is sketchy in covering important technical details. For example, the abstract states that a two-phase training procedure is used, yet the main paper does not mention the two-phase training procedure even once. Relatedly, the paper describes system-wide issues only in lines 213 to 223, without mentioning some known issues. For example, this particular implementation of TransNAR uses 6 layers, and it is well known that beyond two to three layers oversmoothing becomes an issue in

Reviewer 03·Rating 3·Confidence 4

Strengths

The paper was clear enough for me to understand the main idea. I appreciate that the paper contained a “Limitations” section.

Weaknesses

The paper proposes a very straightforward idea and presents unsurprising results. Of course a hybrid of a general-purpose model (Transformer) and a task-specific model (NAR) will perform better on the specific task, especially when the model requires special graph-structured data, which is the case for TransNAR. The distillation results were hard for me to process and verify. The error bars (constructed on just 3 samples!) are very large. An obvious baseline is missing: distilling a NAR model

Taxonomy

Topics: Neural Networks and Applications

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Graph Neural Network · Attention Is All You Need · Linear Layer · Multi-Head Attention