Separations in the Representational Capabilities of Transformers and Recurrent Architectures

Satwik Bhattamishra; Michael Hahn; Phil Blunsom; Varun Kanade

arXiv:2406.09347 · cs.LG · June 14, 2024

TL;DR

This paper compares the representational capabilities of Transformers and RNNs on index lookup, nearest neighbor, bounded Dyck language recognition, and string equality. It shows that the model size each architecture requires differs by task: Transformers are more compact on some tasks and RNNs on others.

Contribution

It provides a theoretical analysis of how the capabilities of Transformers and RNNs differ, including the model size each architecture requires for specific tasks, supported by experimental validation.

Findings

01

A one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size.

02

Constant-size RNNs can recognize bounded Dyck languages, whereas one-layer Transformers require linear size for this task.

03

Two-layer Transformers of logarithmic size can perform decision tasks such as string equality.
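
To make the second finding concrete, here is a minimal Python sketch of why a state of fixed size is enough to recognize a bounded Dyck language. It is an illustration under assumptions, not the paper's construction: the function name `recognize_bounded_dyck`, the default bracket set, and the `max_depth` parameter are all hypothetical. The recognizer reads the input one symbol at a time, as a recurrent model does, and keeps a stack that never holds more than `max_depth` symbols, so its memory does not grow with the input length.

```python
def recognize_bounded_dyck(s, max_depth, pairs=None):
    """Return True if s is well-bracketed with nesting depth <= max_depth.

    Illustrative only: `pairs` maps each opening bracket to its closer
    (an assumed alphabet, not one taken from the paper).
    """
    if pairs is None:
        pairs = {"(": ")", "[": "]"}
    closers = set(pairs.values())

    # The stack is the entire state. Its length is capped at max_depth,
    # so the state size is independent of len(s).
    stack = []
    for ch in s:
        if ch in pairs:
            if len(stack) == max_depth:
                return False  # nesting would exceed the depth bound
            stack.append(ch)
        elif ch in closers:
            if not stack or pairs[stack.pop()] != ch:
                return False  # unmatched or mismatched closer
        else:
            return False  # symbol outside the bracket alphabet
    return not stack  # every opener must have been closed
```

For example, `recognize_bounded_dyck("([])", 2)` accepts, while `recognize_bounded_dyck("((()))", 2)` rejects because the nesting depth is 3.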

Abstract

Transformer architectures have been widely adopted in foundation models. Due to their high inference costs, there is renewed interest in exploring the potential of efficient recurrent architectures (RNNs). In this paper, we analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance, including index lookup, nearest neighbor, recognizing bounded Dyck languages, and string equality. For the tasks considered, our results show separations based on the size of the model required for different architectures. For example, we show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. Conversely, while constant-size RNNs can recognize bounded Dyck languages, we show that one-layer Transformers require a linear size for this task. Furthermore, we show…
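
The index lookup result in the abstract can be illustrated with a small Python sketch. It is a simplified stand-in rather than the paper's construction, and it assumes hard (argmax) attention, 0-indexed positions, and a ±1 binary positional code. The names `position_code` and `index_lookup_by_attention` are hypothetical. Each position is encoded in about log2(N) dimensions, and the query's dot product with a key reaches its maximum only at the matching position, so a single attention step retrieves the requested token.

```python
import math


def position_code(i, width):
    """Encode position i as a vector of +1/-1 entries, one per binary digit."""
    return [1 if (i >> b) & 1 else -1 for b in range(width)]


def index_lookup_by_attention(tokens, query_pos):
    """Return tokens[query_pos] using one hard-attention step.

    Illustrative assumptions: positions are 0-indexed and attention is
    argmax over dot-product scores rather than a softmax.
    """
    n = len(tokens)
    if not 0 <= query_pos < n:
        raise IndexError("query_pos out of range")

    # Logarithmic width: enough bits to give every position a distinct code.
    width = max(1, math.ceil(math.log2(n)))
    keys = [position_code(i, width) for i in range(n)]
    query = position_code(query_pos, width)

    # The score equals `width` only when key and query codes are identical;
    # any differing bit lowers the score by 2.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    best = max(range(n), key=scores.__getitem__)
    return tokens[best]
```

A recurrent model, by contrast, reads the tokens before it sees the queried position, which is the intuition behind the linear hidden-state requirement stated above.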

Videos

Separations in the Representational Capabilities of Transformers and Recurrent Architectures · SlidesLive

Taxonomy

Topics: Modular Robots and Swarm Intelligence · Architecture and Computational Design

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer