Separations in the Representational Capabilities of Transformers and Recurrent Architectures
Satwik Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade

TL;DR
This paper compares the representational capabilities of Transformers and RNNs across several tasks and shows size-based separations in both directions: Transformers are more compact for index lookup and certain decision tasks, while RNNs are more compact for recognizing bounded Dyck languages.
Contribution
It provides a theoretical analysis of the differences in capabilities between Transformers and RNNs, including size requirements for specific tasks, supported by experimental validation.
Findings
A one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size.
Constant-size RNNs can recognize bounded Dyck languages, whereas one-layer Transformers require linear size.
Two-layer Transformers of logarithmic size can perform decision tasks such as string equality.
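To make the bounded Dyck finding concrete, here is a minimal sketch of a streaming recognizer whose memory depends only on the depth bound and not on the input length. It is a plain Python automaton, not an RNN and not the paper's construction; the function name `recognizes_bounded_dyck`, the two bracket types, and the `max_depth` parameter are assumptions made for illustration.

```python
# Illustrative only: a left-to-right recognizer for well-bracketed words
# over two bracket types with nesting depth capped at `max_depth`.
PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())


def recognizes_bounded_dyck(word: str, max_depth: int) -> bool:
    """Return True iff `word` is well-bracketed with depth <= max_depth."""
    stack = []  # holds at most max_depth symbols, regardless of len(word)
    for symbol in word:
        if symbol in OPENERS:
            if len(stack) == max_depth:
                return False  # opening here would exceed the depth bound
            stack.append(symbol)
        elif symbol in PAIRS:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[symbol]:
                return False
        else:
            return False  # symbol outside the bracket alphabet
    return not stack  # accept only if every opener was closed
```

Because the stack never grows past `max_depth`, the state carried between symbols is constant in the sequence length, which is the intuition behind a fixed-size recurrent model sufficing for this task.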
Abstract
Transformer architectures have been widely adopted in foundation models. Due to their high inference costs, there is renewed interest in exploring the potential of efficient recurrent architectures (RNNs). In this paper, we analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance, including index lookup, nearest neighbor, recognizing bounded Dyck languages, and string equality. For the tasks considered, our results show separations based on the size of the model required for different architectures. For example, we show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. Conversely, while constant-size RNNs can recognize bounded Dyck languages, we show that one-layer Transformers require a linear size for this task. Furthermore, we show…
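The index lookup claim can be illustrated with a small attention sketch. It assumes the task is to return the token stored at a queried position, and it encodes positions as ±1 bit vectors of width about log2 of the sequence length; the paper's actual construction may use different encodings. The names `bit_code`, `index_lookup`, and `sharpness` are hypothetical.

```python
import math


def bit_code(position: int, width: int) -> list[int]:
    """Encode `position` as a vector of `width` entries in {+1, -1}."""
    return [1 if (position >> b) & 1 else -1 for b in range(width)]


def index_lookup(tokens: list[str], index: int, sharpness: float = 50.0) -> str:
    """Retrieve tokens[index] with one softmax attention step (sketch)."""
    n = len(tokens)  # assumed non-empty
    width = max(1, (n - 1).bit_length())  # about log2(n) bits per position
    keys = [bit_code(i, width) for i in range(n)]
    query = bit_code(index, width)
    # The matching key scores `width`; every other key scores at most width - 2.
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    weights = [w / total for w in weights]
    # Decode by taking the position that receives the most attention mass.
    best = max(range(n), key=weights.__getitem__)
    return tokens[best]
```

For example, `index_lookup(list("abcdefgh"), 5)` attends almost entirely to position 5. The key and query vectors have only logarithmically many coordinates, whereas a model that reads the sequence before seeing the index must keep enough state to answer for any position.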
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Architecture and Computational Design
Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
