RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
Kaiyue Wen, Xingyu Dang, Kaifeng Lyu

TL;DR
This paper analyzes the limitations of RNNs compared to Transformers in solving algorithmic problems, showing that RNNs lack the retrieval capacity needed for certain tasks but can be enhanced to match Transformers' capabilities.
Contribution
The paper provides a theoretical analysis of RNNs' limitations and demonstrates how retrieval techniques and minimal modifications can bridge the gap with Transformers.
Findings
RNNs cannot solve tasks requiring perfect context retrieval.
Transformers can solve associative recall and decide whether a graph is a tree with ease.
Enhancing RNNs with in-context retrieval techniques closes the gap in representation power.
Abstract
This paper investigates the gap in representation powers of Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems. We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers, particularly when enhanced with Chain-of-Thought (CoT) prompting. Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance…
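The two tasks named in the abstract can be made concrete with a small sketch. This is only an illustration of what each task asks for, not the paper's construction or proof technique. The input formats (a list of key-value pairs plus a query, and an edge list over vertices 0..n-1) and the function names `associative_recall` and `is_tree` are assumptions made here for illustration. Both reference solutions keep the whole context available for exact lookup, which is the retrieval capability the abstract identifies as the bottleneck.

```python
from typing import Dict, List, Tuple


def associative_recall(pairs: List[Tuple[str, str]], query: str) -> str:
    """Return the value that was paired with `query` earlier in the context."""
    table: Dict[str, str] = {}
    for key, value in pairs:
        # Assumption: if a key repeats, the most recent value wins.
        table[key] = value
    return table[query]


def is_tree(num_vertices: int, edges: List[Tuple[int, int]]) -> bool:
    """Decide whether an undirected graph on vertices 0..n-1 is a tree.

    Uses the fact that a graph with n vertices is a tree exactly when it
    has n - 1 edges and no cycle. Assumption: the empty graph is not
    counted as a tree.
    """
    if num_vertices == 0 or len(edges) != num_vertices - 1:
        return False

    # Union-find over the vertices; parent[x] == x marks a root.
    parent = list(range(num_vertices))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # this edge closes a cycle (or is a self-loop)
        parent[root_u] = root_v
    return True
```

For example, `associative_recall([("a", "1"), ("b", "2")], "b")` looks up the value stored for `"b"`, and `is_tree(4, [(0, 1), (1, 2), (1, 3)])` checks a star-like graph on four vertices.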
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Dense Connections · Label Smoothing · Adam · Focus
