Autoregressive + Chain of Thought = Recurrent: Recurrence's Role in Language Models' Computability and a Revisit of Recurrent Transformer
Xiang Zhang, Muhammad Abdul-Mageed, Laks V.S. Lakshmanan

TL;DR
This paper explores how recurrence enhances language models' reasoning and computational abilities, analyzing the role of autoregression and Chain of Thought prompting in bridging the gap between Transformers and recurrent models.
Contribution
It introduces the concept of recurrence-completeness, investigates the role of recurrence in reasoning, and revisits recurrent Transformer designs to identify their computational limitations.
Findings
Recurrent structures improve reasoning and computational power in language models.
Chain of Thought prompting can mimic recurrence, enhancing model capabilities.
Certain recurrent Transformer models, such as the Linear Transformer, face fundamental computational limitations.
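The second finding, that Chain of Thought can mimic recurrence, can be illustrated with a toy counting task. The sketch below is not the paper's formal construction. It only assumes that a single forward pass can perform one bounded update, and all names (`step`, `count_recurrent`, `count_with_cot`) are hypothetical. In the recurrent version the running count lives in a hidden state. In the CoT-style version the count is written out as an intermediate token and read back on the next step, so the emitted text plays the role of the recurrent state.

```python
from typing import List, Tuple


def step(state: int, symbol: str) -> int:
    """One bounded update: increment the count when the symbol is 'a'."""
    return state + (1 if symbol == "a" else 0)


def count_recurrent(text: str) -> int:
    """RNN-style counting: the state is carried internally across positions."""
    state = 0
    for ch in text:
        state = step(state, ch)
    return state


def count_with_cot(text: str) -> Tuple[int, List[str]]:
    """CoT-style counting: the state is emitted as text and re-read each step.

    Assumption (illustrative only): each loop iteration stands in for one
    autoregressive decoding step that sees the previously emitted token.
    """
    transcript = ["0"]  # intermediate tokens written into the context
    for ch in text:
        previous = int(transcript[-1])  # re-read the last emitted token
        transcript.append(str(step(previous, ch)))
    return int(transcript[-1]), transcript


if __name__ == "__main__":
    print(count_recurrent("abcaab"))
    print(count_with_cot("abcaab"))
```

Both functions apply the same `step` update the same number of times. The only difference is where the intermediate state is stored, which is the intuition behind reading autoregression plus CoT as a form of recurrence.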
Abstract
The Transformer architecture excels in a variety of language modeling tasks, outperforming traditional neural architectures such as RNN and LSTM. This is partially due to its elimination of recurrent connections, which allows for parallel training and a smoother flow of gradients. However, this move away from recurrent structures places the Transformer model at the lower end of Chomsky's computational hierarchy, imposing limitations on its computational abilities. Consequently, even advanced Transformer-based models face considerable difficulties in tasks like counting, string reversal, and multiplication. These tasks, though seemingly elementary, require a level of computational complexity that exceeds the capabilities of the Transformer architecture. Concurrently, the emergence of "Chain of Thought" (CoT) prompting has enabled Transformer-based language models to tackle tasks that…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Attention Is All You Need · Tanh Activation · Sigmoid Activation · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Long Short-Term Memory · Layer Normalization · Dropout
