Autoregressive + Chain of Thought = Recurrent: Recurrence's Role in Language Models' Computability and a Revisit of Recurrent Transformer

Xiang Zhang; Muhammad Abdul-Mageed; Laks V.S. Lakshmanan

arXiv:2409.09239·cs.CL·September 24, 2024

Open Access

TL;DR

This paper explores how recurrence enhances language models' reasoning and computational abilities, analyzing the role of autoregression and Chain of Thought prompting in bridging the gap between Transformers and recurrent models.

Contribution

It introduces the concept of recurrence-completeness, investigates the role of recurrence in reasoning, and revisits recurrent Transformer designs to identify their computational limitations.

Findings

01

Recurrent structures improve reasoning and computational power in language models.

02

Chain of Thought prompting can mimic recurrence, enhancing model capabilities.

03

Certain recurrent Transformer models like Linear Transformer face fundamental limitations.
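
As an illustration of finding 02, the following is a minimal Python sketch, not taken from the paper, of how an autoregressive loop with written-out intermediate steps can stand in for a recurrent hidden state. It uses counting, one of the tasks the abstract names. All names here (`rnn_count`, `step`, `cot_count`, and the `|` separator) are hypothetical, and `step` is a toy stand-in for one stateless forward pass, not a real Transformer.

```python
from typing import List

SEP = "|"  # hypothetical separator between the input and the emitted steps


def rnn_count(tokens: List[str], target: str) -> int:
    """Recurrent view: a hidden state h is updated once per input token."""
    h = 0
    for tok in tokens:
        h = h + (1 if tok == target else 0)
    return h


def step(context: List[str], target: str) -> str:
    """Toy stand-in for one stateless forward pass.

    It sees only the context (input + previously emitted tokens) and
    emits one new token. Nothing is remembered between calls.
    Assumes the input tokens never contain SEP.
    """
    sep = context.index(SEP)
    inputs, scratch = context[:sep], context[sep + 1:]
    prev = int(scratch[-1]) if scratch else 0  # state read back from text
    current = inputs[len(scratch)]             # next unprocessed input token
    return str(prev + (1 if current == target else 0))


def cot_count(tokens: List[str], target: str) -> int:
    """Autoregressive view: the running count is written into the context."""
    context = list(tokens) + [SEP]
    for _ in tokens:
        context.append(step(context, target))  # emitted token is fed back in
    return int(context[-1]) if tokens else 0


if __name__ == "__main__":
    seq = list("abaab")
    print(rnn_count(seq, "a"), cot_count(seq, "a"))
```

In this sketch the two functions compute the same result: `rnn_count` carries the running count in a variable, while `cot_count` carries it in the emitted tokens that each later call reads back. That is the sense in which writing out intermediate steps can play the role of a recurrent state.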

Abstract

The Transformer architecture excels in a variety of language modeling tasks, outperforming traditional neural architectures such as RNN and LSTM. This is partially due to its elimination of recurrent connections, which allows for parallel training and a smoother flow of gradients. However, this move away from recurrent structures places the Transformer model at the lower end of Chomsky's computational hierarchy, imposing limitations on its computational abilities. Consequently, even advanced Transformer-based models face considerable difficulties in tasks like counting, string reversal, and multiplication. These tasks, though seemingly elementary, require a level of computational complexity that exceeds the capabilities of the Transformer architecture. Concurrently, the emergence of "Chain of Thought" (CoT) prompting has enabled Transformer-based language models to tackle tasks that…

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Attention Is All You Need · Tanh Activation · Sigmoid Activation · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Long Short-Term Memory · Layer Normalization · Dropout