Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun; Li Dong; Shaohan Huang; Shuming Ma; Yuqing Xia; Jilong Xue; Jianyong Wang; Furu Wei

arXiv:2307.08621 · cs.CL · August 10, 2023 · 107 cites

TL;DR

Retentive Network (RetNet) introduces a novel architecture for large language models that combines parallel training, low-cost inference, and efficient long-sequence modeling, serving as a promising successor to Transformers.

Contribution

RetNet presents a new sequence modeling mechanism with three computation paradigms, enabling efficient training and inference while maintaining strong performance.

Findings

01

RetNet achieves favorable scaling in language modeling.

02

It enables low-cost, high-throughput inference.

03

The architecture supports efficient long-sequence processing.

Abstract

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show…
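The three computation paradigms named in the abstract can be illustrated with a small numerical sketch. The code below is a simplified, single-head version written for this summary, not the authors' implementation: it assumes a single scalar decay `gamma`, takes already-projected `q`, `k`, `v` arrays, and leaves out any positional rotation, normalization, gating, or multi-head handling that a full model would use. The function names and the `chunk` parameter are hypothetical.

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Parallel form: (Q K^T * D) V, with D[n, m] = gamma**(n - m) for n >= m, else 0."""
    idx = np.arange(q.shape[0])
    diff = idx[:, None] - idx[None, :]
    # Causal decay mask; future positions (diff < 0) get weight 0.
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * decay) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_t = gamma * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    state = np.zeros((k.shape[1], v.shape[1]))  # fixed size, independent of sequence length
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        state = gamma * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

def retention_chunkwise(q, k, v, gamma, chunk):
    """Chunkwise form: parallel inside each chunk, recurrent state across chunks."""
    state = np.zeros((k.shape[1], v.shape[1]))
    out = np.empty((q.shape[0], v.shape[1]))
    for start in range(0, q.shape[0], chunk):
        end = min(start + chunk, q.shape[0])
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        b = end - start
        j = np.arange(b)
        inner = retention_parallel(qc, kc, vc, gamma)          # within-chunk contribution
        cross = (qc @ state) * (gamma ** (j + 1))[:, None]     # contribution of earlier chunks
        out[start:end] = inner + cross
        # Carry the decayed summary of everything seen so far into the next chunk.
        state = gamma ** b * state + kc.T @ (vc * (gamma ** (b - 1 - j))[:, None])
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((12, 4)) for _ in range(3))
ref = retention_parallel(q, k, v, gamma=0.9)
print(np.allclose(ref, retention_recurrent(q, k, v, gamma=0.9)))
print(np.allclose(ref, retention_chunkwise(q, k, v, gamma=0.9, chunk=5)))
```

Under these assumptions the three functions compute the same outputs up to floating-point error. In the recurrent form, the only thing carried between steps is `state`, whose shape does not grow with sequence length; this is the sense in which per-token inference cost is constant in the sequence length.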

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 5

Strengths

The experimental results of this approach are compelling. It appears that as model parameters increase beyond 2B, retentive networks outperform Transformers on language modelling according to perplexity. Inference no longer requires an additional KV cache, allowing for O(1) inference cost. RetNet allows for O(N) long-sequence memory complexity by accumulating into a buffer.

Weaknesses

Novelty: this is essentially a transformer without the softmax and with an added time decay. It just so happens that, with scale, this difference does not appear to hinder RetNet performance. Clarity: the paper gets pretty dense at times, affecting readability. Also, Figure 2b is hard to understand without the code. This paper lacks a broader impacts section; its addition would strengthen the paper. The code for the work appears to be closed source, given the overwhelmingly positive results.

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 5

Strengths

The RetNet is well-motivated and the equations are clear. The experiments are conducted with relatively large models.

Weaknesses

There are several serious concerns about this paper:
1. Table 1 is misleading. If I understand correctly, the recurrent state $s_n$ in Eq. (1) has shape $d\times d$, where $d$ is the model dimension and is very large in practice (sometimes even larger than $N$). In S4 or other SSMs, the shape of the recurrent hidden state is $h\times d$ with relatively small $h$, e.g. $h=32$. However, in Table 1 the authors claimed the inference cost of RetNet is $O(1)$.
2. Table 3 is unclear. If I un

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 2

Strengths

- The proposed retention mechanism has a dual form of recurrence and parallelism.
- The paper is well written and formalized properly.

Weaknesses

- It is not clear to the reviewer why the two forms in Figure 2 (a) and Figure 2 (b) are equivalent.
- Is there any theoretical proof that retention is more capable than full attention?
- The paper does not make clear why, in Figure 3, RetNet is more effective in the large model regime. According to some prior work [1][2], two model architectures should not cross over when scaling the model up in log scale.
- Results are only provided on classification and summarization tasks, not genera

Code & Models

Datasets

huaXiaKyrie/up
dataset · 19k dl

Videos

Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained) · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization