Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei

TL;DR
Retentive Network (RetNet) introduces a novel architecture for large language models that combines parallel training, low-cost inference, and efficient long-sequence modeling, serving as a promising successor to Transformers.
Contribution
RetNet presents a new sequence modeling mechanism with three computation paradigms, enabling efficient training and inference while maintaining strong performance.
Findings
RetNet achieves favorable scaling in language modeling.
It enables low-cost, high-throughput inference.
The architecture supports efficient long-sequence processing.
Abstract
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show…
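The claim that the parallel and recurrent representations compute the same thing can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the simplified single-head retention formulation, with the parallel form written as $(QK^\top \odot D)V$ where $D_{nm} = \gamma^{n-m}$ for $n \ge m$ and $0$ otherwise, and the recurrent form written as $S_n = \gamma S_{n-1} + K_n^\top V_n$, $o_n = Q_n S_n$. Multi-head decay rates, the position-dependent rotation, scaling, normalization, and gating are all left out. The function names, the value `gamma = 0.9`, and the random inputs are hypothetical choices for illustration.

```python
import numpy as np


def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V with D[n, m] = gamma**(n - m) for n >= m, else 0."""
    N = Q.shape[0]
    idx = np.arange(N)
    diff = idx[:, None] - idx[None, :]  # n - m for every (query, key) pair
    # Causal decay mask; the exponent is clipped so the masked upper triangle stays finite.
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * D) @ V


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, o_n = Q_n S_n."""
    N, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))  # fixed-size state, independent of sequence length N
    out = np.empty((N, d_v))
    for n in range(N):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out


# Hypothetical toy inputs: 6 tokens, dimension 4.
rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
gamma = 0.9  # assumed scalar decay rate

out_parallel = retention_parallel(Q, K, V, gamma)
out_recurrent = retention_recurrent(Q, K, V, gamma)
print(np.allclose(out_parallel, out_recurrent))
```

Unrolling the recurrence gives $S_n = \sum_{m \le n} \gamma^{n-m} K_m^\top V_m$, so $Q_n S_n$ is exactly row $n$ of the masked matrix product. In this sketch the state `S` has shape `(d_k, d_v)` and does not grow with the number of tokens processed.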
Peer Reviews
Decision · Submitted to ICLR 2024
The experimental results of this approach are compelling. It appears that as model parameters increase beyond 2B, retentive networks outperform Transformers on language modelling according to perplexity. Inference no longer requires an additional KV cache, allowing for O(1) inference cost. RetNet allows for O(N) long-sequence memory complexity by accumulating into a buffer.
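The "accumulating into a buffer" idea can be illustrated with the chunkwise recurrent form: tokens inside a chunk are processed in parallel, while a fixed-size state carries a decayed summary of all earlier chunks. This is a hedged sketch under the same simplifying assumptions as the single-head formulation $S_n = \gamma S_{n-1} + K_n^\top V_n$, $o_n = Q_n S_n$ (scalar decay, no rotation, normalization, or gating). The function names, `chunk_size`, and `gamma` are hypothetical, and the token-by-token function is included only as a reference to compare against.

```python
import numpy as np


def retention_recurrent(Q, K, V, gamma):
    """Token-by-token reference: S_n = gamma * S_{n-1} + K_n^T V_n, o_n = Q_n S_n."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out


def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Parallel inside each chunk, recurrent across chunks via a carried state R."""
    N, d_v = Q.shape[0], V.shape[1]
    R = np.zeros((K.shape[1], d_v))  # the buffer: decayed summary of all previous chunks
    out = np.empty((N, d_v))
    for start in range(0, N, chunk_size):
        q, k, v = (M[start:start + chunk_size] for M in (Q, K, V))
        B = q.shape[0]  # the last chunk may be shorter than chunk_size
        j = np.arange(B)
        diff = j[:, None] - j[None, :]
        D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
        inner = ((q @ k.T) * D) @ v                    # within-chunk term, computed in parallel
        cross = (q @ R) * (gamma ** (j + 1))[:, None]  # decayed contribution of earlier chunks
        out[start:start + B] = inner + cross
        # Fold this chunk into the state: decay the old state by gamma**B,
        # then add each key/value pair decayed by its distance to the chunk end.
        R = (gamma ** B) * R + k.T @ (v * (gamma ** (B - 1 - j))[:, None])
    return out


# Hypothetical toy inputs: 10 tokens, dimension 4, chunks of 4 (last chunk has 2 tokens).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
out_chunk = retention_chunkwise(Q, K, V, gamma=0.9, chunk_size=4)
out_ref = retention_recurrent(Q, K, V, gamma=0.9)
print(np.allclose(out_chunk, out_ref))
```

In this sketch the only thing passed between chunks is `R`, whose shape is `(d_k, d_v)` regardless of how many tokens came before, which is the sense in which no growing key-value cache is needed.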
Novelty: this is essentially a Transformer without the softmax and with an added time decay. It just so happens that with scale, it appears that this difference does not hinder RetNet performance. Clarity: The paper gets pretty dense at times, affecting readability. Also, Figure 2b is hard to understand without the code. This paper lacks a broader impacts section; its addition would strengthen the paper. The code for the work appears to be closed source, given the overwhelmingly positive results.
The RetNet is well-motivated and the equations are clear. The experiments are conducted with relatively large models.
There are several serious concerns about this paper: 1. Table 1 is misleading. If I understand correctly, the recurrent state $s_n$ in Eq. (1) is in the shape of $d\times d$, where $d$ is the model dimension and is very large in practice (sometimes even larger than $N$). In S4 or other SSMs, the shape of the recurrent hidden state is $h\times d$ with relatively small $h$, e.g. $h=32$. However, in Table 1 the authors claimed the inference cost of RetNet is $O(1)$. 2. Table 3 is unclear. If I un
- The proposed retention mechanism has a dual form of recurrence and parallelism. - The paper is well written and formalized properly.
- It is not clear to the reviewer why the two forms in Figure 2 (a) and Figure 2 (b) are equivalent. - Is there any theoretical proof that retention is more capable than full attention? - The paper does not make clear why, in Figure 3, RetNet is more effective in the large model regime. According to some prior work [1][2], two model architectures should not cross over when scaling the model up in log scale. - Results are only provided on classification and summarization tasks, not genera
Code & Models
- 🤗 parsee-mizuhashi/retnet · model · ♡ 2
- 🤗 jploski/retnet-mini-shakespeare · model · 15 dl · ♡ 10
- 🤗 wac81/toy_retnet_1.3b_pretrain · model · 6 dl · ♡ 1
- 🤗 wac81/toy_retnet_1.3b · model · 7 dl · ♡ 2
- 🤗 danangwijaya/IndoRetNet-Liputan6 · model · 4 dl
- 🤗 umuthopeyildirim/fin-rwkv-169M · model · 165 dl
- 🤗 umuthopeyildirim/fin-rwkv-1b5 · model · 6 dl
- 🤗 umuthopeyildirim/fin-rwkv-430m · model · 8 dl
- 🤗 NucleusAI/RetNet-410m-XATL · model · 11 dl · ♡ 2
- 🤗 Spiral-AI/Spiral-RetNet-3b-base · model · 7 dl · ♡ 5
Videos
Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained) · YouTube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization
