Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster

TL;DR
This paper introduces Relaxed Recursive Transformers with layer-wise LoRA, enabling effective parameter sharing that reduces model size and cost while maintaining high performance, and proposes a new inference paradigm for efficiency.
Contribution
It presents methods for converting pretrained Transformers into Recursive Transformers that share parameters across layers, and relaxes that sharing with per-layer LoRA modules to recover performance while keeping the model compact.
Findings
Recursive models outperform similar-sized vanilla models
They recover most of the original model's performance
Proposed inference paradigm offers 2-3x throughput gains
Abstract
Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our…
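The looping-plus-LoRA idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a single square linear map stands in for the shared block of Transformer layers, and the class name `RelaxedRecursiveToy`, its parameters, and the zero initialization of the LoRA `B` matrices (the usual LoRA convention, assumed here) are all hypothetical choices made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the scale is arbitrary for this toy."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class RelaxedRecursiveToy:
    """One shared linear map applied num_loops times, with a LoRA delta per loop.

    Hypothetical stand-in: a real model would loop a block of Transformer layers.
    """

    def __init__(self, d_model, num_loops, rank, seed=0):
        rng = random.Random(seed)
        self.d_model, self.num_loops, self.rank = d_model, num_loops, rank
        # The single set of shared weights, reused at every depth.
        self.shared_w = random_matrix(d_model, d_model, rng)
        # Depth-specific low-rank factors: delta_k = B_k @ A_k.
        self.lora_a = [random_matrix(rank, d_model, rng) for _ in range(num_loops)]
        # Assumption: B starts at zero, so the model initially equals the tied one.
        self.lora_b = [[[0.0] * rank for _ in range(d_model)] for _ in range(num_loops)]

    def forward(self, x, use_lora=True):
        for a, b in zip(self.lora_a, self.lora_b):
            update = matvec(self.shared_w, x)  # shared weights at every loop
            if use_lora:
                low_rank = matvec(b, matvec(a, x))  # per-loop relaxation
                update = [u + l for u, l in zip(update, low_rank)]
            x = [xi + ui for xi, ui in zip(x, update)]  # residual connection
        return x

    def num_params(self):
        shared = self.d_model * self.d_model
        lora = self.num_loops * 2 * self.rank * self.d_model
        return shared + lora


model = RelaxedRecursiveToy(d_model=16, num_loops=3, rank=1)
untied_params = model.num_loops * model.d_model * model.d_model
print(model.num_params(), untied_params)  # prints "352 768"
```

In this toy setting, three untied layers would hold 768 weights, while the looped version holds 256 shared weights plus 96 LoRA weights. Calling `model.forward(x, use_lora=False)` gives the strictly tied variant for comparison.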
Taxonomy
Topics: Advanced Memory and Neural Computing · Indoor and Outdoor Localization Technologies · Blind Source Separation Techniques
Methods: Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam
