Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

Sangmin Bae; Adam Fisch; Hrayr Harutyunyan; Ziwei Ji; Seungyeon Kim; Tal Schuster

arXiv:2410.20672·cs.CL·March 3, 2025

TL;DR

This paper introduces Relaxed Recursive Transformers with layer-wise LoRA, enabling effective parameter sharing that reduces model size and cost while maintaining high performance, and proposes a new inference paradigm for efficiency.

Contribution

It presents novel methods for converting pretrained Transformers into Recursive Transformers with flexible parameter sharing using LoRA modules, improving efficiency and performance.

Findings

01 Recursive models outperform similar-sized vanilla models

02 Recursive models recover most of the original model's performance

03 The proposed inference paradigm offers potential 2-3x throughput gains

Abstract

Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our…
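The layer-tying idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: each "layer" here is a single weight matrix with a tanh residual update rather than a full Transformer layer, and all function names (`init_shared_block`, `init_lora`, `forward`, `count_params`) are hypothetical. The LoRA factors use the standard zero initialization for `B`, which is an assumption of this sketch and may differ from how the paper initializes them.

```python
import numpy as np

def init_shared_block(d_model, n_unique, rng):
    # One weight matrix per unique layer; the whole block is reused in every loop.
    return [rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
            for _ in range(n_unique)]

def init_lora(d_model, rank, n_loops, n_unique, rng):
    # One low-rank pair (A, B) per (loop, layer) position, i.e. per depth.
    # Assumption: B starts at zero (standard LoRA), so the relaxed model
    # initially computes exactly the same function as the strictly tied one.
    lora = []
    for _ in range(n_loops):
        per_layer = []
        for _ in range(n_unique):
            A = rng.normal(scale=0.01, size=(rank, d_model))
            B = np.zeros((d_model, rank))
            per_layer.append((A, B))
        lora.append(per_layer)
    return lora

def forward(x, shared, n_loops, lora=None):
    # Repeat the same block of unique layers n_loops times.
    for loop in range(n_loops):
        for i, W in enumerate(shared):
            if lora is not None:
                A, B = lora[loop][i]
                W = W + B @ A  # depth-specific low-rank relaxation of the tied weight
            x = x + np.tanh(x @ W.T)  # toy residual layer, not a real Transformer layer
    return x

def count_params(shared, lora=None):
    # Shared weights are counted once, LoRA factors once per depth.
    n = sum(W.size for W in shared)
    if lora is not None:
        n += sum(A.size + B.size for per_layer in lora for A, B in per_layer)
    return n
```

With `d_model=16`, two unique layers, three loops, and rank 2, the tied block has 512 weights and the LoRA factors add 384, compared with 1536 weights for six untied layers of the same shape. The toy numbers only illustrate why low-rank, depth-specific factors keep the model compact; they say nothing about the paper's reported results.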

Code & Models

Models

🤗 brianling16/relaxed-recursive-transformer · model · 2 downloads

Videos

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA · SlidesLive

Taxonomy

Topics: Advanced Memory and Neural Computing · Indoor and Outdoor Localization Technologies · Blind Source Separation Techniques

Methods: Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam