LoTR: Low Tensor Rank Weight Adaptation
Daniel Bershatsky, Daria Cherniuk, Talgat Daulbaev, Aleksandr Mikhalev, and Ivan Oseledets

TL;DR
LoTR introduces a tensor decomposition-based method for parameter-efficient fine-tuning of large language models, outperforming LoRA especially in deep models by sharing tensor components across layers.
Contribution
It proposes a novel tensor-based low-rank adaptation method for LLMs, enabling more efficient and scalable fine-tuning compared to existing matrix-based approaches.
Findings
LoTR achieves better parameter efficiency than LoRA.
Tensor sharing across layers improves fine-tuning speed.
Core tensor size can be arbitrarily small for efficiency.
Abstract
In this paper we generalize and extend an idea of low-rank adaptation (LoRA) of large language models (LLMs) based on Transformer architecture. Widely used LoRA-like methods of fine-tuning LLMs are based on matrix factorization of gradient update. We introduce LoTR, a novel approach for parameter-efficient fine-tuning of LLMs which represents a gradient update to parameters in a form of tensor decomposition. Low-rank adapter for each layer is constructed as a product of three matrices, and tensor structure arises from sharing left and right multipliers of this product among layers. Simultaneous compression of a sequence of layers with low-rank tensor representation allows LoTR to archive even better parameter efficiency then LoRA especially for deep models. Moreover, the core tensor does not depend on original weight dimension and can be made arbitrary small, which allows for extremely…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques
MethodsAttention Is All You Need · Layer Normalization · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing
