LoTR: Low Tensor Rank Weight Adaptation

Daniel Bershatsky, Daria Cherniuk, Talgat Daulbaev, Aleksandr Mikhalev, and Ivan Oseledets

arXiv:2402.01376 · cs.CL · February 6, 2024 · 1 citation

Open Access

TL;DR

LoTR is a tensor-decomposition-based method for parameter-efficient fine-tuning of large language models. By sharing factor matrices across layers, it achieves better parameter efficiency than LoRA, especially in deep models.

Contribution

The paper proposes a tensor-based low-rank adaptation method for LLMs that allows more parameter-efficient and scalable fine-tuning than existing matrix-factorization approaches such as LoRA.

Findings

01 · LoTR achieves better parameter efficiency than LoRA.

02 · Tensor sharing across layers improves fine-tuning speed.

03 · Core tensor size can be arbitrarily small for efficiency.
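
The parameter-efficiency claim can be illustrated with a rough count. This is a minimal sketch under assumptions that are not stated on this page: every adapted weight is a square d x d matrix, there is one adapted matrix per layer, LoRA stores two d x r factors per layer, and LoTR stores two shared d x r factors plus one r x r core per layer. The function names and the sizes below are hypothetical and chosen only for the arithmetic; they are not results from the paper.

```python
def lora_param_count(d, r, num_layers):
    # Assumption: each layer keeps its own pair of d x r factors.
    return num_layers * 2 * d * r


def lotr_param_count(d, r, num_layers):
    # Assumption: two d x r factors shared by all layers,
    # plus one small r x r core per layer.
    return 2 * d * r + num_layers * r * r


# Illustrative sizes only (hypothetical, not taken from the paper).
d, r, num_layers = 768, 8, 12
print(lora_param_count(d, r, num_layers))  # 147456
print(lotr_param_count(d, r, num_layers))  # 13056
```

Under these assumptions only the r x r cores grow with depth, which is why the gap widens for deeper models. Comparing both methods at the same rank is a simplification, since equal rank does not imply equal accuracy.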

Abstract

In this paper we generalize and extend the idea of low-rank adaptation (LoRA) of large language models (LLMs) based on the Transformer architecture. Widely used LoRA-like methods of fine-tuning LLMs are based on matrix factorization of the gradient update. We introduce LoTR, a novel approach for parameter-efficient fine-tuning of LLMs which represents the gradient update to parameters in the form of a tensor decomposition. The low-rank adapter for each layer is constructed as a product of three matrices, and the tensor structure arises from sharing the left and right multipliers of this product among layers. Simultaneous compression of a sequence of layers with a low-rank tensor representation allows LoTR to achieve even better parameter efficiency than LoRA, especially for deep models. Moreover, the core tensor does not depend on the original weight dimension and can be made arbitrarily small, which allows for extremely…
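
The construction described in the abstract, a per-layer product of three matrices whose left and right multipliers are shared among layers, might be sketched as follows. This is a NumPy illustration, not the authors' implementation: the names `A`, `B`, `cores`, and `lotr_updates`, the square d x d weight shape, the random initialization, and all sizes are assumptions made for the example.

```python
import numpy as np


def lotr_updates(A, cores, B):
    """Return one weight update A @ G @ B.T per layer.

    A and B are shared by every layer; each G in `cores` is a
    small per-layer matrix (hypothetical naming).
    """
    return [A @ G @ B.T for G in cores]


rng = np.random.default_rng(0)
d, r, num_layers = 64, 4, 6  # illustrative sizes, not from the paper

A = rng.standard_normal((d, r))  # shared left multiplier
B = rng.standard_normal((d, r))  # shared right multiplier
# One r x r core per layer; stacked, they form an r x r x num_layers tensor.
cores = [rng.standard_normal((r, r)) for _ in range(num_layers)]

deltas = lotr_updates(A, cores, B)
print(len(deltas), deltas[0].shape)
```

Each update has rank at most r, and each core's shape depends only on r, not on d, which matches the abstract's remark that the core tensor does not depend on the original weight dimension.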


Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Layer Normalization · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing