Vertical LoRA: Dense Expectation-Maximization Interpretation of Transformers

Zhuolin Fu

arXiv:2406.09315·cs.AI·June 14, 2024·2 cites

TL;DR

This paper interprets Transformers as dense Expectation-Maximization algorithms on Bayesian Nets and introduces VLoRA, a new model design that significantly reduces parameters while maintaining performance.

Contribution

The paper proposes Vertical LoRA (VLoRA), a novel paradigm that combines dense EM interpretation with LoRA decomposition to drastically cut parameters in Transformer models.

Findings

01. VLoRA reduces Transformer parameters dramatically.

02. VLoRA preserves original model performance.

03. VLoRA is compatible with existing LoRA methods.
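
To make the parameter-reduction claim concrete, the following is a rough counting sketch, not the paper's own accounting. It assumes a stack of same-shaped linear layers in which only the first layer stores a full weight matrix and every later layer stores a rank-r pair of factors for its increment. The function names, layer sizes, and rank are hypothetical values chosen for illustration and are not taken from the paper.

```python
def dense_param_count(num_layers: int, d_in: int, d_out: int) -> int:
    """Parameters when every layer stores its own full d_out x d_in matrix."""
    return num_layers * d_in * d_out


def vlora_param_count(num_layers: int, d_in: int, d_out: int, rank: int) -> int:
    """Parameters under the assumed VLoRA-style layout.

    Assumption: layer 0 keeps a full matrix; each of the remaining layers
    keeps two low-rank factors of shapes (d_out, rank) and (rank, d_in).
    """
    base = d_in * d_out
    increments = (num_layers - 1) * rank * (d_in + d_out)
    return base + increments


# Illustrative sizes only (hypothetical, not reported in the paper).
dense = dense_param_count(num_layers=12, d_in=512, d_out=512)
vlora = vlora_param_count(num_layers=12, d_in=512, d_out=512, rank=8)
print(dense, vlora, round(vlora / dense, 3))
```

Under these assumed sizes the low-rank layout stores roughly a ninth of the dense weights. Actual savings depend on the rank, on which sublayers are shared, and on embeddings and other parameters that this count ignores.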

Abstract

In this paper, we show how Transformers can be interpreted as dense Expectation-Maximization algorithms performed on Bayesian Nets. Based on this interpretation, we propose a new model design paradigm, namely Vertical LoRA (VLoRA), which reduces the parameter count dramatically while preserving performance. In VLoRA, a model consists of layers, each of which recursively learns an increment based on the previous layer. We then apply LoRA decomposition to the increments. VLoRA works on the base model and is therefore orthogonal to LoRA, meaning the two can be used together. We run experiments on various tasks and models. The results show that 1) with VLoRA, the Transformer model parameter count can be reduced dramatically and 2) the performance of the original model is preserved. The source code is available at https://github.com/neverUseThisName/vlora.
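
The recursive construction described in the abstract can be sketched as follows. This is one reading of the abstract, not the author's implementation: it assumes each layer's weight equals the previous layer's weight plus a low-rank increment B @ A, and it uses NumPy rather than the repository's PyTorch code. The helper name `build_vlora_weights` and all shapes are hypothetical.

```python
import numpy as np


def build_vlora_weights(base, increments):
    """Materialize per-layer weights from a base matrix and low-rank increments.

    Assumed recursion: W_0 = base, W_l = W_{l-1} + B_l @ A_l,
    where B_l has shape (d_out, rank) and A_l has shape (rank, d_in).
    """
    weights = [base]
    for B, A in increments:
        weights.append(weights[-1] + B @ A)
    return weights


# Small hypothetical sizes so the example runs instantly.
rng = np.random.default_rng(0)
d_out, d_in, rank, num_layers = 6, 4, 2, 3

base = rng.standard_normal((d_out, d_in))
increments = [
    (rng.standard_normal((d_out, rank)), rng.standard_normal((rank, d_in)))
    for _ in range(num_layers - 1)
]

weights = build_vlora_weights(base, increments)
for layer_index, W in enumerate(weights):
    print(layer_index, W.shape)
```

Only `base` and the factor pairs would need to be stored; the full per-layer matrices can be rebuilt on the fly. Under this reading, a conventional LoRA adapter added during fine-tuning would act on the resulting base model, which is consistent with the abstract's statement that the two can be used together.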

Code & Models

Repositories

neverUseThisName/vlora
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications

Methods: Residual Connection · Softmax · Balanced Selection · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention