Vertical LoRA: Dense Expectation-Maximization Interpretation of Transformers
Zhuolin Fu

TL;DR
This paper interprets Transformers as dense Expectation-Maximization algorithms on Bayesian Nets and introduces VLoRA, a new model design that significantly reduces parameters while maintaining performance.
Contribution
The paper proposes Vertical LoRA (VLoRA), a novel paradigm that combines dense EM interpretation with LoRA decomposition to drastically cut parameters in Transformer models.
Findings
VLoRA reduces Transformer parameters dramatically.
VLoRA preserves original model performance.
VLoRA is compatible with existing LoRA methods.
Abstract
In this paper, we show how Transformers can be interpreted as dense Expectation-Maximization algorithms performed on Bayesian Nets. Based on this interpretation, we propose a new model design paradigm, Vertical LoRA (VLoRA), which reduces the parameter count dramatically while preserving performance. In VLoRA, a model consists of layers, each of which recursively learns an increment on top of the previous layer. We then apply LoRA decomposition to the increments. VLoRA operates on the base model and is therefore orthogonal to LoRA, so the two can be used together. We run experiments on various tasks and models. The results show that 1) with VLoRA, the Transformer parameter count can be reduced dramatically and 2) the performance of the original model is preserved. The source code is available at https://github.com/neverUseThisName/vlora
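The recursive-increment idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes one plausible reading of the abstract: the first layer stores a full weight matrix, and each later layer is defined as W_l = W_{l-1} + B_l A_l, where B_l and A_l are low-rank factors. The function names, the dimensions, and the initialization are all hypothetical and chosen only to make the parameter arithmetic visible.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the initialization scheme is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def build_vlora_weights(d, rank, num_layers, seed=0):
    """Return (weights, param_count) for a hypothetical stack of d x d layers.

    Layer 0 stores a full d x d matrix. Every later layer stores only the
    low-rank factors B (d x rank) and A (rank x d) and reuses the previous
    layer's weight (assumed form, not taken from the paper):
        W_l = W_{l-1} + B_l @ A_l
    """
    rng = random.Random(seed)
    w = rand_matrix(d, d, rng)
    weights = [w]
    params = d * d                      # full base matrix
    for _ in range(1, num_layers):
        b = rand_matrix(d, rank, rng)
        a = rand_matrix(rank, d, rng)
        w = add(w, matmul(b, a))        # increment on top of the previous layer
        weights.append(w)
        params += 2 * d * rank          # only the two factors are stored
    return weights, params

d, rank, num_layers = 8, 2, 6
weights, vlora_params = build_vlora_weights(d, rank, num_layers)
dense_params = num_layers * d * d       # every layer stores its own d x d matrix
print(dense_params, vlora_params)
```

Under these assumptions the stored parameter count grows as d² + 2dr(L − 1) instead of Ld², so the saving becomes larger as the depth L increases or the rank r shrinks. Whether such a reduction preserves accuracy is an empirical question that the paper's experiments address; the sketch only shows the bookkeeping.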
Taxonomy
Topics: Neural Networks and Applications
Methods: Residual Connection · Softmax · Balanced Selection · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention
