Exploring Efficient Learning of Small BERT Networks with LoRA and DoRA
Daniel Frees, Aditri Bhagirath, Moritz Bolling

TL;DR
This paper evaluates the efficiency and performance of LoRA and DoRA on a small BERT model, minBERT. It reports significant training speedups, finds that gradient updates are low-rank even at this scale, and applies both methods in a multitask learning setting.
Contribution
It applies LoRA and DoRA to a small model, benchmarks their efficiency and performance, and shows that gradient updates have low-rank structure even in compact models.
Findings
Well-chosen LoRA and DoRA configurations, combined with Automatic Mixed Precision (AMP), improve training efficiency.
The low-rank assumption on gradient updates holds even for small models.
Multitask minBERT achieves competitive performance across tasks.
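To make the efficiency argument concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning, LoRA, and DoRA. The 768 x 768 shape and the rank of 8 are illustrative assumptions, not values reported in this summary, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA trains B (d_out x r) and A (r x d_in); the base weight is frozen."""
    return r * (d_out + d_in)


def dora_params(d_out: int, d_in: int, r: int) -> int:
    """DoRA adds one learnable magnitude per column on top of the LoRA factors."""
    return lora_params(d_out, d_in, r) + d_in


# Assumed shape and rank, chosen only for illustration.
d_out, d_in, r = 768, 768, 8
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, r)
dora = dora_params(d_out, d_in, r)
print(f"full={full}, lora={lora}, dora={dora}, lora/full={lora / full:.4f}")
```

Under these assumed values, the low-rank factors account for roughly 2% of the parameters of the full matrix, which is where the memory and speed savings come from.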
Abstract
While Large Language Models (LLMs) have revolutionized artificial intelligence, fine-tuning LLMs is extraordinarily computationally expensive, preventing smaller businesses and research teams with limited GPU resources from engaging with new research. Hu et al. and Liu et al. introduce Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA) as highly efficient and performant solutions to the computational challenges of LLM fine-tuning, demonstrating large speedups and memory savings for models such as GPT-3 and RoBERTa. We seek to expand upon the original LoRA and DoRA papers by benchmarking the efficiency and performance of LoRA and DoRA when applied to a much smaller language model: our case study here is the compact minBERT model. Our findings reveal that optimal custom configurations of LoRA and DoRA, coupled with Automatic Mixed Precision (AMP),…
