NOLA: Compressing LoRA using Linear Combination of Random Basis
Soroush Abbasi Koohpayegani, KL Navaneet, Parsa Nooralinejad, Soheil Kolouri, Hamed Pirsiavash

TL;DR
NOLA compresses LoRA by re-parameterizing its low-rank matrices as linear combinations of random basis matrices, enabling significant parameter reduction without sacrificing model performance across various tasks.
Contribution
NOLA overcomes the rank-one lower bound in LoRA by re-parameterizing the low-rank matrices as linear combinations of random basis matrices, which decouples the parameter count from both the rank and the model architecture.
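To make the rank-one lower bound concrete, the following is a minimal counting sketch. It assumes a single m x n weight matrix whose LoRA update is the product of an m x r factor and an r x n factor, and a NOLA-style update whose only trainable values are the k and l mixing coefficients of the two factors. The function names and the layer shape are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def lora_param_count(m: int, n: int, rank: int) -> int:
    """Trainable parameters of a LoRA update for an m x n weight.

    The update is A @ B with A of shape (m, rank) and B of shape (rank, n),
    so the count grows with the layer shape and cannot drop below rank = 1.
    """
    return rank * (m + n)


def nola_param_count(k: int, l: int) -> int:
    """Trainable parameters when A and B are linear combinations of
    k and l frozen random basis matrices (assumed setup): only the
    mixing coefficients are trained, independent of m, n, and rank.
    """
    return k + l


# Illustrative layer shape (an assumption, not a figure from the paper).
m, n = 4096, 4096

lora_rank_one = lora_param_count(m, n, rank=1)   # smallest LoRA can go
lora_rank_eight = lora_param_count(m, n, rank=8)
nola_small = nola_param_count(k=64, l=64)        # chosen freely

print(lora_rank_one, lora_rank_eight, nola_small)
```

In this sketch the LoRA count is tied to m + n even at rank one, while the coefficient count is a free choice, which is the decoupling the contribution statement describes.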
Findings
NOLA achieves comparable performance to LoRA with fewer parameters.
On LLaMA-2 70B, NOLA is nearly 20 times more compact than the most compressed LoRA.
NOLA maintains accuracy across language and vision tasks.
Abstract
Fine-tuning Large Language Models (LLMs) and storing them for each downstream task or domain is impractical because of the massive model size (e.g., 350GB in GPT-3). Current literature, such as LoRA, showcases the potential of low-rank modifications to the original weights of an LLM, enabling efficient adaptation and storage for task-specific models. These methods can reduce the number of parameters needed to fine-tune an LLM by several orders of magnitude. Yet, these methods face two primary limitations: (1) the parameter count is lower-bounded by the rank one decomposition, and (2) the extent of reduction is heavily influenced by both the model architecture and the chosen rank. We introduce NOLA, which overcomes the rank one lower bound present in LoRA. It achieves this by re-parameterizing the low-rank matrices in LoRA using linear combinations of randomly generated matrices (basis)…
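The re-parameterization described in the abstract can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the basis matrices are drawn from a standard normal distribution and regenerated from an integer seed, and the names `random_bases` and `nola_delta` are hypothetical. Only the coefficient vectors `alpha` and `beta` would be trained and stored.

```python
import numpy as np


def random_bases(seed: int, count: int, shape: tuple) -> np.ndarray:
    """Regenerate `count` frozen random basis matrices from a seed.

    Assumption: Gaussian entries; the bases are never trained or stored.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((count, *shape))


def nola_delta(alpha, beta, seed_a, seed_b, m, n, rank):
    """Build a low-rank weight update from mixing coefficients only."""
    a_bases = random_bases(seed_a, len(alpha), (m, rank))  # (k, m, rank)
    b_bases = random_bases(seed_b, len(beta), (rank, n))   # (l, rank, n)
    # Each low-rank factor is a linear combination of its random bases.
    a = np.tensordot(alpha, a_bases, axes=1)  # (m, rank)
    b = np.tensordot(beta, b_bases, axes=1)   # (rank, n)
    return a @ b                              # (m, n), rank <= `rank`


# Small illustrative sizes (assumptions, not values from the paper).
m, n, rank, k, l = 16, 12, 4, 8, 8
alpha = np.full(k, 0.1)  # the only trainable values, together with beta
beta = np.full(l, 0.1)

delta = nola_delta(alpha, beta, seed_a=0, seed_b=1, m=m, n=n, rank=rank)
print(delta.shape, alpha.size + beta.size)
```

Because the bases can be rebuilt from the seeds, a task-specific adapter in this sketch reduces to the two seeds plus k + l coefficients, whatever the values of m, n, and rank.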
Peer Reviews
Decision: ICLR 2024 poster
**Leveraging Ideas from Other Papers for Enhanced Parameter Efficiency:** This paper skillfully incorporates concepts from existing research to optimize parameter efficiency. **Achieving Comparable or Superior Performance to LoRA:** This research attains performance levels akin to LoRA while significantly enhancing parameter efficiency.
**Poorly presented results:** The main issue in the presentation of the results lies in their lack of clarity and explanatory depth. Firstly, the results fail to offer any substantial insights into how the method operates, leaving readers without a clear understanding of the underlying mechanisms. Additionally, Tables 1 and 5 are presented as mere lists of numbers without the necessary context or explanation, making it challenging for the audience to derive meaningful conclusions from the data.
1. The authors propose a novel, intuitive, and principled approach to address the problem of task-based fine-tuning of transformer-based models. 2. The proposed approach shows a significant reduction in storage overhead without compromising on accuracy across a range of experiments in both language and vision tasks.
1. The technical novelty is relatively minor, with the overall idea being a combination of the prior works PRANC and LoRA. While this seems enough to provide empirical improvement, the approach itself is not that big of an innovation over prior works. 2. While the prior approach PRANC is directly modified by the authors in this work, there are no direct comparisons with it in either the language or vision tasks used to evaluate the proposed approach. There is a comparison of training loss in Section
1. The paper discusses related works in detail and clearly summarizes its own contributions. 2. The paper performs extensive experiments to compare NOLA and existing PEFT solutions. 3. NOLA decouples trainable parameters from the choice of rank and the network architecture.
1. The work may need more rationales upfront to motivate the problems (i.e. the rank one lower bound present in LoRA). Given that mainstream GPUs have tens of GB of memory, it is reasonable to reduce the memory requirements from tens of GB to tens of MB at the expense of model quality through LoRA, as this can indeed reduce resource consumption and greatly reduce LLM transition overhead during inference. However, I don't think it makes much sense to further reduce memory requirements to several
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Multi-Head Attention · Cosine Annealing · Linear Warmup With Cosine Annealing · Layer Normalization · Softmax · Byte Pair Encoding · Discriminative Fine-Tuning · Dropout · Attention Dropout
