SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

Kaiye Zhou; Shucheng Wang; Jun Xu

arXiv:2406.06564·cs.LG·January 3, 2025·1 cites

Open Access · 3 Reviews

TL;DR

SwitchLoRA is a novel training method that incrementally replaces low-rank parameters to better approximate full-rank training, improving language model accuracy and efficiency.

Contribution

It introduces a switching mechanism for low-rank adaptation that enhances accuracy and efficiency, surpassing full-rank training in language model pre-training.

Findings

01

Reduces perplexity from 15.23 to 15.01 on LLaMA 1.3B.

02

Cuts communication overhead by 54%.

03

Achieves 1% higher accuracy on GLUE after fine-tuning.

Abstract

In the training of large language models, parameter-efficient techniques such as LoRA reduce memory usage and communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because the premature implementation of low-rank training significantly reduces model accuracy. Existing methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace. However, they still fall short of achieving the accuracy of full-rank training. Specifically, ReLoRA restricts the frequency of updates to preserve optimizer state consistency, hindering its ability to closely approximate full-rank training behavior. Meanwhile, GaLore relies on Singular Value Decomposition (SVD) to approximate the full-rank space, which introduces accuracy loss…
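To make the idea of replacing low-rank parameters concrete, here is a minimal NumPy sketch of swapping one column of a LoRA factor for a candidate vector. It is an illustration under stated assumptions, not the authors' implementation: the function name `switch_column`, the matrix shapes, and in particular the compensation step (folding the difference into the frozen weight so the effective weight `W + B @ A` is unchanged) are assumptions that the truncated abstract does not spell out. Optimizer-state handling, which the reviews mention, is omitted.

```python
import numpy as np

def switch_column(W, B, A, k, candidate):
    """Swap column k of B for `candidate` and compensate in W.

    Assumption (not stated in the abstract): because W + B @ A is linear
    in B[:, k], adding outer(old - candidate, A[k, :]) to W keeps the
    effective weight exactly the same after the swap.
    """
    old = B[:, k].copy()
    W_new = W + np.outer(old - candidate, A[k, :])
    B_new = B.copy()
    B_new[:, k] = candidate
    return W_new, B_new

# Hypothetical small shapes, chosen only for illustration.
rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
W = rng.normal(size=(m, n))   # frozen full-rank weight
B = rng.normal(size=(m, r))   # low-rank factor (m x r)
A = rng.normal(size=(r, n))   # low-rank factor (r x n)
candidate = rng.normal(size=m)

before = W + B @ A
W_new, B_new = switch_column(W, B, A, k=0, candidate=candidate)
after = W_new + B_new @ A

# Largest absolute change in the effective weight caused by the swap.
max_diff = float(np.max(np.abs(after - before)))
```

With this compensation the swap changes which direction is trainable without changing the function the layer computes at the moment of switching; only subsequent gradient steps move the effective weight.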

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- In empirical tests, SwitchLoRA demonstrates performance improvements, achieving lower perplexity than full-rank training, especially on the LLaMA 1.3B model.
- Despite frequent updates, SwitchLoRA keeps computational and memory overhead low by using pre-trained candidate vectors.
- When fine-tuned on GLUE tasks, SwitchLoRA shows a slight improvement in accuracy over full-rank models, indicating enhanced generalization.

Weaknesses

- Dynamic parameter adjustment impedes scalability for very large models or resource-limited environments, due to the additional overhead and computational cost of the scaling factors.
- Broader applicability is limited, since the paper primarily evaluates SwitchLoRA on language tasks, leaving its performance and adaptability in other domains untested.
- SwitchLoRA assumes that task-appropriate configurations can be achieved simply by adjusting scaling factors on existing model parameters. While effecti

Reviewer 02 · Rating 5 · Confidence 3

Strengths

- The overall switching methodology (selecting candidate vectors to reset the optimizer states) is novel and enables the use of high switching frequencies.
- The evaluations and experiments against LoRA and full-rank training are extensive and clearly show the benefits of SwitchLoRA over both.
- The proposed method maintains performance relative to full-rank training while reducing the number of trainable parameters to 50-60% of that of full-rank training, with minimal communication overhead.

Weaknesses

- The paper claims that high intervals between reset/update steps in ReLoRA and GaLore are needed to avoid inconsistency in optimizer states, which otherwise wouldn't approximate full-rank training well. SwitchLoRA, on the other hand, uses a default highest switching frequency of 40, which then decays exponentially. GaLore reports that this frequency is close to optimal and does not cause issues for them. The core motivation presented in the paper is GaLore's inability to handle high switching f

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The technique appears to offer significant gains over previous approaches. It achieves similar levels of accuracy to full-rank training with only 50-60% of the trainable parameters. The idea seems to make intuitive sense.

Weaknesses

It is currently a little unclear to me whether the approach would scale to larger models. Could you describe the memory and compute implications of training larger models in more detail? Can you extrapolate from your current experiments to give us more confidence in the scalability of the approach?

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: LLaMA