Random Initialization of Gated Sparse Adapters
Vi Retault, Yohaï-Eliel Berreby

TL;DR
This paper introduces RIGSA, a novel sparse adapter method for fine-tuning language models. It starts from randomly initialized adapters, gates them, and sparsifies them, and it shows reduced forgetting on certain tasks compared to existing methods.
Contribution
RIGSA is a new approach combining random initialization, gating, and iterative pruning for sparse adapters, improving task retention during fine-tuning.
Findings
RIGSA reduces forgetting more than QLoRA on GSM8k.
RIGSA performs comparably to random masking.
RIGSA can learn new tasks from chance performance.
Abstract
When fine-tuning language models on new tasks, catastrophic forgetting -- performance degradation on previously-learned tasks -- is a ubiquitous problem. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA address this through low-rank adapters, sparse adaptation offers an alternative that doesn't impose rank constraints. We introduce Random Initialization of Gated Sparse Adapters (RIGSA), which starts from randomly-initialized full-rank adapters, gates them with a ReZero analog, and sparsifies them with iterative magnitude pruning. We evaluate RIGSA on SmolLM2-1.7B-Instruct using a novel vision-in-text task (Textual MNIST) and measure forgetting on PIQA, HellaSwag, and GSM8k. SmolLM2-1.7B-Instruct initially performs around chance level on Textual MNIST, and is capable of learning the task through RIGSA, 4-bit QLoRA and random masking. In spite of having more trainable…
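The three ingredients named in the abstract (random full-rank initialization, a ReZero-style gate, and iterative magnitude pruning) can be illustrated with a small sketch. This is not the authors' implementation: the formulation `W_eff = W + gate * (mask * A)`, the helper names, the initialization scale, and the 50% per-round pruning fraction are all assumptions chosen for illustration, and the training step between pruning rounds is omitted.

```python
import random

def init_adapter(rows, cols, seed=0, scale=0.02):
    """Randomly initialised dense adapter (scale is an illustrative assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def effective_weight(base, adapter, mask, gate):
    """Assumed form: W_eff = W + gate * (mask * A), applied element-wise."""
    return [
        [w + gate * m * a for w, a, m in zip(w_row, a_row, m_row)]
        for w_row, a_row, m_row in zip(base, adapter, mask)
    ]

def magnitude_prune(adapter, mask, fraction):
    """Mask out the smallest-magnitude `fraction` of the surviving entries."""
    alive = sorted(
        (abs(adapter[i][j]), i, j)
        for i, row in enumerate(mask)
        for j, m in enumerate(row)
        if m
    )
    n_drop = int(len(alive) * fraction)
    new_mask = [row[:] for row in mask]
    for _, i, j in alive[:n_drop]:
        new_mask[i][j] = 0
    return new_mask

def density(mask):
    """Fraction of adapter entries that are still trainable."""
    return sum(sum(row) for row in mask) / sum(len(row) for row in mask)

# Toy 4x4 "layer": the gate starts at zero, as in ReZero, so the
# randomly initialised adapter does not perturb the base weights at first.
base = [[1.0] * 4 for _ in range(4)]
adapter = init_adapter(4, 4)
mask = [[1] * 4 for _ in range(4)]
gate = 0.0
unchanged = effective_weight(base, adapter, mask, gate) == base

# Iterative pruning: in practice the adapter and gate would be trained
# between rounds; here only the mask update is shown.
for _ in range(3):
    mask = magnitude_prune(adapter, mask, 0.5)
final_density = density(mask)
```

With the gate at zero the effective weights equal the base weights, which is why a random (rather than zero) adapter initialization does not disturb the pretrained model at the start of training. Three rounds of halving leave 2 of the 16 toy entries trainable.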
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis
