Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

Ashwinee Panda; Berivan Isik; Xiangyu Qi; Sanmi Koyejo; Tsachy Weissman; Prateek Mittal

arXiv:2406.16797·cs.CL·June 26, 2024

TL;DR

Lottery Ticket Adaptation (LoTA) introduces a sparse, subnetwork-based method for multi-task adaptation of large language models, reducing interference and catastrophic forgetting while outperforming full fine-tuning and LoRA.

Contribution

LoTA is a novel sparse adaptation technique that identifies and optimizes a subnetwork, enabling effective multi-task learning and model merging in LLMs.
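
One way to picture "identifying a subnetwork" is to derive a binary mask from the task vector (fine-tuned weights minus base weights) and keep only its largest-magnitude entries. The sketch below assumes that magnitude-based criterion, which matches how the reviews on this page describe the method, but it is not the authors' code. The function name `calibrate_mask`, the `sparsity` parameter, and the flat-list weight representation are all hypothetical choices made for illustration.

```python
def calibrate_mask(base, finetuned, sparsity):
    """Return a 0/1 mask that keeps the largest-magnitude task-vector entries.

    `sparsity` (hypothetical name) is the fraction of weights to freeze.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    # Task vector: how far each weight moved during the calibration fine-tune.
    delta = [f - b for b, f in zip(base, finetuned)]
    # Number of weights that stay trainable (at least one).
    keep = max(1, round(len(delta) * (1.0 - sparsity)))
    # Rank indices by |delta|, largest first; ties are broken by index.
    ranked = sorted(range(len(delta)), key=lambda i: (-abs(delta[i]), i))
    chosen = set(ranked[:keep])
    return [1 if i in chosen else 0 for i in range(len(delta))]


# Toy weights: only positions 1 and 3 moved substantially.
base = [0.10, -0.20, 0.30, 0.40, -0.50]
finetuned = [0.11, -0.90, 0.30, 1.00, -0.52]
mask = calibrate_mask(base, finetuned, sparsity=0.6)
print(mask)
```

With 60% sparsity on five weights, two positions remain trainable, and they are the two whose values changed most during the calibration run.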

Findings

1. LoTA outperforms full fine-tuning and LoRA on various tasks.

2. LoTA maintains performance after training on multiple tasks.

3. LoTA enables merging of models trained on dissimilar tasks.
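
To see why sparse updates might merge more cleanly than dense ones, consider the toy sketch below. It assumes merging is done by simply adding each task vector to the base weights; this page does not state the paper's exact merging procedure, so treat that as an assumption. The helper names `merge_task_vectors` and `overlap` and the example vectors are hypothetical.

```python
def merge_task_vectors(base, task_vectors):
    """Add every task vector to the base weights (simple summation merge)."""
    merged = list(base)
    for tv in task_vectors:
        merged = [m + d for m, d in zip(merged, tv)]
    return merged


def overlap(tv_a, tv_b):
    """Count positions that both task vectors modify."""
    return sum(1 for a, b in zip(tv_a, tv_b) if a != 0 and b != 0)


base = [1.0, 1.0, 1.0, 1.0]
# Hypothetical sparse task vectors that touch disjoint positions.
tv_math = [0.5, 0.0, 0.0, 0.0]
tv_instruct = [0.0, 0.0, -0.25, 0.0]

merged = merge_task_vectors(base, [tv_math, tv_instruct])
print(merged, overlap(tv_math, tv_instruct))
```

When the two vectors share no nonzero positions, each task's update survives the merge unchanged. Dense task vectors would overlap at every position, so their updates would be summed everywhere.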

Abstract

Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over lottery tickets (or sparse…
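
The phrase "optimizes only a sparse subnetwork" can be illustrated with masked gradient descent: gradients are zeroed outside the mask, so frozen weights stay at their base values. The sketch below is a minimal toy, not the authors' implementation. The quadratic loss is a stand-in for a real task loss, and `sparse_finetune`, `lr`, and `steps` are hypothetical names.

```python
def sparse_finetune(base, mask, target, lr=0.1, steps=200):
    """Minimize 0.5 * sum((w - target)^2), updating only masked weights."""
    w = list(base)
    for _ in range(steps):
        # Gradient of the toy quadratic loss with respect to each weight.
        grad = [wi - ti for wi, ti in zip(w, target)]
        # Multiply by the mask so weights with mask 0 are never updated.
        w = [wi - lr * g * m for wi, g, m in zip(w, grad, mask)]
    return w


base = [0.0, 0.0, 0.0, 0.0]
mask = [1, 0, 0, 1]  # only the first and last weights are trainable
target = [1.0, 2.0, 3.0, 4.0]
adapted = sparse_finetune(base, mask, target)
print(adapted)
```

The trainable weights converge toward their targets while the masked-out weights remain at the base values, which is the property that would let a later task reuse or preserve them.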

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The paper is well-written and the proposed method is clearly presented.
2. The proposed method is evaluated on a comprehensive set of settings, including sequential training and model merging which common PEFT papers do not usually evaluate. This adds another layer of strength and depth to the paper.
3. The proposed method shows empirically strong results over full fine-tuning and LoRA across all the aforementioned settings.

Weaknesses

1. My main concern is the technical novelty of the proposed method, particularly LoTA. [1] (and [2], the follow-up work by the same authors that scaled up to the latest LLMs) proposed a very similar method called "Lottery Ticket Sparse Fine-Tuning." By comparing Section 3.1 of [1] and Algorithm 1 of this paper, I find the only difference to be that [1] applies an L1 regularization for the sparse adaptation training part whereas this paper does not. To the best of my knowledge, the second algorit

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The work leverages the sparsity of model weight updates to provide a method for multi-task adaptation. It has similarities to post-hoc sparsification but allows these sparse weight deltas to be trained after mask calibration, which differentiates it from other methods. Even if it may not have high practical utility in its current form due to high computational requirements compared to PEFT methods, the idea is interesting and may lead to significant improvements in multi-task adaptation approaches.

Weaknesses

The idea is generally good, but a number of issues prevent me from recommending acceptance of the paper and need to be resolved.
- Regarding catastrophic forgetting with LoTTO, there is a lack of comprehensive experiments across different task combinations to prove the robustness of the method: the evaluation was limited to forgetting of GSM8k (Table 4) and Instruction Tuning (Tables 5 and 6).
- The experiments for sequential training are limited to only two tasks/LoTAs and don

Reviewer 03 · Rating 3 · Confidence 4

Strengths

This paper introduces a new method called Lottery Ticket Adaptation (LoTA). LoTA works by identifying and optimizing sparse task vectors, which means it only tweaks a small, crucial part of the model. This selective tuning helps the model remember its core skills while learning new tasks, thus preventing the common problem of catastrophic forgetting. The authors thoroughly test their method on various challenging tasks, such as following instructions, reasoning, solving math problems, and summa

Weaknesses

See Questions Section

Reviewer 04 · Rating 6 · Confidence 4

Strengths

- The LoTA method is very simple but effective, successfully solving the problems of previous methods.
- Empirical results, especially in the model merging and sequential training settings.
- The paper is easy to follow.

Weaknesses

- The paper only uses magnitude-based pruning methods.
- It would be fairer to compare FFT and LoTA if FFT were trained for as many epochs as LoTA's first and third phases combined.
- The additional phase may introduce additional computation cost (although not very much).
- Compared to LoRA and other PEFT methods, LoTA requires more GPU memory during training.

Code & Models

Repositories

kiddyboots216/lottery-ticket-adaptation
pytorch · Official

Taxonomy

Topics: Auction Theory and Applications · Digital Rights Management and Security