GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters

Anand Choudhary; Yasser Sulaïman; Lukas Mauch; Ghouthi Boukli Hacene; Fabien Cardinaux; Antoine Bosselut

arXiv:2510.19778 · cs.LG · October 23, 2025

Open Access · 3 Reviews

TL;DR

GaLLoP is a sparse fine-tuning method for large language models that updates only the parameters with the largest gradient magnitudes on the downstream task and the smallest pre-trained magnitudes, improving task performance and stability.

Contribution

Introduces GaLLoP, a gradient-based sparse fine-tuning technique that prioritizes task-relevant parameters with minimal disruption to pre-trained knowledge.

Findings

01. GaLLoP outperforms or matches other leading parameter-efficient fine-tuning methods with LLaMA3 8B and Gemma 2B as base models.

02. GaLLoP reduces catastrophic forgetting and memorization of task data.

03. GaLLoP demonstrates stable performance across different random seeds.

Abstract

Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters which have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant, but minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3 8B and Gemma 2B as base models shows that GaLLoP consistently improves or matches the in-distribution as well as out-of-distribution performance obtained via the usage of other leading parameter-efficient fine-tuning techniques,…
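The selection rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not state how the two criteria are combined, so the ratio score `|grad| / (|weight| + eps)`, the function names `gallop_style_mask` and `sparse_sgd_step`, and the `density` and `eps` parameters are all assumptions made for this example.

```python
import numpy as np


def gallop_style_mask(grad, weights, density=0.01, eps=1e-8):
    """Return a boolean mask over the top `density` fraction of parameters.

    ASSUMPTION: the score |grad| / (|weights| + eps) is a hypothetical way to
    favor large gradient magnitude and small pre-trained magnitude together.
    """
    score = np.abs(grad) / (np.abs(weights) + eps)
    # Number of parameters to tune; always keep at least one.
    k = max(1, int(round(density * score.size)))
    # Indices of the k highest-scoring entries (unordered within the top k).
    top = np.argpartition(score.ravel(), -k)[-k:]
    mask = np.zeros(score.size, dtype=bool)
    mask[top] = True
    return mask.reshape(score.shape)


def sparse_sgd_step(weights, grad, mask, lr=1e-2):
    """Plain SGD step applied only where `mask` is True."""
    # Unselected parameters keep their pre-trained values.
    return weights - lr * grad * mask


if __name__ == "__main__":
    pretrained = np.array([1.0, 0.1, 2.0, 0.05])
    task_grad = np.array([0.5, 0.5, 0.1, 0.001])
    mask = gallop_style_mask(task_grad, pretrained, density=0.25)
    print(mask)
    print(sparse_sgd_step(pretrained, task_grad, mask, lr=0.01))
```

In the toy example, the second parameter is the only one selected: it shares the largest gradient magnitude with the first parameter but has a much smaller pre-trained magnitude. The fourth parameter has the smallest pre-trained magnitude, yet its gradient is too small for it to be chosen.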

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper is clearly written.

Weaknesses

(A) Comparisons: Mitigating catastrophic forgetting is a rich field, and this paper is missing comparisons against several state-of-the-art methods, for example: (1) https://arxiv.org/abs/1612.00796 (a classic method that also works very well on modern LLMs) (2) https://arxiv.org/abs/2407.20999 (another method that selects individual parameters to update, based on a different logic, instead of updating all parameters) (3) https://icml.cc/virtual/2025/poster/46655 (more recent paper and method, c

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Simple and straightforward design: The proposed scoring criterion is conceptually clear and easy to implement without architectural modification or additional modules.
2. Parameter-efficient fine-tuning: Only a small fraction of parameters (under roughly 2%) is updated, yielding memory-efficient adaptation with no extra trainable layers.
3. Empirical stability: The method shows zero collapse or forgetting ratios across multiple seeds and densities, indicating robustness to optimization noise.

Weaknesses

1. Lack of theoretical justification: The claim that gradient magnitude directly represents parameter sensitivity is overstated. Without a connection to curvature-based measures such as the Fisher Information Matrix or the Hessian spectrum, it is unclear whether $\|g\|$ accurately captures parameter importance.
2. Unverified assumption about parameter magnitude: The interpretation that low-magnitude parameters encode less critical or “more adjustable” knowledge is speculative and not theoreti

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1) The paper proposes a simple, intuitive, yet effective strategy for sparse fine-tuning.
2) The strategy is well motivated by related experiments.
3) Exhaustive experimental results are presented to demonstrate the efficacy of the proposed technique.
4) An improvement on out-of-distribution tasks is shown.

Weaknesses

1) The idea is only demonstrated experimentally; theoretical justification is missing. There are a number of viewpoints and observations in the literature about the choice of parameters during fine-tuning. The role of a parameter's magnitude in determining its importance is not unanimously agreed upon. Similarly, the gradient obtained by backpropagating on a small batch of data might be too noisy a signal for scoring parameter importance. LoRA and related low-dimension projection techniq

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis