Study of Training Dynamics for Memory-Constrained Fine-Tuning
Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione

TL;DR
This paper introduces TraDy, a memory-efficient transfer learning method that dynamically selects channels during training, enabling large neural networks to be fine-tuned under strict memory constraints with minimal performance loss.
Contribution
TraDy is a novel dynamic channel selection scheme that improves gradient approximation and reduces memory usage in fine-tuning large models, outperforming static methods.
Findings
Achieves up to 99% activation sparsity and 95% weight derivative sparsity.
Reduces FLOPs for weight derivative computation by 97%.
Demonstrates state-of-the-art performance across various tasks and architectures.
Abstract
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
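The resampling idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify the sampling distribution or how memory is accounted for, so the uniform shuffle, the per-channel cost model, the budget units, and all names (`resample_channels`, `LAYERS`, `COSTS`) are assumptions made purely for illustration.

```python
import random


def resample_channels(layer_channels, cost_per_channel, budget, rng):
    """Draw a random set of channels to update, subject to a memory budget.

    layer_channels:   dict, preselected layer name -> number of channels
    cost_per_channel: dict, layer name -> assumed memory cost of one channel
    budget:           total memory allowed, in the same (hypothetical) units
    """
    # Every (layer, channel) pair in the preselected layers is a candidate.
    candidates = [
        (layer, ch)
        for layer, n_channels in layer_channels.items()
        for ch in range(n_channels)
    ]
    rng.shuffle(candidates)  # assumption: uniform random order

    selected = {layer: [] for layer in layer_channels}
    used = 0
    for layer, ch in candidates:
        cost = cost_per_channel[layer]
        if used + cost <= budget:  # keep a channel only if it still fits
            selected[layer].append(ch)
            used += cost
    for channels in selected.values():
        channels.sort()
    return selected, used


# Hypothetical preselected layers and per-channel costs.
LAYERS = {"block3.conv": 8, "block4.conv": 16}
COSTS = {"block3.conv": 4, "block4.conv": 2}

rng = random.Random(0)
for epoch in range(3):
    # A fresh channel subset is drawn at the start of each epoch.
    selected, used = resample_channels(LAYERS, COSTS, budget=24, rng=rng)
    print(epoch, used, selected)
```

In a real training loop, only the selected channels would have their activations stored and their weight gradients computed during that epoch; the sketch covers the selection step alone.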
Peer Reviews
Decision·ICLR 2026 Poster
1. **Strong Theoretical and Experimental Grounding:** The paper's primary strength lies in its thorough justification. The authors do not just propose a method but provide a robust analysis of *why* it should work. The experimental validation of the three key insights (heavy-tailed gradients, task-invariant layer ranks, task-dependent channel ranks) is convincing and provides a solid foundation for the proposed hybrid design.
2. **High Efficiency and Strong Performance:** The method achieves
1. **Evaluation on Simple Tasks:** The empirical evaluation, while broad in terms of datasets (CIFAR-10/100, CUB, Flowers, etc.), is primarily limited to relatively simple, small-scale classification tasks. To truly validate the robustness and scalability of TraDy, an evaluation on more complex, large-scale benchmarks (e.g., ImageNet-1K) is necessary.
2. **Missing Comparison to Key PEFT Methods:** The paper's related work and experimental comparisons focus almost exclusively on *sparse update
- The analysis of the stochastic gradients' heavy-tailed behaviour during fine-tuning of a pre-trained network, relative importance of the network layers, consistency across downstream tasks and channel importance distribution is rigorously presented and well explained.
- The paper currently lacks a comparison with the state-of-the-art fine-tuning methods cited in Section 2 (i.e., Lin et al. (2022), Kwon et al. (2024) and Quélennec et al. (2024)).
- TraDy's performance is reported in the main paper only for CNN models.
- The plots in Fig. 3, Fig. 4 and Fig. 6 are hard to read and hard to relate to the main contributions of TraDy.
1. This paper is grounded in a solid theoretical foundation, featuring detailed derivations and rigorous argumentation. 2. This article is well-written and well-organized.
1. Fig. 2 is positioned too far from the corresponding text section. It is recommended to optimize the image layout.
2. The set of baselines compared in this paper is limited: only SU is included.
3. It is recommended to highlight the best-performing results in Tables 1-5.
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
