TL;DR
This paper introduces BSFA, a framework that accelerates neural network training by scaling update components differently along the top eigendirections of the loss Hessian (Dom-space) and in the orthogonal complement (Bulk-space), improving both speed and stability.
Contribution
The paper proposes BSFA, a novel plug-and-play method that uses PCA-based subspace estimation and block-wise strategies to enhance training efficiency of large models.
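As a rough illustration of PCA-based subspace estimation, the sketch below recovers the leading directions of a small window of update vectors by power iteration with deflation. Running PCA over recent updates, using it uncentered, the window contents, and the names `top_direction` and `top_k_directions` are all assumptions made for illustration; the summary above does not specify what the PCA is computed on or how the block-wise strategy is applied.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_direction(vectors, iters=200):
    """Leading principal direction of `vectors` (uncentered PCA) via power iteration."""
    dim = len(vectors[0])
    v = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit-norm starting vector
    for _ in range(iters):
        # Apply M = sum_i x_i x_i^T to v without forming M explicitly.
        w = [0.0] * dim
        for x in vectors:
            c = dot(x, v)
            w = [wi + c * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # no variance left to capture
        v = [wi / norm for wi in w]
    return v


def top_k_directions(vectors, k):
    """Top-k directions by deflation: project out each direction once found."""
    residual = [list(x) for x in vectors]
    basis = []
    for _ in range(k):
        v = top_direction(residual)
        basis.append(v)
        residual = [
            [xi - dot(x, v) * vi for xi, vi in zip(x, v)] for x in residual
        ]
    return basis


# Hypothetical window of recent updates that oscillate along the first axis.
history = [
    [3.0, 0.1, 0.0],
    [-2.9, 0.0, 0.1],
    [3.1, -0.1, 0.0],
    [-3.0, 0.1, -0.1],
]
basis = top_k_directions(history, k=1)
```

Here the estimated direction aligns closely with the first coordinate axis, the one carrying most of the update magnitude. A per-block variant would run the same estimate separately on each parameter block, which keeps the vectors small.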
Findings
Achieves approximately 2× speedup in pre-training LLaMA models.
Balances stability and convergence speed by moderating updates in the dominant subspace and amplifying those in the bulk-space.
Demonstrates broad applicability across different tasks and models.
Abstract
Recent studies \citep{gur2018gradient,song2024does,wen2024understanding} highlight a fundamental dichotomy in deep learning optimization: although parameter updates along the top eigendirections of the loss Hessian (Dom-space) capture most of the update magnitude, they often contribute minimally to loss reduction. In contrast, updates in the orthogonal component (Bulk-space) have smaller magnitudes but drive most learning progress. In this work, we further advance the understanding of this phenomenon and introduce the \textbf{Bulk-Space-Filtration-Accelerator (BSFA)}, a novel plug-and-play framework. BSFA accelerates training by differentially scaling update components projected onto these distinct subspaces, simultaneously enhancing stability by moderating updates in the dominant subspace and boosting convergence speed by amplifying those in the bulk-space. To ensure BSFA is both…
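The differential scaling described in the abstract can be sketched in a few lines: split an update into its projection onto a dominant subspace and the orthogonal remainder, then rescale each part. This is a minimal sketch, not the authors' implementation. The orthonormal `basis`, the function names, and the factors `alpha` (damping, below 1) and `gamma` (amplification, above 1) are hypothetical choices for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split_update(update, basis):
    """Split `update` into its component in span(basis) and the orthogonal rest.

    `basis` is assumed to be a list of orthonormal vectors.
    """
    dom = [0.0] * len(update)
    for v in basis:
        c = dot(update, v)
        dom = [d + c * vi for d, vi in zip(dom, v)]
    bulk = [u - d for u, d in zip(update, dom)]
    return dom, bulk


def scale_update(update, basis, alpha=0.5, gamma=2.0):
    """Damp the dominant-subspace part by `alpha`, amplify the bulk part by `gamma`.

    Both factors are illustrative values, not taken from the paper.
    """
    dom, bulk = split_update(update, basis)
    return [alpha * d + gamma * b for d, b in zip(dom, bulk)]


# Toy update dominated by the first coordinate, with a one-dimensional Dom-space.
update = [4.0, 0.2, -0.1]
basis = [[1.0, 0.0, 0.0]]
scaled = scale_update(update, basis)
```

With these toy values the large first coordinate is halved while the two small bulk coordinates are doubled. Because the two parts are orthogonal, setting `alpha = gamma = 1` recovers the original update, which makes such a filter easy to wrap around an existing optimizer step.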