Mask in the Mirror: Implicit Sparsification
Tom Jacobs, Rebekka Burkholz

TL;DR
This paper provides a theoretical analysis of continuous sparsification in neural networks, revealing an implicit transition from L2 to L1 regularization, and introduces PILoT, a method that dynamically controls this bias to improve sparsification performance.
Contribution
It offers a theoretical explanation for implicit regularization dynamics and proposes PILoT, a novel sparsification method with dynamic regularization and improved results.
Findings
Implicit sparsification transitions from L2 to L1 regularization over time.
PILoT outperforms baseline methods in standard experiments.
Theoretical guarantees for convergence and optimality are established.
Abstract
Continuous sparsification strategies are among the most effective methods for reducing the inference costs and memory demands of large-scale neural networks. A key factor in their success is the implicit regularization induced by jointly learning both mask and weight variables, which has been shown experimentally to outperform explicit regularization. We provide a theoretical explanation for this observation by analyzing the learning dynamics, revealing that early continuous sparsification is governed by an implicit $L_2$ regularization that gradually transitions to an $L_1$ penalty over time. Leveraging this insight, we propose a method to dynamically control the strength of this implicit bias. Through an extension of the mirror flow framework, we establish convergence and optimality guarantees in the context of underdetermined linear regression. Our theoretical findings…
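The link between a mask-weight parametrization and sparsity can be illustrated with a well-known identity: if a parameter is written as the product x = m * w, then the smallest weight-decay penalty (m^2 + w^2) / 2 over all factorizations of x equals |x|, an $L_1$ penalty. The sketch below checks this numerically for a scalar. It is only an illustration of that identity under stated assumptions, not the paper's mirror-flow analysis or the PILoT method; the function names and the grid range are hypothetical choices made for this example.

```python
import math


def l2_penalty(m, w):
    """Weight decay on the mask/weight pair: (m^2 + w^2) / 2."""
    return 0.5 * (m * m + w * w)


def min_l2_penalty_for_product(x, num_grid=10001):
    """Grid-search the smallest L2 penalty over factorizations m * w = x.

    Hypothetical helper for illustration: |m| is scanned on a log-spaced
    grid in [1e-3, 1e3], and w is set to x / m so the product stays fixed.
    """
    if x == 0.0:
        return 0.0  # m = w = 0 reproduces x = 0 at zero cost
    best = math.inf
    for i in range(num_grid):
        m = 10 ** (-3 + 6 * i / (num_grid - 1))
        w = x / m
        best = min(best, l2_penalty(m, w))
    return best


if __name__ == "__main__":
    for x in (-2.0, 0.5, 3.0):
        print(f"x = {x:+.1f}  min L2 penalty = {min_l2_penalty_for_product(x):.6f}  |x| = {abs(x):.6f}")
```

The minimum is attained at |m| = |w| = sqrt(|x|), which follows from the inequality m^2 + w^2 >= 2|m w|. The grid search can only approach that minimum from above, so the printed values match |x| up to the grid resolution.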
Peer Reviews
Decision: ICLR 2025 Poster
The theoretical analysis of the reparametrization with time-varying weight decay is solid and novel. It is valuable to first develop a deep theoretical understanding, which then serves as a basis for algorithmic improvements.
**Major:**
- **Inaccurate and unclear writing**: The presentation in the paper could benefit from a substantial revision to improve clarity, particularly in the latter half of Section 1 and Sections 2, 3, and 5. Additionally, some inaccuracies need to be addressed. For example, the abstract states that "A key factor in their (continuous sparsification) success is the implicit L1 regularization induced by jointly learning both mask and weight variables." However, the discussion from Lines 54-65 s
* The theoretical result seems sound though I didn't check the proofs.
* The Bregman potential offers insight on $L_1$ regularization effect.
* The gradient flow also offers insights on why spred outperforms Lasso from the perspective of convergence rate.
* I appreciate that the authors also develop an algorithm inspired by the theory.
The writing is poor. Some notations are never explained, e.g., $m^2$ (though I can see that's the square of $L_2$ norm, but it is not the standard notation). The experiments are also a bit hard to read. Please see questions.
- Studying the sparsification of neural networks is an interesting research problem.
- The proposed method that applies the idea of implicit/explicit regularization to pruning seems to be new.
- The performance of the proposed method shown in experiments suggests that there is some improvement over the previous works.

- Many places are not very clear to me as a reader.
- See questions section below.
Taxonomy
Topics: Linguistics and Discourse Analysis
Methods: Linear Regression
