Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing
Nived Rajaraman, Devvrit, Aryan Mokhtari, Kannan Ramchandran

TL;DR
This paper provides the first rigorous theoretical analysis explaining why greedy pruning combined with fine-tuning leads to smaller models that generalize well, focusing on overparameterized matrix sensing with group Lasso regularization.
Contribution
It introduces a provable framework showing how pruning and fine-tuning with regularization results in minimal, well-generalized models in matrix sensing.
Findings
Pruning columns whose ℓ2 norm falls below a certain threshold yields a minimal model close to the ground truth.
Gradient descent from pruned models converges linearly to a good solution.
Regularization is crucial for effective greedy pruning and generalization.
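The prune-then-fine-tune steps in the findings above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes NumPy, a factorized model U U^T, and noiseless linear measurements y_i = ⟨A_i, U U^T⟩. The function names, the `threshold` argument, and the step size are hypothetical choices made for the example, and the group Lasso training stage that precedes pruning is omitted.

```python
import numpy as np


def greedy_prune(U, threshold):
    """Drop every column of U whose l2 norm is below `threshold`.

    `threshold` is an illustrative parameter, not the paper's explicit value.
    """
    norms = np.linalg.norm(U, axis=0)  # one norm per column
    return U[:, norms >= threshold]


def sensing_loss(U, A, y):
    """Mean squared error of the measurements <A_i, U U^T> against y_i."""
    preds = np.einsum("nij,ij->n", A, U @ U.T)
    return np.mean((preds - y) ** 2)


def fine_tune(U, A, y, lr=0.02, steps=300):
    """Plain gradient descent on the unregularized sensing loss.

    The gradient of <A_i, U U^T> with respect to U is (A_i + A_i^T) U.
    """
    U = U.copy()
    n = len(y)
    for _ in range(steps):
        resid = np.einsum("nij,ij->n", A, U @ U.T) - y
        G = np.einsum("n,nij->ij", resid, A)  # sum_i resid_i * A_i
        U -= lr * (2.0 / n) * (G + G.T) @ U
    return U
```

A typical call would be `fine_tune(greedy_prune(U_trained, threshold), A, y)`, where `U_trained` stands for an approximate minimizer of the regularized objective obtained beforehand.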
Abstract
Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters. In fact, several practical studies have shown that if a pruned model is fine-tuned with some gradient-based updates it generalizes well to new samples. Although the above pipeline, which we refer to as pruning + fine-tuning, has been extremely successful in lowering the complexity of trained models, there is very little known about the theory behind this success. In this paper, we address this issue by investigating the pruning + fine-tuning framework on the overparameterized matrix sensing problem with the ground truth $U_\star \in \mathbb{R}^{d \times r}$ and the overparameterized model $U \in \mathbb{R}^{d \times k}$ with $k \gg r$. We study the approximate local minima of the mean square error, augmented with a smooth version of a group Lasso regularizer,…
Peer Reviews
Decision: NeurIPS 2023 poster
The paper is very well written and easy to follow. The problem considered (noisy overparametrized matrix sensing) is relevant by itself and can additionally provide insights for more complicated learning models, which can be of great interest to the community. The results ar… Pruning as a technique to solve this specific problem is very well motivated, based on both Theorem 1 and the achievable statistical precision in the overparametrized setting (cf. line 312).
In Theorem 3, the regularizer weight and the thresholding for pruning are both explicitly dependent on the target rank r. Thus, knowledge of r is required to a degree anyway. Consequently, the comparison made to the overparametrized setting from [23] does not seem entirely fair.
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Geophysical and Geoelectrical Methods
Methods: Pruning
