SWAP: Sparse Entropic Wasserstein Regression for Robust Network Pruning
Lei You, Hei Victor Cheng

TL;DR
SWAP is a robust network pruning method based on Entropic Wasserstein regression. It mitigates noise in the gradients and improves accuracy, especially for large networks and under noisy gradients.
Contribution
The paper proposes SWAP, a novel pruning technique based on Entropic Wasserstein regression, which enhances noise robustness and preserves covariance information during network pruning.
Findings
SWAP achieves comparable performance to state-of-the-art methods.
SWAP outperforms existing methods with large network sizes or high sparsity.
SWAP improves MobileNetV1 accuracy by 6% with significantly fewer parameters.
Abstract
This study addresses the challenge of inaccurate gradients in computing the empirical Fisher Information Matrix during neural network pruning. We introduce SWAP, a formulation of Entropic Wasserstein regression (EWR) for pruning, capitalizing on the geometric properties of the optimal transport problem. The "swap" of the commonly used linear regression with the EWR in optimization is analytically demonstrated to offer noise mitigation effects by incorporating neighborhood interpolation across data points with only marginal additional computational cost. The unique strength of SWAP is its intrinsic ability to balance noise reduction and covariance information preservation effectively. Extensive experiments performed on various networks and datasets show comparable performance of SWAP with state-of-the-art (SoTA) network pruning algorithms. Our proposed method outperforms the SoTA when…
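The "swap" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a squared-error cost between every prediction and every target, uniform weights on the data points, and a plain Sinkhorn iteration for the entropic transport plan. The names `sinkhorn_plan`, `ewr_loss`, `lr_loss`, and `eps` are hypothetical, and the sparsity constraint used in pruning is omitted. Under these assumptions, ordinary least squares corresponds to a plan fixed to the diagonal, while the entropic plan spreads mass over neighboring data points.

```python
import numpy as np


def sinkhorn_plan(cost, eps, n_iters=500):
    """Entropic optimal transport plan between two uniform empirical measures.

    Assumption: plain (non-log-domain) Sinkhorn, adequate for small, well-scaled costs.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform weights on predictions
    b = np.full(m, 1.0 / m)  # uniform weights on targets
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]


def ewr_loss(X, y, w, eps):
    """Transport cost between predictions X @ w and targets y under the entropic plan."""
    pred = X @ w
    cost = (pred[:, None] - y[None, :]) ** 2  # cost[i, j] = (pred_i - y_j)^2
    plan = sinkhorn_plan(cost, eps)
    return float(np.sum(plan * cost)), plan


def lr_loss(X, y, w):
    """Mean squared error: the same cost matrix with the plan fixed to diag(1/n)."""
    return float(np.mean((X @ w - y) ** 2))
```

In this sketch, a very large `eps` drives the plan toward the uniform coupling, so every prediction is compared against every target, while a smaller `eps` concentrates mass on low-cost pairs. That is one way to picture the trade-off between noise reduction and covariance preservation that the abstract describes.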
Peer Reviews
Decision: ICLR 2024 poster
- The paper is well-written and the problem is well-motivated.
- The proposed method has desirable properties and shows improved performance over previous methods, especially at larger sparsity.
- “The noise level σ is set to be the standard deviation of the original gradients”. Why is this the noise level for both gradients and data? I would like to see a more detailed explanation of how the noise is added to data and gradients.
- Can the authors also provide an accuracy table for Table 2 and Table 3?
The following are the primary strengths of this paper:
- The authors propose a straightforward (but novel) modification to the sparse LR framework for neural network pruning. The modification amounts to an additional regularization term grounded by an interpretation using the principles of optimal transport. The optimization problem remains efficiently solvable.
- The authors motivate their method via an analysis of the robustness properties exhibited by solutions to their optimization probl
As a reviewer, I highlight that I am unfamiliar with the current state-of-the-art pruning techniques. I defer to other reviewers regarding the thoroughness of the comparative experiments. However, the method seems grounded. The structure of the manuscript is OK. The writing and clarity of this paper could be significantly improved. In particular, many phrases and statements are unclear, beginning with the abstract: _This study unveils a cutting-edge technique for neural network pruning that jud
1. The reformulation is novel as far as I know. The authors successfully connect the reformulation to the existing sparse regression setup, which makes for a good story.
2. The analysis of neighborhood control is also insightful.

1. The experiments are weak, with no tests on state-of-the-art architectures such as transformers or on larger models such as ResNet50, which raises the suspicion that the proposed approach does not work well at larger model sizes.
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Linear Regression · Softmax · Batch Normalization · Depthwise Convolution · 1x1 Convolution · Average Pooling · Pointwise Convolution · Depthwise Separable Convolution · Convolution
