Subspace Node Pruning
Joshua Offergeld, Marcel van Gerven, Nasir Ahmad

TL;DR
This paper introduces a novel orthogonal subspace projection method for node pruning in neural networks, significantly reducing inference costs while maintaining performance, applicable to both CNNs and large language models.
Contribution
The work proposes a new orthogonal subspace approach for node pruning that optimally reduces redundancy and automatically determines pruning ratios, outperforming existing methods.
Findings
Achieves up to 24x lower computational cost.
Matches or exceeds state-of-the-art pruning results.
Effective on both CNNs and large language models.
Abstract
Improving the efficiency of neural network inference is undeniably important at a time when commercial use of AI models increases daily. Node pruning is the art of removing computational units such as neurons, filters, attention heads, or even entire layers to significantly reduce inference time while retaining network performance. In this work, we propose the projection of unit activations to an orthogonal subspace in which there is no redundant activity and within which we may prune nodes while simultaneously recovering the impact of lost units via linear least squares. We furthermore show that the order in which units are orthogonalized can be optimized to maximally rank units by their redundancy. Finally, we leverage these orthogonal subspaces to automatically determine layer-wise pruning ratios based upon the relative scale of node activations in our subspace, equivalent to…
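The core idea in the abstract (orthogonalize unit activations, prune the most redundant units, and absorb their effect into the next layer by least squares) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `prune_with_least_squares`, the use of QR factorization as a stand-in for Gram-Schmidt, the fixed `n_keep` budget, and the natural column order are all assumptions made for illustration. The optimized ordering and the automatic layer-wise ratios described above are omitted.

```python
import numpy as np

def prune_with_least_squares(X, W, n_keep):
    """Hypothetical helper. X: (samples, units) activations,
    W: (units, outputs) weights of the following linear layer."""
    # QR factorization orthogonalizes the columns of X in order, like Gram-Schmidt.
    # |R[j, j]| is the norm of unit j's activity not explained by units 0..j-1.
    _, R = np.linalg.qr(X)
    residual = np.abs(np.diag(R))
    # Keep the units with the largest non-redundant activity.
    keep = np.sort(np.argsort(residual)[-n_keep:])
    drop = np.setdiff1d(np.arange(X.shape[1]), keep)
    # Least squares: approximate the dropped units from the kept ones.
    B, *_ = np.linalg.lstsq(X[:, keep], X[:, drop], rcond=None)
    # Fold the dropped units' contribution into the kept units' weights.
    W_new = W[keep] + B @ W[drop]
    return keep, W_new

# Toy data (assumed): the fourth unit is an exact linear mix of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([base, 2 * base[:, 0] - base[:, 1]])
W = rng.normal(size=(4, 2))
keep, W_new = prune_with_least_squares(X, W, n_keep=3)
print("kept units:", keep)
print("max output error:", np.abs(X[:, keep] @ W_new - X @ W).max())
```

In this toy case the dropped unit is fully redundant, so the pruned layer reproduces the original output up to floating-point error. Which unit absorbs the redundancy depends on the column order, which is why the ordering matters in the method described above.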
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The approach appears well-motivated. Global calibration of the pruning order of nodes across all layers is an attractive property. Experiments demonstrate competitive performance compared to alternative pruning approaches on VGG networks (VGG-11, VGG-16, VGG-19), and ResNet.
The standard design of VGG networks may make them artificially good candidates for pruning. Specifically, the first fully-connected (FC) layer of VGG learns weights that take a 7x7x512 feature tensor to a 4096-dimensional vector; this involves 7*7*512*4096 = 102.76 million parameters just for that single layer. This is an incredibly inefficient design as an entire VGG-19 network has only 144 million parameters total. Modern CNNs do more gradual reduction of spatial size (e.g., 8x8 to 4x4 to 2
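The parameter arithmetic in this review can be checked with a short sketch. The count below covers weights only (biases are ignored, an assumption), and the 144 million total is the reviewer's figure rather than something computed here.

```python
# Weights of VGG's first fully-connected layer: 7x7x512 inputs to 4096 outputs.
fc1_params = 7 * 7 * 512 * 4096
total_params = 144_000_000  # reviewer's quoted total for VGG-19
share = fc1_params / total_params

print(f"{fc1_params:,}")  # 102,760,448
print(f"{share:.1%}")     # 71.4%
```

By this count, the single layer accounts for roughly seven tenths of the quoted total, which is the basis for the reviewer's concern.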
1. The GS-based orthogonalization for node pruning is neat and the importance score-based pre-ordering makes sense. 2. Experiments show the benefit of the proposed pre-ordering strategy. 3. The results are better than the compared method albeit marginally, with and without retraining.
1. The improvements are marginal and not tested exhaustively on the latest architectures or other relevant pruning strategies. 2. Related to the above, low-rank pruning must be a baseline, given the similarity to the proposed method. 3. The novelty is limited unless there is a significant improvement over SVD and low-rank-based pruning. Also, it is not clear which part of the method provides the most benefit (orthogonalization or the importance score).
The paper presents its approach clearly and lays out the theory behind it nicely. The motivation and why this works are also clearly shown. The experiments show an improvement, and the overall story is well tied together.
There is no proof of generalization to general networks or to different architectures such as transformers, nor is the sensitivity of this approach to language vs. vision shown. Overall, the paper seemed rushed and not well structured. A side note on formatting and section distribution: this seems very different from typical submitted papers, and while it doesn't take away from the content, the lack of formal formatting and some interspersed typos suggest that this was rushed.
S1. The problem of structured pruning, with and without pruning is hard and important. S2. The empirical results on the selected baselines seem encouraging.
W1. [Writing] The paper is poorly written, and often imprecise. It is difficult to comprehend critical sections. Citing a few instances now: 1. The writing flow needs to be improved in section 2. - What is meant by least possible impact on dynamics [line 141]? - Why is it important for the subspace projections to be orthogonal? - Line 159. "..we wish for the final dot product between each pair of vectors.." Which vectors (row vectors or column vectors of X)? Isn't the orthogonalisation depende
Taxonomy
Topics: Antenna Design and Optimization
Methods: VGG-16 · Convolution · Dropout · Dense Connections · Softmax · Max Pooling · Pruning
