Subspace Node Pruning

Joshua Offergeld; Marcel van Gerven; Nasir Ahmad

arXiv:2405.17506 · cs.LG · October 3, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a novel orthogonal subspace projection method for node pruning in neural networks, significantly reducing inference costs while maintaining performance, applicable to both CNNs and large language models.

Contribution

The work proposes a new orthogonal subspace approach for node pruning that optimally reduces redundancy and automatically determines pruning ratios, outperforming existing methods.

Findings

1. Achieves up to 24x lower computational cost.
2. Matches or exceeds state-of-the-art pruning results.
3. Effective on both CNNs and large language models.

Abstract

Improving the efficiency of neural network inference is undeniably important in a time where commercial use of AI models increases daily. Node pruning is the art of removing computational units such as neurons, filters, attention heads, or even entire layers to significantly reduce inference time while retaining network performance. In this work, we propose the projection of unit activations to an orthogonal subspace in which there is no redundant activity and within which we may prune nodes while simultaneously recovering the impact of lost units via linear least squares. We furthermore show that the order in which units are orthogonalized can be optimized to maximally rank units by their redundancy. Finally, we leverage these orthogonal subspaces to automatically determine layer-wise pruning ratios based upon the relative scale of node activations in our subspace, equivalent to…
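
The core idea in the abstract (orthogonalize unit activations, drop units that add little new activity, and refit the outgoing weights by linear least squares) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a fixed unit order and a hand-chosen number of kept units, whereas the paper optimizes the ordering and derives layer-wise pruning ratios. The names `prune_layer`, `novelty`, and `keep` are hypothetical, and QR factorization stands in for an explicit Gram-Schmidt procedure.

```python
import numpy as np

def prune_layer(X, W, keep):
    """Illustrative only. X: (samples, units) activations,
    W: (units, outputs) outgoing weights, keep: number of units to retain."""
    # QR orthogonalizes the columns of X in their given order, spanning the
    # same nested subspaces that Gram-Schmidt would produce.
    _, R = np.linalg.qr(X)
    # |R[i, i]| is the norm of unit i's activity that is not explained
    # by units 0..i-1; a value near zero marks a redundant unit.
    novelty = np.abs(np.diag(R))
    kept = np.sort(np.argsort(novelty)[-keep:])
    # Least squares: refit outgoing weights so the kept units alone
    # reproduce the original layer output X @ W as closely as possible.
    W_new, *_ = np.linalg.lstsq(X[:, kept], X @ W, rcond=None)
    return kept, W_new

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Units 3 and 4 are exact linear combinations of units 0-2, hence redundant.
X = np.hstack([base, base @ rng.normal(size=(3, 2))])
W = rng.normal(size=(5, 4))

kept, W_new = prune_layer(X, W, keep=3)
error = np.max(np.abs(X[:, kept] @ W_new - X @ W))
print(kept, error)
```

In this toy setup the two redundant units are removed and the refit weights reproduce the original output up to floating-point error. With real activations the redundancy is only approximate, so some reconstruction error remains.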

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 5

Strengths

The approach appears well-motivated. Global calibration of the pruning order of nodes across all layers is an attractive property. Experiments demonstrate competitive performance compared to alternative pruning approaches on VGG networks (VGG-11, VGG-16, VGG-19), and ResNet.

Weaknesses

The standard design of VGG networks may make them artificially good candidates for pruning. Specifically, the first fully-connected (FC) layer of VGG learns weights that take a 7x7x512 feature tensor to a 4096-dimensional vector; this involves 7*7*512*4096 = 102.76 million parameters just for that single layer. This is an incredibly inefficient design as an entire VGG-19 network has only 144 million parameters total. Modern CNNs do more gradual reduction of spatial size (e.g., 8x8 to 4x4 to 2
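
The reviewer's parameter count can be checked with a few lines of arithmetic. The 144 million total is the reviewer's figure, taken as given here, and biases are ignored.

```python
# Weights in the first fully-connected layer of VGG: 7x7x512 inputs -> 4096 outputs.
fc_weights = 7 * 7 * 512 * 4096
print(fc_weights)        # 102760448, i.e. about 102.76 million

# Share of the reviewer's quoted VGG-19 total (144 million parameters).
share = fc_weights / 144e6
print(round(share, 3))   # 0.714
```

Roughly 71% of the quoted total therefore sits in this single layer, which is the basis of the reviewer's concern.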

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. The GS-based orthogonalization for node pruning is neat, and the importance-score-based pre-ordering makes sense.
2. Experiments show the benefit of the proposed pre-ordering strategy.
3. The results are better than those of the compared method, albeit marginally, both with and without retraining.

Weaknesses

1. The improvements are marginal and are not tested exhaustively on the latest architectures or against other relevant pruning strategies.
2. Related to the above, low-rank pruning must be a baseline, given its similarity to the proposed method.
3. The novelty is limited unless there is a significant improvement over SVD and low-rank-based pruning. Also, it is not clear which part of the method provides the most benefit (the orthogonalization or the importance score).
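
For context on the low-rank baseline requested above, here is a minimal sketch of truncated-SVD weight factorization, assuming a single dense weight matrix. It is a generic illustration, not a method from the paper or the review; `low_rank_factors` and the matrix sizes are hypothetical. Unlike node pruning, this keeps every input and output unit and instead replaces one weight matrix with two thinner ones.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Illustrative only. Approximate W (in, out) by A @ B
    with A: (in, rank) and B: (rank, out)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # scale the leading left singular vectors
    B = Vt[:rank]                # leading right singular vectors
    return A, B

rng = np.random.default_rng(0)
# A 64x32 weight matrix constructed to have rank 8.
W = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 32))

A, B = low_rank_factors(W, rank=8)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_before = W.size            # 64 * 32
params_after = A.size + B.size    # 64 * 8 + 8 * 32
print(rel_err, params_before, params_after)
```

The factorization operates on the weights alone, whereas the approach described in the abstract operates on unit activations; that difference is one reason a direct comparison between the two would be informative.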

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The paper presents its approach clearly and lays out the underlying theory nicely. The motivation, and why the method works, are also clearly shown. The experiments show an improvement, and the overall story is well tied together.

Weaknesses

There is no evidence of generalization to general networks or to different architectures such as transformers, and the sensitivity of the approach to language versus vision tasks is not shown. Overall, the paper seems rushed and not well structured. A side note on formatting and section distribution: the paper looks very different from typical submissions, and while this doesn't take away from the content, the lack of formal formatting and some interspersed typos suggest that it was rushed.

Reviewer 04 · Rating 3 · Confidence 4

Strengths

S1. The problem of structured pruning, with and without retraining, is hard and important.
S2. The empirical results on the selected baselines seem encouraging.

Weaknesses

W1. [Writing] The paper is poorly written and often imprecise, and critical sections are difficult to comprehend. A few instances:
1. The writing flow needs to be improved in Section 2.
- What is meant by "least possible impact on dynamics" [line 141]?
- Why is it important for the subspace projections to be orthogonal?
- Line 159: "..we wish for the final dot product between each pair of vectors.." Which vectors (row vectors or column vectors of X)? Isn't the orthogonalisation depende

Taxonomy

Topics: Antenna Design and Optimization

Methods: VGG-16 · Convolution · Dropout · Dense Connections · Softmax · Max Pooling · Pruning