Mutual Information Preserving Neural Network Pruning
Charles Westphal, Stephen Hailes, Mirco Musolesi

TL;DR
This paper introduces Mutual Information Preserving Pruning (MIPP), a novel activation-based neural network pruning method that maintains mutual information between layers, improving efficiency and re-trainability of pruned models both before and after training.
Contribution
MIPP is presented as the first structured pruning technique that explicitly conserves mutual information between layer activations. It is applicable at different training stages and is reported to outperform existing methods.
Findings
MIPP outperforms state-of-the-art pruning methods.
MIPP preserves mutual information between layers.
Pruned models remain re-trainable.
Abstract
Pruning has emerged as the primary approach used to limit the resource requirements of large neural networks (NNs). Since the proposal of the lottery ticket hypothesis, researchers have focused either on pruning at initialization or after training. However, recent theoretical findings have shown that the sample efficiency of robust pruned models is proportional to the mutual information (MI) between the pruning masks and the model's training datasets, whether at initialization or after training. In this paper, starting from these results, we introduce Mutual Information Preserving Pruning (MIPP), a structured activation-based pruning technique applicable before or after training. The core principle of MIPP is to select nodes in a way that conserves MI shared between the activations of adjacent layers, and consequently between the data and masks. Approaching the pruning problem…
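The core principle stated in the abstract (keep the nodes that conserve the MI between adjacent layers' activations) can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given in the text above. It swaps in two simplifying assumptions: MI is estimated under a joint-Gaussian model, and nodes are chosen by plain greedy forward selection. The names `gaussian_mi`, `select_nodes`, and `keep_fraction` are hypothetical.

```python
import numpy as np

def gaussian_mi(x, y, eps=1e-6):
    """MI in nats between column sets x and y, ASSUMING joint Gaussianity.

    This estimator is an illustrative stand-in, not the paper's MI estimator.
    """
    def logdet(a):
        d = a.shape[1]
        cov = np.cov(a, rowvar=False).reshape(d, d)  # reshape handles d == 1
        return np.linalg.slogdet(cov + eps * np.eye(d))[1]
    return 0.5 * (logdet(x) + logdet(y) - logdet(np.hstack([x, y])))

def select_nodes(acts_in, acts_out, keep_fraction=0.95):
    """Greedily keep columns (nodes) of acts_in until their MI with acts_out
    reaches keep_fraction of the MI obtained with all nodes (hypothetical rule)."""
    n = acts_in.shape[1]
    target = keep_fraction * gaussian_mi(acts_in, acts_out)
    kept, remaining, current = [], list(range(n)), 0.0
    while remaining and current < target:
        # Score each candidate by the MI retained if it were added to the kept set.
        scores = [gaussian_mi(acts_in[:, kept + [j]], acts_out) for j in remaining]
        best = int(np.argmax(scores))
        current = scores[best]
        kept.append(remaining.pop(best))
    mask = np.zeros(n, dtype=bool)
    mask[kept] = True
    return mask

# Toy activations: only the first two upstream nodes influence the next layer.
rng = np.random.default_rng(0)
acts_in = rng.normal(size=(2000, 6))
acts_out = acts_in[:, :2] + 0.1 * rng.normal(size=(2000, 2))
mask = select_nodes(acts_in, acts_out, keep_fraction=0.95)
print(mask)
```

In this toy setting the resulting boolean vector acts as a structured (per-node) mask: the uninformative nodes add almost nothing to the estimated MI, so selection stops before they are reached.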
Peer Reviews
Decision·Submitted to ICLR 2026
- The method is principled and grounded in a solid conceptual framework (information theory).
- Whilst mutual information is a nice concept, it is not a computationally convenient one. The authors have nevertheless found a way to approximate this computation and make their pruning algorithm scale (at least seemingly, see question below).
- From the empirical evaluation it is clear that the proposed method works well, at least on the kind of task and architecture being considered.
- The proposed m
Some technical aspects should be tightened/cleaned up a bit. Specifically:
- 172: I would expect a Hadamard product $\odot$ here. Also, only a very particular kind of pruning is considered here: the mask is not an $m\times n$ matrix but an $n\times 1$ vector, which is equivalent to having an $m\times n$ mask where entire columns are set to zero, a very special kind of $m\times n$ mask.
- 207: what exactly do you mean by “by gradient ascent”?
- 207: what is $\mathcal{F}$? Is this related to the p
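The reviewer's remark about the mask at line 172 can be checked numerically. The sketch below assumes a weight matrix `W` of shape $m\times n$ acting on an $n$-dimensional input `x` (this orientation is an assumption, as are all variable names). It shows that applying an $n\times 1$ node mask to the input via a Hadamard product gives the same output as applying an $m\times n$ weight mask whose columns are zeroed wherever the node mask is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
W = rng.normal(size=(m, n))   # assumed layer weights, shape (m, n)
x = rng.normal(size=n)        # assumed input activations, shape (n,)

# Structured (node-level) mask: one entry per input node.
node_mask = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
out_node = W @ (node_mask * x)             # Hadamard product on the activations

# Equivalent weight-level mask: the node mask repeated on every row,
# so whole columns of W are zeroed.
weight_mask = np.tile(node_mask, (m, 1))   # shape (m, n)
out_weight = (weight_mask * W) @ x         # Hadamard product on the weights

print(np.allclose(out_node, out_weight))
```

A general $m\times n$ mask can zero individual weights independently, so the node mask covers only the column-structured subset of such masks, which is the reviewer's point.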
The overall paper is sensibly motivated in terms of mutual information maintenance, though there is some question as to why inter-layer MI matters when the only information we care about is the information at the final layer. The paper proposes a layerwise pruning method that may differ from previous work.
Writing: The paper is fairly well written, but would really benefit from: (a) a concise, obvious statement of the problem (this is fairly good but could be tighter): really early on, say what you want to achieve in unambiguous terms; pruning can mean many things and unstructured pruning is broken, so make sure you get across early that you are doing structured pruning. Define the task precisely. (b) a precise statement of the insight. Again, it is good. You get across the insight, in that you say quite a
- Provides theoretical justification for the design of the method.
- The method is explained well.
- A large selection of datasets is explored.
- Benchmarked against a collection of pruning methods, namely: IMP (PaT), SOSP-H (PaT), OTO (PaT), IterGraSP (PaI), IterSnip (PaI), ProsPR (PaI), and SynFlow (PaI), resulting in 3 for PaT and 4 for PaI.
Although the method allows pruning at higher ratios than the methods explored while maintaining performance, it suffers from a high discovery cost. In Figure 2a, is it the train or test accuracy being presented? Additionally, the paper states that increasing the mutual information results in a performance increase; however, between an $I(\mathcal{D}; \mathcal{M})$ of 10 and 15, there is a performance drop for the conv model, with no explanation given. Figure 2c does not state the `differ
Taxonomy
Topics: Neural Networks and Applications
Methods: Pruning
