Partition Pruning: Parallelization-Aware Pruning for Deep Neural Networks
Sina Shahhosseini, Ahmad Albaqsami, Masoomeh Jasemi, Nader Bagherzadeh

TL;DR
Partition Pruning is a pruning method that reduces neural network parameters with parallel execution in mind, speeding up inference and lowering energy consumption with limited accuracy loss.
Contribution
The paper introduces a partition pruning scheme that tailors neural network pruning to parallel inference, improving speed and energy efficiency.
Findings
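7.72x inference speedup on the pruned layers of TinyVGG16, relative to the unpruned model on a single accelerator
2.73x reduction in the energy used to compute those layers
Limited accuracy reduction when partitioning fully connected layers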
Abstract
Parameters of recent neural networks require a huge amount of memory. These parameters are used by neural networks to perform machine learning tasks when processing inputs. To speed up inference, we develop Partition Pruning, an innovative scheme that reduces the number of parameters while taking parallelization into consideration. We evaluated the performance and energy consumption of parallel inference of partitioned models, which showed a 7.72x speedup and a 2.73x reduction in the energy used for computing the pruned layers of TinyVGG16, compared with running the unpruned model on a single accelerator. In addition, our method showed a limited reduction in accuracy while partitioning fully connected layers.
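To make the idea of pruning with parallelization in mind more concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the abstract does not say which weights are removed, so the block-diagonal mask below is an assumption chosen only for illustration. The names `partition_mask`, `prune`, `dense`, and `partitioned_dense` are hypothetical. The sketch shows that once every weight linking two different partitions of a fully connected layer is pruned, each partition can be computed on its own accelerator without exchanging activations.

```python
def part_of(index, size, n_parts):
    """Map a neuron index to its partition (contiguous, near-equal blocks)."""
    return index * n_parts // size


def partition_mask(n_out, n_in, n_parts):
    """0/1 mask that keeps weight (i, j) only if output i and input j
    belong to the same partition. ASSUMPTION: block-diagonal structure."""
    return [
        [1 if part_of(i, n_out, n_parts) == part_of(j, n_in, n_parts) else 0
         for j in range(n_in)]
        for i in range(n_out)
    ]


def prune(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def dense(weights, x):
    """Reference fully connected layer (no bias): y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def partitioned_dense(weights, x, n_parts):
    """Compute the pruned layer one partition at a time. Each loop
    iteration touches only its own block of W and slice of x, so the
    iterations could run on separate accelerators."""
    n_out, n_in = len(weights), len(x)
    y = []
    for p in range(n_parts):
        rows = [i for i in range(n_out) if part_of(i, n_out, n_parts) == p]
        cols = [j for j in range(n_in) if part_of(j, n_in, n_parts) == p]
        for i in rows:
            y.append(sum(weights[i][j] * x[j] for j in cols))
    return y


# Toy 4x4 layer split across 2 hypothetical accelerators.
W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 1, 1, 1]
mask = partition_mask(4, 4, 2)
W_pruned = prune(W, mask)
kept_fraction = sum(map(sum, mask)) / 16
```

With equal-sized partitions, this particular mask keeps a fraction 1/n_parts of the weights, and `dense(W_pruned, x)` matches `partitioned_dense(W, x, 2)`. The speedup and energy figures reported in the abstract come from the authors' own evaluation and are not reproduced by this toy example.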
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
