TL;DR
This paper introduces quantization-aware pruning, which applies pruning during quantization-aware training to produce neural networks for ultra-low-latency inference at lower computational cost.
Contribution
The study systematically explores the interplay of pruning and quantization-aware training, showing that the combination has advantages over either technique alone and over other neural architecture search methods.
Findings
Quantization-aware pruning improves computational efficiency over pruning or quantization alone.
It performs comparably to or better than Bayesian optimization in terms of computational efficiency.
Network information content varies with training configurations, impacting generalizability.
Abstract
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes…
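To make the two techniques named in the abstract concrete, the following is a minimal sketch of magnitude pruning combined with uniform weight quantization on a single list of weights. It is not the paper's implementation: the function names (`magnitude_prune_mask`, `fake_quantize`, `effective_weights`), the symmetric quantization grid, and the toy weights are all assumptions chosen for illustration.

```python
def magnitude_prune_mask(weights, sparsity):
    """Return a 0/1 mask that removes the `sparsity` fraction of
    smallest-magnitude weights (hypothetical helper, illustration only)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def fake_quantize(weights, bits, max_abs=1.0):
    """Snap each weight to a signed, symmetric `bits`-bit grid over
    [-max_abs, max_abs]. The grid choice is an assumption."""
    levels = 2 ** (bits - 1) - 1      # e.g. 3 positive levels for 3 bits
    scale = max_abs / levels
    quantized = []
    for w in weights:
        q = round(w / scale)
        q = max(-levels, min(levels, q))  # clip to the representable range
        quantized.append(q * scale)
    return quantized


def effective_weights(weights, mask, bits):
    """Weights the forward pass would see: quantized, then masked."""
    return [q * m for q, m in zip(fake_quantize(weights, bits), mask)]


weights = [0.9, -0.05, 0.4, 0.02, -0.7, 0.1]
mask = magnitude_prune_mask(weights, sparsity=0.5)
w_eff = effective_weights(weights, mask, bits=3)
print(mask)
print(w_eff)
```

In a training loop, a common arrangement is to use the masked, quantized weights in the forward pass while the optimizer updates the underlying full-precision weights; whether the paper does exactly this is not stated in the excerpt above.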
Taxonomy
Methods: Pruning
