TL;DR
This paper presents Condensa, a programmable system for neural network compression that uses Bayesian optimization to automatically tune compression strategies (for example, their target sparsity). This substantially reduces model memory footprint and runtime without manual trial-and-error.
Contribution
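Introduces Condensa, a Python-based programmable framework in which users compose compression strategies from simple operators. Given a strategy and an optimization objective, Condensa uses Bayesian optimization to automatically infer a suitable target sparsity for the DNN.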
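Condensa's abstract describes users composing simple Python operators into larger compression strategies. The sketch below illustrates that idea under stated assumptions. The names `prune`, `quantize`, and `compose` are hypothetical and are not Condensa's actual API, and a layer's weights are modeled as a flat list of floats. Pruning here is magnitude-based and quantization is uniform rounding to a fixed step. Both are common choices, not necessarily the ones the paper uses.

```python
from typing import Callable, List

# Hypothetical stand-in for a layer's parameter tensor: a flat list of floats.
Weights = List[float]
# A compression operator maps weights to (compressed) weights.
Operator = Callable[[Weights], Weights]


def prune(density: float) -> Operator:
    """Magnitude pruning: keep the `density` fraction of largest-|w| weights, zero the rest."""
    def apply(w: Weights) -> Weights:
        keep = max(0, min(len(w), round(density * len(w))))
        # Indices of the `keep` largest-magnitude weights.
        ranked = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
        kept = set(ranked[:keep])
        return [x if i in kept else 0.0 for i, x in enumerate(w)]
    return apply


def quantize(step: float) -> Operator:
    """Uniform quantization: round each weight to a multiple of `step` (ties to even, as in round())."""
    def apply(w: Weights) -> Weights:
        return [round(x / step) * step for x in w]
    return apply


def compose(*ops: Operator) -> Operator:
    """Chain operators left to right into a single compression strategy."""
    def apply(w: Weights) -> Weights:
        for op in ops:
            w = op(w)
        return w
    return apply
```

For instance, `compose(prune(density=0.5), quantize(step=0.25))` applied to `[0.9, -0.1, 0.3, -0.7]` keeps the two largest-magnitude weights and then rounds them, giving `[1.0, 0.0, 0.0, -0.75]`. Modeling each operator as a plain function is what makes strategies composable: a composed strategy is itself just another operator.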
Findings
Achieved up to a 188x reduction in memory footprint
Improved hardware runtime throughput by up to 2.59x
Uses at most ten samples (candidate evaluations) per search
Abstract
Deep neural networks (DNNs) frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both the model size and inference time without appreciable loss in accuracy. However, finding the best compression strategy and corresponding target sparsity for a given DNN, hardware platform, and optimization objective currently requires expensive, frequently manual, trial-and-error experimentation. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build more complex and practically interesting compression strategies. Given a strategy and user-provided objective (such as minimization of running time),…
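The abstract frames compression tuning as a search for a target sparsity that satisfies a user objective, where every candidate is expensive to evaluate. Condensa solves this with a Bayesian optimization algorithm. The sketch below deliberately swaps in plain bisection, a much simpler technique, only to make the problem shape concrete: find the highest sparsity whose accuracy stays above a floor, using a fixed evaluation budget (ten here, mirroring the reported sample count). It assumes accuracy is non-increasing in sparsity and that the dense model (sparsity 0) meets the floor. The function `search_sparsity` and the toy accuracy curve are hypothetical.

```python
from typing import Callable, Tuple


def search_sparsity(
    accuracy_at: Callable[[float], float],
    min_accuracy: float,
    budget: int = 10,
) -> Tuple[float, int]:
    """Return (best_sparsity, probes_used) for the highest feasible sparsity found.

    Plain bisection over sparsity in [0, 1] (NOT Bayesian optimization).
    Assumes accuracy is non-increasing in sparsity and that sparsity 0.0
    (the dense model) is feasible, so it is never evaluated.
    Each probe is one expensive evaluation; at most `budget` probes are made.
    """
    lo, hi = 0.0, 1.0  # lo: highest sparsity known to be feasible
    used = 0
    while used < budget:
        mid = (lo + hi) / 2
        used += 1
        if accuracy_at(mid) >= min_accuracy:
            lo = mid  # feasible: try higher sparsity
        else:
            hi = mid  # infeasible: back off
    return lo, used


def toy_accuracy(sparsity: float) -> float:
    """Hypothetical accuracy curve: flat at 0.76 up to 90% sparsity, then falling."""
    return 0.76 - 2.0 * max(0.0, sparsity - 0.9)
```

With `toy_accuracy` and a floor of 0.75, the true boundary is 90.5% sparsity, and ten probes return 0.904296875, just below it. A Bayesian optimizer such as the one the paper describes would instead fit a surrogate model to past evaluations and choose each next probe from it, which matters when the accuracy curve is noisy or not monotone.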
Taxonomy
Methods: Pruning
