Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning
Elias Frantar, Sidak Pal Singh, Dan Alistarh

TL;DR
This paper introduces a unified, efficient framework for post-training compression of deep neural networks that covers both pruning and quantization, improving the accuracy-compression trade-off without retraining.
Contribution
A novel compression framework extending the Optimal Brain Surgeon method to include quantization, enabling effective one-shot model compression for DNNs.
Findings
Significantly better compression-accuracy trade-offs than existing methods
Enables accurate combined pruning and quantization post-training
Efficient in time and space for practical deployment
Abstract
We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of the emerging software and hardware support for executing models compressed via pruning and/or quantization with speedup, and well-performing solutions have been proposed independently for both compression approaches. In this paper, we introduce a new compression framework which covers both weight pruning and quantization in a unified setting, is time- and space-efficient, and considerably improves upon the practical performance of existing post-training methods. At the technical level, our approach is based on an exact and efficient realization of the classical Optimal…
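To make the Optimal Brain Surgeon idea named in the Contribution above more concrete, here is a minimal sketch of one classical OBS pruning step. It is not the paper's implementation. It assumes a single weight row `w` and a layer-wise squared-error objective on calibration inputs `X`, for which the Hessian is `2 X X^T`. The function name `obs_prune_one`, the random data, and the shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """One classical OBS step: remove one weight and adjust the others."""
    # Saliency: predicted loss increase if weight p is removed and the
    # remaining weights are optimally updated to compensate.
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    p = int(np.argmin(saliency))
    # Closed-form compensating update; it drives w[p] to zero.
    w_new = w - (w[p] / H_inv[p, p]) * H_inv[:, p]
    w_new[p] = 0.0  # clear floating-point residue
    return w_new, p, saliency[p]

# Hypothetical calibration data: 4 input features, 64 samples (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))
w = rng.normal(size=4)

# Hessian of the layer-wise error ||w X - w' X||^2 with respect to w'.
H = 2.0 * X @ X.T
H_inv = np.linalg.inv(H)

w_new, p, predicted = obs_prune_one(w, H_inv)

# Actual error after the OBS step, and after zeroing w[p] with no update.
actual = np.sum((w @ X - w_new @ X) ** 2)
w_naive = w.copy()
w_naive[p] = 0.0
naive = np.sum((w @ X - w_naive @ X) ** 2)
```

Because the assumed objective is exactly quadratic, `predicted` and `actual` agree up to rounding, and `actual` is never larger than `naive`, the error from zeroing the same weight without compensation. This sketch covers pruning only; how the framework extends the step to quantization is not shown here.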
Taxonomy
Topics: Medical Imaging Techniques and Applications · Advanced MRI Techniques and Applications · Medical Image Segmentation Techniques
Methods: Pruning
