NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
Yongchang Hao, Yanshuai Cao, Lili Mou

TL;DR
NeuZip is a weight compression scheme that substantially reduces memory usage during neural network training and inference. Training dynamics are left unchanged and inference performance stays near-lossless, which makes larger models practical on memory-constrained devices.
Contribution
NeuZip presents a new weight compression method based on the entropy of floating-point numbers in neural networks. It reduces memory requirements during both training and inference without sacrificing model performance.
Findings
Reduced the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB
Reduced inference memory usage by more than half with near-lossless performance
Kept training dynamics fully unchanged under compression
Abstract
The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
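To make the phrase "entropy of floating-point numbers" concrete, the following minimal sketch measures the empirical Shannon entropy of the 8-bit exponent field of float32 values. It is not the NeuZip implementation. The abstract does not say which part of the number is compressed, so the focus on the exponent field, the synthetic Gaussian weights, and the helper names `exponent_field` and `shannon_entropy_bits` are all assumptions made for illustration.

```python
import math
import random
import struct
from collections import Counter


def exponent_field(x: float) -> int:
    """Return the 8-bit exponent field of x when stored as IEEE 754 float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits >> 23) & 0xFF


def shannon_entropy_bits(symbols) -> float:
    """Empirical Shannon entropy of a symbol stream, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Assumption: small zero-mean Gaussian values stand in for trained weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]

exponent_entropy = shannon_entropy_bits(exponent_field(w) for w in weights)
print(f"exponent entropy: {exponent_entropy:.2f} bits out of 8 stored")
```

If the measured entropy is well below the 8 bits that the format stores, a lossless entropy coder could in principle shrink that field without changing any value, which is the general idea behind compression that leaves training dynamics unchanged.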
Taxonomy
Topics: Neural Networks and Applications
