QStore: Quantization-Aware Compressed Model Storage
Raunak Shah, Zhaoheng Li, Yongjoo Park

TL;DR
QStore is a lossless compression format that efficiently stores multi-precision models by keeping the low-precision model together with only the residual information needed to reconstruct the high-precision model, significantly reducing storage costs while maintaining fast load times for both low- and high-precision models.
Contribution
QStore introduces a unified, lossless compression method that stores low-precision models and residuals to reconstruct high-precision models, saving storage without sacrificing speed.
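As a rough illustration of the "low-precision model plus residual" idea, the toy sketch below splits each 16-bit weight pattern into a high byte (standing in for the low-precision copy) and a low byte (the residual), then rebuilds the original bits exactly. This is not QStore's actual encoding: the byte split, the truncation-based `bf16_bits` helper, and all function names are assumptions made for illustration, and a real INT8 quantizer is not a simple byte truncation.

```python
import struct

def bf16_bits(x):
    # Hypothetical helper: take the upper 16 bits of the float32 bit pattern
    # (plain truncation, no rounding) as a stand-in for a BF16 value.
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def split(bits16):
    # Toy "low-precision" copy = high byte; residual = low byte.
    low = bytes(b >> 8 for b in bits16)
    residual = bytes(b & 0xFF for b in bits16)
    return low, residual

def reconstruct(low, residual):
    # Recombine the two byte streams into the original 16-bit patterns.
    return [(l << 8) | r for l, r in zip(low, residual)]

weights = [0.5, -1.25, 3.0, 0.1]
high = [bf16_bits(w) for w in weights]
low, residual = split(high)

# Lossless: the high-precision bit patterns are recovered exactly.
assert reconstruct(low, residual) == high

# Unified storage keeps low + residual; separate files keep low + full high.
unified_bytes = len(low) + len(residual)
separate_bytes = len(low) + 2 * len(high)
print(unified_bytes, separate_bytes)
```

In this toy, a user who only needs the low-precision copy reads `low` alone, while a high-precision user reads both streams; the size ratio here comes from the byte split and says nothing about the savings the paper reports.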
Findings
Reduces storage footprint by up to 2.2x (to about 45% of the original size)
Enables up to 1.7x faster model saving
Enables up to 1.8x faster model loading
Abstract
Modern applications commonly leverage large, multi-modal foundation models. These applications often feature complex workflows that demand the storage and usage of similar models in multiple precisions. A straightforward approach is to maintain a separate file for each model precision (e.g., INT8, BF16), which is indeed the approach taken by many model providers such as HuggingFace and Ollama. However, this approach incurs excessive storage costs since a higher precision model (e.g., BF16) is a strict superset of a lower precision model (e.g., INT8) in terms of information. Unfortunately, simply maintaining only the higher-precision model and requiring every user to dynamically convert the model precision is not desirable because every user of lower precision models must pay the cost for model download and precision conversion. In this paper, we present QStore, a unified, lossless…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Database Systems and Queries
