BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments
Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu

TL;DR
BitStack is a training-free, decomposition-based weight compression method for large language models that allows dynamic adjustment of model size with minimal performance loss, suitable for variable memory environments.
Contribution
Introduces BitStack, a novel, training-free weight decomposition approach enabling flexible, fine-grained model size control for LLMs in variable memory settings.
Findings
Matches or surpasses quantization baselines at extreme compression ratios
Produces a residual block of approximately 1 bit per parameter in each decomposition iteration
Enables dynamic memory-performance trade-offs with minimal transmission overhead
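To make the idea of stacking roughly 1-bit residual blocks concrete, here is a minimal sketch of an iterative residual decomposition. It assumes each block keeps the sign matrix of the current residual (1 bit per parameter once packed) plus a small low-rank factorization of the residual's magnitudes. The function names, the `rank` parameter, and the use of a plain truncated SVD are illustrative assumptions; this summary does not specify the exact procedure BitStack uses.

```python
import numpy as np

def decompose(weight, num_blocks, rank=1):
    """Split `weight` into a list of (sign, u_k, v_k) residual blocks.

    Hypothetical helper: each iteration approximates the current residual
    as sign(R) * (u_k @ v_k), then subtracts that approximation.
    """
    residual = weight.astype(np.float64)
    blocks = []
    for _ in range(num_blocks):
        # Sign matrix: stored as int8 here; a real implementation would
        # pack it to 1 bit per entry (for example with np.packbits).
        sign = np.where(residual >= 0, 1.0, -1.0)
        # Low-rank approximation of the magnitudes |R| via truncated SVD.
        u, s, vt = np.linalg.svd(np.abs(residual), full_matrices=False)
        u_k = u[:, :rank] * s[:rank]   # fold singular values into u
        v_k = vt[:rank, :]
        blocks.append((sign.astype(np.int8), u_k, v_k))
        residual = residual - sign * (u_k @ v_k)
    return blocks

def reconstruct(blocks, shape, num_loaded):
    """Sum the first `num_loaded` blocks to approximate the original weight."""
    approx = np.zeros(shape)
    for sign, u_k, v_k in blocks[:num_loaded]:
        approx += sign * (u_k @ v_k)
    return approx
```

In this sketch, loading more blocks can only keep or reduce the Frobenius-norm reconstruction error, because each step removes the leading singular components of the remaining magnitudes. The low-rank factors are small relative to the sign matrix, which is why a block of this shape costs about 1 bit per parameter.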
Abstract
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach…
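The memory-performance trade-off described in the abstract can be pictured as choosing how many residual blocks to keep resident for a given budget. The sketch below is only an illustration: it assumes blocks are held in a fixed priority order and that each block's size in megabytes is known. The names `blocks_that_fit` and `plan_transfer` are hypothetical, and this summary does not state how BitStack orders or selects blocks.

```python
def blocks_that_fit(block_sizes_mb, budget_mb):
    """Return how many leading blocks fit within `budget_mb`."""
    used = 0.0
    count = 0
    for size in block_sizes_mb:
        if used + size > budget_mb:
            break
        used += size
        count += 1
    return count

def plan_transfer(loaded, block_sizes_mb, budget_mb):
    """Decide which block indices to load or evict when the budget changes.

    `loaded` is the number of leading blocks currently in memory. Only the
    difference between the current and target counts is moved.
    """
    target = blocks_that_fit(block_sizes_mb, budget_mb)
    if target > loaded:
        return ("load", list(range(loaded, target)))
    if target < loaded:
        return ("evict", list(range(target, loaded)))
    return ("keep", [])
```

Because only the blocks between the current and target counts move, a change in available memory touches a few small blocks instead of reloading a differently compressed model.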
Code & Models
- 🤗 BitStack/BitStack-Llama-3.1-8B (model · 5 downloads)
- 🤗 BitStack/BitStack-Llama-2-7B (model · 2 downloads)
- 🤗 BitStack/BitStack-Llama-2-13B (model · 12 downloads)
- 🤗 BitStack/BitStack-Llama-3-8B (model · 4 downloads)
- 🤗 BitStack/BitStack-Llama-3.1-70B (model · 3 downloads)
- 🤗 BitStack/BitStack-Llama-3-70B (model · 2 downloads)
- 🤗 BitStack/BitStack-Llama-2-70B (model · 2 downloads)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques
Methods: Batch Normalization · Convolution · Residual Connection · Residual Block
