BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments

Xinghao Wang; Pengyu Wang; Bo Wang; Dong Zhang; Yunhua Zhou; Xipeng Qiu

arXiv:2410.23918·cs.CL·February 18, 2025

TL;DR

BitStack is a training-free, decomposition-based weight compression method for large language models that allows dynamic adjustment of model size with minimal performance loss, suitable for variable memory environments.

Contribution

Introduces BitStack, a novel, training-free weight decomposition approach enabling flexible, fine-grained model size control for LLMs in variable memory settings.

Findings

01. Matches or surpasses quantization baselines at extreme compression ratios

02. Each decomposition iteration yields a residual block costing approximately 1 bit per parameter

03. Enables dynamic memory-performance trade-offs with minimal transmission between running memory and storage
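
To make the roughly 1-bit-per-parameter figure concrete, the arithmetic below is a minimal sketch under an assumed block layout: one sign bit per weight plus two thin 16-bit factor matrices. The layout, the function name block_bits_per_param, and the 4096 x 4096 shape with rank 16 are illustrative assumptions, not details given on this page.

```python
def block_bits_per_param(rows, cols, rank, factor_bits=16):
    """Storage cost of one hypothetical residual block, in bits per weight.

    Assumed layout (not taken from the paper summary):
      - a sign matrix with one bit per weight
      - two low-rank factors of shape (rows, rank) and (rank, cols)
    """
    sign_bits = rows * cols                          # 1 bit per weight
    factor_total = rank * (rows + cols) * factor_bits  # both thin factors
    return (sign_bits + factor_total) / (rows * cols)


rows, cols, rank = 4096, 4096, 16                    # illustrative shape
bpp = block_bits_per_param(rows, cols, rank)         # 1.125 bits per weight
block_mib = bpp * rows * cols / 8 / 2**20            # 2.25 MiB per block
print(bpp, block_mib)
```

Under these assumptions, one block for a single 4096 x 4096 matrix takes about 2.25 MiB, so loading or dropping blocks changes the footprint in megabyte-sized steps.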

Abstract

Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach…
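
The idea of stacking residual blocks can be illustrated with a short NumPy sketch. The abstract above does not spell out the decomposition, so the scheme here is an assumption chosen for illustration: each iteration stores the sign of the current residual plus a truncated SVD of its magnitude, then subtracts that approximation. The names decompose and reconstruct, the matrix shape, and the rank are hypothetical.

```python
import numpy as np


def decompose(weight, num_blocks, rank):
    """Split `weight` into a stack of residual blocks (sign, U, V).

    Assumed scheme: residual ~= sign(residual) * rank-k SVD of |residual|.
    """
    residual = weight.astype(np.float64)
    blocks = []
    for _ in range(num_blocks):
        # Sign matrix; kept as floats here, it could be bit-packed on disk.
        sign = np.where(residual >= 0, 1.0, -1.0)
        u, s, vt = np.linalg.svd(np.abs(residual), full_matrices=False)
        u_k = u[:, :rank] * s[:rank]      # fold singular values into U
        v_k = vt[:rank, :]
        blocks.append((sign, u_k, v_k))
        residual = residual - sign * (u_k @ v_k)
    return blocks


def reconstruct(blocks, num_loaded):
    """Approximate the weight from the first `num_loaded` blocks."""
    approx = np.zeros(blocks[0][0].shape)
    for sign, u_k, v_k in blocks[:num_loaded]:
        approx += sign * (u_k @ v_k)
    return approx


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))         # stand-in for one weight matrix
blocks = decompose(w, num_blocks=4, rank=4)

# Error after loading 0, 1, 2, 3, 4 blocks: more blocks, lower error.
errors = [np.linalg.norm(w - reconstruct(blocks, n)) for n in range(5)]
print(errors)
```

Because every block refines the approximation left by the previous ones, the same stored stack serves any memory budget: a runtime loads as many blocks as fit and drops the most recent ones when memory gets tight, with no recompression.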

Code & Models

Repositories

xinghaow99/bitstack (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques

Methods: Batch Normalization · Convolution · Residual Connection · Residual Block