SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi

TL;DR
SeedLM introduces a data-free, seed-based compression technique for LLMs that reduces memory access and speeds up inference, maintaining accuracy while significantly decreasing model size.
Contribution
The paper presents a novel post-training, seed-based compression method for LLMs that is data-free and generalizes across tasks, outperforming existing techniques in accuracy retention and inference speed.
Findings
SeedLM achieves better zero-shot accuracy than existing compression techniques at 4 and 3 bits per weight.
It maintains performance comparable to FP16 baselines.
FPGA tests show a nearly 4x speed-up for 70B models.
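The "near 4x" figure matches what a purely memory-bound estimate would predict. The sketch below is a back-of-the-envelope calculation, not the paper's measurement. It assumes that inference time is dominated by reading weights from memory, that the FP16 baseline stores 16 bits per weight, and that the compressed model averages 4 stored bits per weight with seeds and coefficients included. The function names are hypothetical.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Size of the weights in GB (10^9 bytes) at a given average bit width."""
    return num_params * bits_per_weight / 8 / 1e9


def ideal_memory_bound_speedup(baseline_bits, compressed_bits):
    """Upper bound on speed-up when runtime scales with bytes read from memory."""
    return baseline_bits / compressed_bits


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model, as in the findings above
    print(weight_gigabytes(params, 16))       # FP16 footprint
    print(weight_gigabytes(params, 4))        # assumed 4-bit footprint
    print(ideal_memory_bound_speedup(16, 4))  # idealized bound
```

A measured speed-up would be expected to fall somewhat below this bound, since regenerating the random matrices and any compute-bound work are not free.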
Abstract
Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art compression methods that rely on calibration data, our approach is data-free and generalizes well across…
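The seed-then-reconstruct idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal pure-Python illustration under stated assumptions. It assumes a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11, two coefficients per block, a brute-force search over a small range of seeds, and coefficients left in floating point (the method described above stores compressed coefficients). All function names are hypothetical.

```python
def lfsr16_stream(seed, count):
    """Return `count` successive states of a 16-bit Fibonacci LFSR."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an LFSR seed must be non-zero")
    out = []
    for _ in range(count):
        # Feedback bit for taps 16, 14, 13, 11 (a maximal-length polynomial).
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        out.append(state)
    return out


def basis_from_seed(seed, rows):
    """Expand a seed into a rows x 2 pseudo-random matrix with entries in (-1, 1)."""
    vals = [s / 32768.0 - 1.0 for s in lfsr16_stream(seed, rows * 2)]
    return [(vals[2 * r], vals[2 * r + 1]) for r in range(rows)]


def fit_coeffs(basis, weights):
    """Least-squares fit of two coefficients via the 2x2 normal equations."""
    a = sum(u0 * u0 for u0, _ in basis)
    b = sum(u0 * u1 for u0, u1 in basis)
    d = sum(u1 * u1 for _, u1 in basis)
    r0 = sum(u[0] * w for u, w in zip(basis, weights))
    r1 = sum(u[1] * w for u, w in zip(basis, weights))
    det = a * d - b * b
    if abs(det) < 1e-12:
        return None  # degenerate basis, skip this seed
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)


def reconstruct(seed, coeffs, n):
    """Regenerate the basis from the seed and combine it with the coefficients."""
    return [coeffs[0] * u0 + coeffs[1] * u1 for u0, u1 in basis_from_seed(seed, n)]


def compress_block(weights, num_seeds=4096):
    """Try seeds 1..num_seeds and keep the one with the lowest squared error."""
    best = None
    for seed in range(1, num_seeds + 1):
        coeffs = fit_coeffs(basis_from_seed(seed, len(weights)), weights)
        if coeffs is None:
            continue
        rec = reconstruct(seed, coeffs, len(weights))
        err = sum((w - r) ** 2 for w, r in zip(weights, rec))
        if best is None or err < best[2]:
            best = (seed, coeffs, err)
    return best


if __name__ == "__main__":
    block = [0.12, -0.40, 0.33, 0.05, -0.27, 0.18, -0.09, 0.41]
    seed, coeffs, err = compress_block(block)
    print(seed, coeffs, err)
    print(reconstruct(seed, coeffs, len(block)))
```

Only the seed and the coefficients need to be stored per block. The matrix itself is regenerated from the seed at inference time, which is where the trade of extra compute for fewer memory accesses comes from.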
Peer Reviews
Decision·ICLR 2025 Poster
Novel method to store and retrieve codes with pseudo-random number generator. High quality presentation with necessary formulas and diagrams. Attention to implementation details in Performance Analysis section.
1) Comparison with other quantization methods is incomplete. The most striking shortcoming is the lack of comparison with finetuned models, which is what most current SOTA models use. 2) The paper dismisses comparison with strong methods like AQ, SPQR in a desire to "avoid costly training". Yet these are quite good benchmarks to compare with, they have reported figures, and to a large share of practitioners the extra training time (hours actually) could be acceptable. 3) some of the results in table 2 are
1- **Innovative Use of Arbitrary Data Formats**: The adoption of arbitrary data formats with shared exponents is a commendable design choice, enhancing the flexibility of SeedLM's quantization approach. 2- **Efficient FPGA Implementation**: Proposing an efficient FPGA implementation demonstrates the hardware viability of SeedLM and highlights potential real-world deployment in resource-constrained environments. 3- **Data-Free Compression**: SeedLM operates without calibration data, which diffe
1- **Absence of GPTQ Comparison**: The paper does not provide a comparison with GPTQ, a commonly used quantization baseline, which is a notable omission given GPTQ's relevance to LLM compression. 2- **Inference Efficiency Assumptions**: While the paper mentions using the latest repositories, many of these codebases likely store compressed weights in full precision during inference, leading to potential memory inefficiencies. 3- **GPU Implementation Challenges**: Although FPGA implementation is
1. Weight compression using pseudo-random generator seeds is a novel-sounding technique. It enables significant compression while maintaining high accuracy. 2. Unlike many state-of-the-art compression methods, the proposed method does not require calibration data, reducing the need for correction data acquisition and potentially further reducing the quantization offset problem caused by the calibration data distribution. 3. The authors validated the computational characteristics and efficiency o
1. The author compared the AWQ, Omniquant, and QuIP# methods. However, Omniquant and QuIP# were primarily designed for ultra-low bit-width quantization compression, such as 2-bit, but the author only compared the performance of 3/4-bit and did not show the quantization results of 2-bit. In the field of LLM quantization, SOTA methods specifically designed for 4/3-bit, such as GPTQ[1], were not included in the comparison. This makes the results unconvincing. 2. The author mentions in Section 4.1,
Taxonomy
Topics: Neural Networks and Applications
Methods: LLaMA
