TL;DR
RSR-core is a high-performance engine that accelerates low-bit matrix-vector multiplication for neural network inference on CPU and GPU, with reported speedups of up to 62x on CPU and up to 1.9x for token generation on CUDA.
Contribution
The paper introduces RSR-core, an optimized low-level implementation of the RSR algorithm for efficient low-bit matrix-vector multiplication in inference pipelines.
Findings
Up to 62x speedup on CPU over baseline PyTorch multiplication.
Up to 1.9x speedup for token generation on CUDA.
Supports binary and ternary weight matrices in practical deployments.
Abstract
Matrix-vector multiplication is a fundamental building block in neural networks, vector databases, and large language models, particularly during inference. As a result, efficient matrix-vector multiplication engines directly translate into more efficient inference. Recent work has explored low-bit quantization of model weights, where matrices are represented using binary (1-bit) or ternary (1.58-bit) values while activations are kept in higher precision. These representations enable efficient hardware-level computation. In parallel, algorithms such as Redundant Segment Reduction (RSR) provide theoretical guarantees for accelerating low-bit matrix-vector multiplication. However, existing implementations operate at the application level and cannot be efficiently integrated into hardware kernels, limiting practical performance. To bridge this gap, we present RSR-core, a high-performance…
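To make the idea of low-bit matrix-vector multiplication concrete, here is a minimal NumPy sketch. It is not the RSR-core implementation or its API. It only illustrates one way redundancy in binary and ternary weights can be exploited: within a block of k columns, each row is one of at most 2^k bit patterns, so the entries of the activation vector that share a pattern can be summed once and then combined through a small pattern table. The function names, the block width `k`, and the split of a ternary matrix into two binary matrices are all assumptions made for illustration.

```python
import numpy as np

def grouped_binary_vecmat(v, B, k=4):
    """Compute v @ B for a {0,1} matrix B by grouping identical row patterns.

    Hypothetical helper for illustration only, not part of RSR-core.
    """
    n, m = B.shape
    out = np.zeros(m, dtype=np.float64)
    for start in range(0, m, k):
        block = B[:, start:start + k].astype(np.int64)
        width = block.shape[1]
        # Encode each row of the block as an integer pattern id in [0, 2**width).
        ids = block @ (1 << np.arange(width))
        # Sum the entries of v whose rows share the same pattern id.
        sums = np.bincount(ids, weights=v, minlength=1 << width)
        # Pattern table: row p holds the bits of p, lowest bit first.
        table = (np.arange(1 << width)[:, None] >> np.arange(width)) & 1
        out[start:start + width] = sums @ table
    return out

def ternary_vecmat(v, W, k=4):
    """Compute v @ W for a {-1,0,1} matrix W, using W = P - N with binary P, N."""
    P = (W == 1)
    N = (W == -1)
    return grouped_binary_vecmat(v, P, k) - grouped_binary_vecmat(v, N, k)

# Usage: compare against the dense product on random data.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(64, 10))   # ternary weights
v = rng.standard_normal(64)              # higher-precision activations
print(np.allclose(ternary_vecmat(v, W), v @ W))
```

In this sketch the pattern ids are recomputed on every call to keep the code short. Because the weights are fixed at inference time, they could instead be computed once ahead of time and reused for every activation vector.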
