An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks
Mohsen Dehghankar, Mahdi Erfanian, Abolfazl Asudeh

TL;DR
This paper introduces a matrix multiplication algorithm tailored to binary and ternary neural networks. It improves inference speed and memory efficiency, making large models more accessible and cost-effective.
Contribution
The paper presents an algorithm that preprocesses the fixed weight matrices of a trained model to speed up inference. It improves the time complexity of matrix multiplication by a logarithmic factor and yields substantial practical gains in speed and memory.
Findings
Up to 29x reduction in multiplication time
Up to 6x reduction in memory usage
Up to 5.24x inference speedup in LLMs
Abstract
Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make these models more accessible and cost-effective, in this paper, we propose algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices. Particularly focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for a weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log…
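As a rough illustration of why a fixed binary weight matrix lends itself to preprocessing, the Python sketch below indexes the matrix once and then reuses that index for every vector-matrix product. It is a simplified sketch, not the authors' implementation: the function names (`preprocess_binary`, `pattern_matrix`, `multiply`, `preprocess_ternary`, `multiply_ternary`), the block width `k`, and the handling of ternary weights as a difference of two binary matrices are all assumptions made for this example. The idea shown is that a block of `k` columns contains at most `2**k` distinct row patterns, so entries of the input vector that meet the same pattern can be summed once and shared.

```python
import numpy as np


def preprocess_binary(B, k):
    """Index a fixed 0/1 matrix B (n x m) in column blocks of width k.

    For each block, every row is encoded as an integer pattern id.
    This runs once, after training, because B no longer changes.
    """
    n, m = B.shape
    blocks = []
    for start in range(0, m, k):
        block = np.asarray(B[:, start:start + k]).astype(np.int64)
        width = block.shape[1]  # the last block may be narrower than k
        # Most significant bit first: column j has weight 2**(width-1-j).
        bit_weights = 1 << np.arange(width - 1, -1, -1)
        codes = block @ bit_weights
        blocks.append((codes, width))
    return blocks


def pattern_matrix(width):
    """All 2**width bit patterns as rows; row p is the binary expansion of p."""
    ids = np.arange(1 << width)[:, None]
    shifts = np.arange(width - 1, -1, -1)[None, :]
    return (ids >> shifts) & 1


def multiply(v, blocks):
    """Compute v @ B using only the index built by preprocess_binary."""
    out = []
    for codes, width in blocks:
        # Sum the entries of v whose rows share the same pattern in this block.
        sums = np.bincount(codes, weights=v, minlength=1 << width)
        # Each pattern contributes its summed value to the columns where it has a 1.
        out.append(sums @ pattern_matrix(width))
    return np.concatenate(out)


def preprocess_ternary(W, k):
    """Assumed ternary handling: split W in {-1, 0, 1} into two 0/1 matrices."""
    return preprocess_binary(W == 1, k), preprocess_binary(W == -1, k)


def multiply_ternary(v, index):
    """Compute v @ W as the difference of two binary products."""
    pos, neg = index
    return multiply(v, pos) - multiply(v, neg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(-1, 2, size=(256, 256))
    v = rng.standard_normal(256)
    index = preprocess_ternary(W, k=6)
    print(np.allclose(multiply_ternary(v, index), v @ W))
```

In this sketch each block costs one pass over the input vector plus a small product against the pattern table, so choosing `k` trades the number of blocks against the size of that table. The pattern table is rebuilt on every call here for brevity; a real implementation would presumably cache it or avoid materializing it.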
Taxonomy
Topics: Quantum-Dot Cellular Automata · Low-power high-performance VLSI design · Cellular Automata and Applications
