An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Mohsen Dehghankar; Mahdi Erfanian; Abolfazl Asudeh

arXiv:2411.06360·cs.LG·May 5, 2025

TL;DR

This paper introduces a matrix multiplication algorithm for binary and ternary neural networks that speeds up inference and reduces memory use, making large models more accessible and cost-effective.

Contribution

The paper presents an algorithm that preprocesses the fixed, trained weight matrices to speed up inference, improving time complexity by a logarithmic factor and delivering substantial practical speed and memory gains.

Findings

01. Up to 29x reduction in multiplication time
02. Up to 6x reduction in memory usage
03. Up to 5.24x inference speedup in LLMs

Abstract

Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make these models more accessible and cost-effective, we propose algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices. Focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n \times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log…
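The preprocessing idea in the abstract can be illustrated with a small sketch. It is not taken from the paper or from the `uic-indexlab/rsr` repository: the function names (`preprocess`, `multiply`, `split_ternary`, `multiply_ternary`), the block width `k`, and the choice to group identical row patterns inside each column block are all assumptions made for illustration. The sketch only shows that a fixed binary matrix can be indexed once so that each vector-matrix product reuses partial sums, and that a ternary matrix can be handled as the difference of two binary ones. It is not tuned for storage or speed.

```python
from collections import defaultdict


def preprocess(B, k):
    """Index a fixed binary matrix B (list of 0/1 rows) in column blocks of width k.

    Hypothetical helper: within each block, rows sharing the same bit
    pattern are grouped, giving {pattern: [row indices]} per block.
    """
    n_rows, n_cols = len(B), len(B[0])
    index = []
    for start in range(0, n_cols, k):
        width = min(k, n_cols - start)
        groups = defaultdict(list)
        for i in range(n_rows):
            groups[tuple(B[i][start:start + width])].append(i)
        index.append((start, dict(groups)))
    return index


def multiply(v, index, n_cols):
    """Compute the row vector v times B using only the index from preprocess()."""
    out = [0] * n_cols
    for start, groups in index:
        for pattern, rows in groups.items():
            # One sum of v per distinct pattern, shared by every set bit in it.
            s = sum(v[i] for i in rows)
            for j, bit in enumerate(pattern):
                if bit:
                    out[start + j] += s
    return out


def split_ternary(W):
    """Write a {-1, 0, 1} matrix W as P - N with P and N binary."""
    P = [[1 if w == 1 else 0 for w in row] for row in W]
    N = [[1 if w == -1 else 0 for w in row] for row in W]
    return P, N


def multiply_ternary(v, W, k):
    """Compute v times a ternary W via two binary products (indices rebuilt here
    for brevity; in practice they would be built once and reused)."""
    n_cols = len(W[0])
    P, N = split_ternary(W)
    pos = multiply(v, preprocess(P, k), n_cols)
    neg = multiply(v, preprocess(N, k), n_cols)
    return [p - q for p, q in zip(pos, neg)]
```

In this sketch, each column block costs one pass over `v` plus at most `2**k` small updates, so the choice of `k` trades the number of blocks against the number of distinct patterns per block. Calling `preprocess` once per weight matrix and `multiply` once per input vector mirrors the split between offline indexing and online inference that the abstract describes.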

Code & Models

Repositories

uic-indexlab/rsr (PyTorch, official)

Videos

An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks · SlidesLive

Taxonomy

Topics: Quantum-Dot Cellular Automata · Low-power high-performance VLSI design · Cellular Automata and Applications