T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Jianyu Wei; Shijie Cao; Ting Cao; Lingxiao Ma; Lei Wang; Yanyong Zhang; and Mao Yang

arXiv:2407.00088·cs.DC·March 26, 2025

TL;DR

T-MAC is a lookup table (LUT)-based method for low-bit LLM inference on CPUs. It performs mixed-precision matrix multiplication (mpGEMM) directly, without dequantizing the weights, which raises throughput and lowers energy consumption on edge devices.

Contribution

The paper presents T-MAC, a LUT-based approach that supports mpGEMM directly for low-bit LLMs on CPUs and removes the dequantization step that existing systems rely on.

Findings

01 Up to 4x throughput increase compared to llama.cpp

02 70% reduction in energy consumption

03 Achieves high token generation speed on various devices

Abstract

The deployment of Large Language Models (LLMs) on edge devices is increasingly important for on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on such devices. However, low-bit LLMs require mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems lack native support for mpGEMM and resort to dequantizing the weights for high-precision computation. This indirect approach can add significant inference overhead. In this paper, we introduce T-MAC, a lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC supports mpGEMM directly, without dequantization, while eliminating multiplications and reducing the number of additions required. Specifically, T-MAC transforms the traditional…
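
The idea of replacing dequantize-then-multiply with table lookups can be illustrated with a small sketch. This is not the T-MAC kernel: it assumes 1-bit weights encoding {-1, +1}, a group size of 4, and a single activation vector, and the names `build_tables`, `pack_row`, `lut_gemv`, and `dequant_gemv` are hypothetical. For each group of activations it precomputes every possible signed sum once, so that each weight row needs only one lookup and one addition per group.

```python
# Illustrative sketch only, not the T-MAC implementation.
# Assumptions: 1-bit weights (bit 1 -> +1, bit 0 -> -1), group size G = 4.

G = 4  # activations covered by one table lookup (assumed for illustration)


def build_tables(activations):
    """For each group of G activations, precompute all 2**G signed sums."""
    assert len(activations) % G == 0
    tables = []
    for start in range(0, len(activations), G):
        group = activations[start:start + G]
        table = []
        for pattern in range(1 << G):
            total = 0.0
            for j, a in enumerate(group):
                # Bit j of `pattern` set -> add a, clear -> subtract a.
                total += a if (pattern >> j) & 1 else -a
            table.append(total)
        tables.append(table)
    return tables


def pack_row(bits):
    """Pack a row of weight bits (0 or 1) into one G-bit index per group."""
    return [
        sum(bit << j for j, bit in enumerate(bits[start:start + G]))
        for start in range(0, len(bits), G)
    ]


def lut_gemv(packed_rows, activations):
    """Matrix-vector product: one lookup and one addition per group."""
    tables = build_tables(activations)
    return [
        sum(tables[g][index] for g, index in enumerate(row))
        for row in packed_rows
    ]


def dequant_gemv(weight_bits, activations):
    """Reference path: expand bits to +/-1.0, then multiply-accumulate."""
    return [
        sum((1.0 if bit else -1.0) * a for bit, a in zip(row, activations))
        for row in weight_bits
    ]


weight_bits = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
activations = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.75, 1.0]

packed = [pack_row(row) for row in weight_bits]
lut_result = lut_gemv(packed, activations)
reference = dequant_gemv(weight_bits, activations)
print(lut_result, reference)
```

In this toy setting the tables are built once per activation vector and reused by every weight row, so the table-building cost is amortized as the number of rows grows. The group size trades table size (2**G entries per group) against the number of lookups per row.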

Code & Models

Repositories

microsoft/t-mac (Official)

Taxonomy

Topics: Advanced Data Storage Technologies · Advancements in Semiconductor Devices and Circuit Design · Parallel Computing and Optimization Techniques

Methods: LLaMA