T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, and Mao Yang

TL;DR
T-MAC introduces a lookup-table-based method for efficient low-bit LLM inference on CPUs. By supporting mixed-precision matrix multiplication (mpGEMM) without dequantization, it significantly improves throughput and reduces energy consumption on edge devices.
Contribution
The paper presents T-MAC, a novel LUT-based approach that supports mpGEMM for low-bit LLMs directly on CPUs, removing the dequantization step and its overhead.
Findings
Up to 4x throughput increase compared to llama.cpp
70% reduction in energy consumption
Achieves high token generation speed on various devices
Abstract
The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect approach can lead to significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the number of additions required. Specifically, T-MAC transforms the traditional…
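To make the lookup-table idea concrete, here is a minimal sketch in plain Python. It is not T-MAC's actual kernel: it assumes 1-bit weights that encode -1/+1, a group size of 4 activations, and hypothetical helper names (`build_tables`, `pack_row`, `lut_gemv`, `dequant_gemv`) chosen only for illustration. For each group of activations, every possible signed partial sum is precomputed once. Each weight row then needs one table lookup and one addition per group, instead of one multiply-accumulate per weight.

```python
G = 4  # group size; an assumption for this sketch


def build_tables(activations, g=G):
    """For each group of g activations, precompute all 2**g signed partial sums."""
    assert len(activations) % g == 0
    tables = []
    for start in range(0, len(activations), g):
        group = activations[start:start + g]
        table = []
        for idx in range(1 << g):
            # Bit j of idx set means weight +1; clear means weight -1.
            total = 0.0
            for j, a in enumerate(group):
                total += a if (idx >> j) & 1 else -a
            table.append(total)
        tables.append(table)
    return tables


def pack_row(weight_bits, g=G):
    """Pack a row of 0/1 weight bits into one table index per group."""
    indices = []
    for start in range(0, len(weight_bits), g):
        idx = 0
        for j, bit in enumerate(weight_bits[start:start + g]):
            idx |= bit << j
        indices.append(idx)
    return indices


def lut_gemv(packed_rows, tables):
    """Matrix-vector product using only lookups and additions."""
    return [sum(table[idx] for table, idx in zip(tables, row))
            for row in packed_rows]


def dequant_gemv(weight_rows, activations):
    """Reference path: dequantize bits to -1/+1, then multiply-accumulate."""
    return [sum((1.0 if b else -1.0) * a for b, a in zip(row, activations))
            for row in weight_rows]


activations = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 3.0]
weight_rows = [
    [1, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

tables = build_tables(activations)              # built once per activation vector
packed = [pack_row(r) for r in weight_rows]     # done offline in practice
lut_out = lut_gemv(packed, tables)
ref_out = dequant_gemv(weight_rows, activations)
print(lut_out, ref_out)
```

The cost of building the tables is shared across all weight rows, so the saving grows with the number of output rows. Multi-bit weights and the CPU-specific optimizations described in the paper are outside the scope of this sketch.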
Taxonomy
Topics: Advanced Data Storage Technologies · Advancements in Semiconductor Devices and Circuit Design · Parallel Computing and Optimization Techniques
Methods: LLaMA
