T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization
Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das, Sanggeon Yun, Suyeon Jang, Andrew Ding, Nikil Dutt, and Mohsen Imani

TL;DR
T-SAR is a CPU-only framework that generates lookup tables (LUTs) inside SIMD registers and reorganizes the SIMD ALUs in place, enabling scalable, energy-efficient ternary LLM inference on edge devices without the bottleneck of memory-based LUTs.
Contribution
T-SAR is the first framework to repurpose the SIMD register file for dynamic, in-register LUT generation, improving inference speed and energy efficiency with minimal hardware modifications (3.2% power and 1.4% area overhead in the SIMD units).
Findings
Achieves 5.6-24.5x reduction in GEMM latency
Delivers 1.1-86.2x higher GEMV throughput
Improves energy efficiency by up to 4.9x over NVIDIA Jetson AGX Orin
Abstract
Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy…
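To make the lookup-table idea in the abstract concrete, here is a minimal sketch of a generic LUT-based ternary matrix-vector product. It is not T-SAR's kernel: T-SAR builds its tables inside SIMD registers, while this scalar Python version only illustrates the arithmetic being replaced. The group size and the names `GROUP`, `encode_row`, `build_luts`, and `lut_gemv` are assumptions made for illustration.

```python
from itertools import product

GROUP = 4  # hypothetical group size: activations covered by one table
PATTERNS = list(product((-1, 0, 1), repeat=GROUP))  # all 3**GROUP ternary patterns

def encode_row(weights):
    """Map each run of GROUP ternary weights to its base-3 index into PATTERNS."""
    codes = []
    for start in range(0, len(weights), GROUP):
        index = 0
        for w in weights[start:start + GROUP]:
            index = index * 3 + (w + 1)  # weights -1, 0, 1 become digits 0, 1, 2
        codes.append(index)
    return codes

def build_luts(x):
    """For every activation group, precompute its dot product with each pattern."""
    luts = []
    for start in range(0, len(x), GROUP):
        group = x[start:start + GROUP]
        luts.append([sum(p * a for p, a in zip(pattern, group))
                     for pattern in PATTERNS])
    return luts

def lut_gemv(coded_rows, x):
    """Ternary matrix-vector product using table lookups instead of multiplies."""
    luts = build_luts(x)  # built once per activation vector, reused by every row
    return [sum(lut[c] for lut, c in zip(luts, row)) for row in coded_rows]

# Small example: a 2 x 8 ternary weight matrix and an integer activation vector.
W = [[1, 0, -1, 1, 0, 0, 1, -1],
     [-1, -1, 0, 1, 1, 0, 0, 1]]
x = [3, -2, 5, 7, 1, 4, -6, 2]
y = lut_gemv([encode_row(row) for row in W], x)
```

The tables depend only on the activations, so they must be rebuilt for every input vector. This is why where the tables live (memory versus registers) matters for performance.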
Taxonomy
Topics: Network Packet Processing and Optimization · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
