T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das, Sanggeon Yun, Suyeon Jang, Andrew Ding, Nikil Dutt, and Mohsen Imani

arXiv:2511.13676 · cs.AR · November 18, 2025

TL;DR

T-SAR is a CPU-only framework for ternary LLM inference on edge devices. It generates lookup tables (LUTs) in SIMD registers rather than in memory and reorganizes the SIMD ALUs in place, which removes the memory bottleneck of earlier LUT-based CPU approaches and makes inference more scalable and energy-efficient.

Contribution

The authors present T-SAR as the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation. It improves inference speed and energy efficiency with minimal hardware modifications.

Findings

01. Achieves a 5.6-24.5x reduction in GEMM latency

02. Delivers 1.1-86.2x higher GEMV throughput

03. Improves energy efficiency by up to 4.9x over NVIDIA Jetson AGX Orin

Abstract

Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy…
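The LUT idea the abstract refers to can be illustrated with a small sketch. This is not T-SAR's implementation: it only shows the general technique of LUT-based ternary GEMV, in which the dot products of a small group of activations with every possible ternary weight pattern are tabulated once and then reused for every weight row. The group size `GROUP`, the pattern encoding, and all function names are assumptions made for illustration.

```python
from itertools import product

GROUP = 2  # hypothetical group size: weights are consumed GROUP at a time

# All 3**GROUP ternary patterns in a fixed order. A weight group is stored
# as the index of its pattern in this list (an illustrative encoding).
PATTERNS = list(product((-1, 0, 1), repeat=GROUP))
PATTERN_INDEX = {p: i for i, p in enumerate(PATTERNS)}


def encode_weights(row):
    """Pack one ternary weight row into one pattern index per group."""
    if len(row) % GROUP != 0:
        raise ValueError("row length must be a multiple of GROUP")
    return [PATTERN_INDEX[tuple(row[i:i + GROUP])]
            for i in range(0, len(row), GROUP)]


def build_luts(x):
    """For each activation group, tabulate its dot product with every pattern."""
    if len(x) % GROUP != 0:
        raise ValueError("activation length must be a multiple of GROUP")
    luts = []
    for i in range(0, len(x), GROUP):
        chunk = x[i:i + GROUP]
        luts.append([sum(w * a for w, a in zip(p, chunk)) for p in PATTERNS])
    return luts


def lut_gemv(encoded_rows, x):
    """GEMV using only table lookups and additions, no multiplications per row."""
    luts = build_luts(x)  # built once per activation vector, shared by all rows
    return [sum(lut[idx] for lut, idx in zip(luts, row)) for row in encoded_rows]


def reference_gemv(W, x):
    """Plain multiply-accumulate GEMV, used only to check the LUT version."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


W = [[1, 0, -1, 1], [0, 0, 1, -1], [-1, -1, 1, 0]]
x = [3, -2, 5, 7]
y = lut_gemv([encode_weights(r) for r in W], x)
```

Here the tables are ordinary Python lists held in memory, which corresponds to the memory-based LUT style the abstract describes as a scalability limit. The sketch does not model the register-file placement or the SIMD ALU changes that the paper proposes.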

Taxonomy

Topics: Network Packet Processing and Optimization · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications