VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
Zi-Wei Lin and Tian-Sheuan Chang

TL;DR
VitaLLM is a compact accelerator for mixed-precision inference of ternary-weight large language models on edge devices. It pairs a multiplier-free ternary-INT core (TINT) with a Booth-based core (BoothFlex) and uses predictive sparse attention to cut key/value fetches.
Contribution
The paper introduces VitaLLM, a mixed-precision accelerator that combines a multiplier-free TINT core, a BoothFlex core built on a shared radix-4 Booth datapath, and predictive sparse attention, enabling practical LLM inference on edge hardware.
Findings
A 16 nm silicon prototype achieves 72.46 tokens/s in decode at 1 GHz/0.8 V.
Reduces key/value (KV) traffic with predictive sparse attention and sustains utilization by reusing one Booth datapath for both attention and ternary-INT work.
Demonstrates practical 3B-parameter LLM inference on edge-class platforms.
Abstract
We present VitaLLM, a mixed-precision accelerator that enables ternary-weight large language models to run efficiently on edge devices. The design combines two compute cores: a multiplier-free TINT core for ternary-INT projections, and a BoothFlex core that reuses a radix-4 Booth datapath for both INT8×INT8 attention and ternary-INT, sustaining utilization without duplicating arrays. A predictive sparse attention mechanism employs a leading-one (LO) surrogate with a comparison-free top-k selector to prune key/value (KV) fetches for cached tokens, confining exact attention to the selected candidates. System-level integration uses head-level pipelining and an absmax-based quantization barrier to standardize cross-core interfaces and overlap nonlinear reductions with linear tiles. A 16 nm silicon prototype at 1 GHz/0.8 V achieves 72.46 tokens/s in decode and 0.88 s…
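The arithmetic ideas named in the abstract can be illustrated with a small software sketch. This is not the authors' hardware design: all function names are hypothetical, the leading-one surrogate is interpreted here as keeping only the most significant set bit of each operand, and an ordinary sort stands in for the comparison-free top-k selector.

```python
def absmax_quantize(xs, bits=8):
    """Quantize floats to signed integers, scaling by the absolute maximum."""
    qmax = (1 << (bits - 1)) - 1          # 127 for 8 bits
    amax = max(abs(x) for x in xs)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(x / scale) for x in xs], scale


def ternary_dot(weights, acts):
    """Dot product with weights in {-1, 0, +1}: only adds and subtracts."""
    acc = 0
    for w, a in zip(weights, acts):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc


def leading_one(v):
    """Approximate an integer by its leading-one bit, keeping the sign."""
    if v == 0:
        return 0
    mag = 1 << (abs(v).bit_length() - 1)
    return mag if v > 0 else -mag


def lo_score(query, key):
    """Cheap surrogate for query.key using power-of-two operands.

    Each product of two powers of two could be a shift in hardware;
    it is written as a multiply here for readability.
    """
    return sum(leading_one(q) * leading_one(k) for q, k in zip(query, key))


def top_k_indices(scores, k):
    """Software stand-in for the selector: indices of the k largest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def exact_scores_for(query, keys, candidates):
    """Exact integer dot products, computed only for the kept candidates."""
    return {i: sum(q * k for q, k in zip(query, keys[i])) for i in candidates}


# Toy INT8-range query and cached keys (made-up values for illustration).
query = [64, -3, 10]
keys = [[60, 1, 8], [-50, 2, 1], [5, 5, 5], [30, -20, 12]]

surrogate = [lo_score(query, key) for key in keys]
candidates = top_k_indices(surrogate, k=2)
exact = exact_scores_for(query, keys, candidates)
print(candidates, exact)
```

Only the keys chosen by the surrogate are fetched for the exact dot product, which is the source of the KV traffic savings the abstract describes. How well the surrogate ranking tracks the exact ranking on real attention scores is not something this toy example shows.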
