VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices

Zi-Wei Lin and Tian-Sheuan Chang

arXiv:2605.00320 · cs.AR · May 4, 2026

TL;DR

VitaLLM is a compact mixed-precision accelerator for running ternary-weight large language models on edge devices. It pairs two compute cores (a multiplier-free TINT core and a Booth-based BoothFlex core) with a predictive sparse attention mechanism.

Contribution

The paper introduces VitaLLM, a mixed-precision accelerator that combines a multiplier-free ternary-INT core, a reusable radix-4 Booth core, and predictive sparse attention to make ternary-weight LLM inference practical on edge hardware.

Findings

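01. Achieves 72.46 tokens/s in decode on a 16 nm silicon prototype at 1 GHz/0.8 V.

02. Prunes key/value (KV) fetches by roughly $1 - K/M$ for $M$ cached tokens through predictive sparse attention, and improves utilization through head-level pipelining and a shared Booth datapath.

03. Demonstrates practical 3B-parameter LLM inference on edge-class platforms.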

Abstract

We present VitaLLM, a mixed-precision accelerator that enables ternary-weight large language models to run efficiently on edge devices. The design combines two compute cores: a multiplier-free TINT core for ternary-INT projections, and a BoothFlex core that reuses a radix-4 Booth datapath for both INT8 $\times$ INT8 attention and ternary-INT operations, sustaining utilization without duplicating arrays. A predictive sparse attention mechanism employs a leading-one (LO) surrogate with a comparison-free top-$K$ selector to prune key/value (KV) fetches by roughly $1 - K / M$ for $M$ cached tokens, confining exact attention to $K$ candidates. System-level integration uses head-level pipelining and an absmax-based quantization barrier to standardize cross-core interfaces and overlap nonlinear reductions with linear tiles. A 16 nm silicon prototype at 1 GHz/0.8 V achieves 72.46 tokens/s in decode and 0.88 s…
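The ideas named in the abstract can be illustrated in software with a rough NumPy sketch. It is not the paper's hardware design: the function names are hypothetical, the leading-one surrogate is assumed here to mean "keep only the most significant set bit of each INT8 magnitude", and `np.argpartition` stands in for the comparison-free top-$K$ selector, whose actual circuit is not described in the text above.

```python
import numpy as np

def absmax_quantize(x, bits=8):
    """Symmetric absmax quantization: the largest |x| maps to the largest INT value."""
    qmax = 2 ** (bits - 1) - 1
    amax = float(np.max(np.abs(x)))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.rint(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def ternary_matvec(w_ternary, x_int):
    """Multiplier-free projection: weights in {-1, 0, +1} only add, subtract, or skip."""
    pos = np.where(w_ternary == 1, x_int, 0).sum(axis=1)
    neg = np.where(w_ternary == -1, x_int, 0).sum(axis=1)
    return pos - neg

def leading_one(x_int):
    """Assumed LO surrogate: keep the top set bit of each magnitude, e.g. 13 -> 8."""
    mag = np.abs(x_int)
    lo = np.zeros_like(mag)
    nz = mag > 0
    exponent = np.floor(np.log2(mag[nz])).astype(np.int64)
    lo[nz] = np.left_shift(1, exponent)
    return np.sign(x_int) * lo

def sparse_attention(q, keys, values, k):
    """Score all M keys cheaply, then run exact attention on only K candidates."""
    q_i, q_s = absmax_quantize(q)
    k_i, k_s = absmax_quantize(keys)
    # Surrogate scores: products of powers of two (shift-and-add in hardware).
    approx = leading_one(k_i) @ leading_one(q_i)
    # Software stand-in for the top-K selector (assumption, not the paper's circuit).
    top = np.argpartition(approx, -k)[-k:]
    # Exact INT8 x INT8 scores, restricted to the K selected keys.
    scores = (k_i[top] @ q_i) * (q_s * k_s) / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[top], top
```

With, say, $M = 64$ cached tokens and $K = 8$, the exact stage in this sketch touches only 8 key/value rows, which matches the stated reduction of roughly $1 - K/M = 87.5\%$.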
