TOM: A Ternary Read-only Memory Accelerator for LLM-powered Edge Intelligence

Hongyi Guan; Yijia Zhang; Wenqiang Wang; Yizhao Gao; Shijie Cao; Chen Zhang; Ningyi Xu

arXiv:2602.20662 · cs.AR · February 25, 2026

TL;DR

TOM is a hybrid ROM-SRAM accelerator that uses ternary quantization to enable high-density, energy-efficient, real-time LLM inference on edge devices, easing the memory capacity and bandwidth limits of conventional hardware.

Contribution

The paper introduces a hybrid ROM-SRAM architecture built on ternary quantization, with a sparsity-aware ROM design and workload-aware power management, for edge LLM deployment.
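To make the terms "ternary quantization" and "sparsity-aware" concrete, here is a minimal software sketch. It assumes absmean scaling (the scheme associated with BitNet b1.58), which may differ from what TOM uses, and it models only the arithmetic, not the ROM hardware. The function names `ternarize`, `ternary_dot`, and `sparsity` are hypothetical.

```python
def ternarize(weights):
    """Quantize floats to codes in {-1, 0, +1} with one shared scale.

    Assumption: absmean scaling, chosen for illustration only; the
    listing above does not specify TOM's quantizer.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ternary_dot(codes, scale, activations):
    """Dot product with ternary weights: add, subtract, or skip."""
    acc = 0.0
    for c, x in zip(codes, activations):
        if c == 1:
            acc += x
        elif c == -1:
            acc -= x
        # c == 0 contributes nothing, so zero weights need no work.
    return acc * scale  # a single multiply per output


def sparsity(codes):
    """Fraction of weights that quantized to zero."""
    return codes.count(0) / len(codes)


codes, scale = ternarize([0.9, -0.05, -1.1, 0.02])
result = ternary_dot(codes, scale, [1.0, 2.0, 3.0, 4.0])
```

Because every weight is -1, 0, or +1, the inner loop needs no multiplier, and zero weights can be skipped outright. That is the kind of structure a sparsity-aware design could plausibly exploit.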

Findings

01. Achieves an inference throughput of 3,306 tokens per second (TPS) with the BitNet-2B model.

02. Demonstrates significant energy efficiency improvements.

03. Balances model density and on-device tunability for edge AI applications.

Abstract

The deployment of Large Language Models (LLMs) for real-time intelligence on edge devices is rapidly growing. However, conventional hardware architectures face a fundamental memory wall challenge, where limited on-device memory capacity and bandwidth severely constrain the size of deployable models and their inference speed, while also limiting on-device adaptation. To address this challenge, we propose TOM, a hybrid ROM-SRAM accelerator co-designed with ternary quantization, which balances extreme density with on-device tunability. TOM exploits the synergy between ternary quantization and ROM to achieve extreme memory density and bandwidth, while preserving flexibility through a hybrid ROM-SRAM architecture designed for QLoRA-based tunability. Specifically, we introduce: (1) a sparsity-aware ROM architecture that synthesizes ternary weights as standard-cell logic, eliminating area…
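The split between fixed base weights and tunable parameters can be sketched in software as a frozen ternary matrix plus a small low-rank adapter, in the style of LoRA/QLoRA. This is an illustration only. Treating the ternary matrix as the ROM contents and the adapter as the SRAM contents is an assumption about how the hybrid design divides the work, and `matvec` and `hybrid_forward` are hypothetical names.

```python
def matvec(matrix, vec):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def hybrid_forward(base_codes, base_scale, lora_a, lora_b, x):
    """Compute y = scale * (T @ x) + B @ (A @ x).

    base_codes: frozen ternary matrix T (stand-in for read-only weights)
    lora_a, lora_b: small trainable matrices A and B (stand-in for the
        tunable part; assumed here to be what lives in writable memory)
    """
    base = [base_scale * v for v in matvec(base_codes, x)]
    delta = matvec(lora_b, matvec(lora_a, x))  # low-rank correction
    return [b + d for b, d in zip(base, delta)]


# Toy sizes: 2 outputs, 3 inputs, adapter rank 1.
T = [[1, 0, -1],
     [0, 1, 1]]
A = [[1.0, 0.0, 0.0]]   # rank x inputs
B = [[0.5], [-1.0]]     # outputs x rank
y = hybrid_forward(T, 0.5, A, B, [2.0, 4.0, 6.0])
```

Only `A` and `B` change during adaptation, so the writable storage scales with the adapter rank rather than with the full weight matrix. With `B` set to zeros, the output reduces to the frozen base model.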

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Parallel Computing and Optimization Techniques