TOM: A Ternary Read-only Memory Accelerator for LLM-powered Edge Intelligence
Hongyi Guan, Yijia Zhang, Wenqiang Wang, Yizhao Gao, Shijie Cao, Chen Zhang, Ningyi Xu

TL;DR
TOM is a novel hybrid ROM-SRAM accelerator leveraging ternary quantization to enable high-density, energy-efficient, real-time LLM inference on edge devices, overcoming memory and bandwidth limitations.
Contribution
TOM introduces a hybrid ROM-SRAM architecture built on ternary quantization, with a sparsity-aware ROM design and workload-aware power management for edge LLM deployment.
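To make "ternary quantization" and "sparsity-aware" concrete, here is a minimal sketch of one common ternarization rule: scale by the mean absolute weight, then round and clip to {-1, 0, +1}. This absmean rule is the one used by BitNet b1.58-style models; the summary does not state which rule TOM assumes, so treat it as an assumption. The function names `ternarize` and `zero_fraction` and the sample weights are hypothetical.

```python
def ternarize(weights):
    """Quantize floats to {-1, 0, +1} using one shared absmean scale.

    Assumption: absmean scaling (BitNet b1.58 style); the paper summary
    does not specify the quantization rule.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into the ternary range.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def zero_fraction(q):
    """Fraction of weights that quantized to 0 (the sparsity a ROM design could exploit)."""
    return q.count(0) / len(q)


# Illustrative weights only, not taken from any real model.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, 0.0, -0.6, 0.1]
q, scale = ternarize(weights)
print(q)                 # [1, 0, 1, -1, 0, 0, -1, 0]
print(zero_fraction(q))  # 0.5
```

In this toy example half of the weights quantize to zero. Zero-valued weights contribute nothing to a dot product, which is the kind of sparsity a sparsity-aware ROM layout could plausibly exploit.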
Findings
Achieves an inference throughput of 3,306 tokens per second (TPS) with the BitNet-2B model.
Demonstrates significant energy efficiency improvements.
Balances model density and tunability for edge AI applications.
Abstract
The deployment of Large Language Models (LLMs) for real-time intelligence on edge devices is rapidly growing. However, conventional hardware architectures face a fundamental memory wall challenge, where limited on-device memory capacity and bandwidth severely constrain the size of deployable models and their inference speed, while also limiting on-device adaptation. To address this challenge, we propose TOM, a hybrid ROM-SRAM accelerator co-designed with ternary quantization, which balances extreme density with on-device tunability. TOM exploits the synergy between ternary quantization and ROM to achieve extreme memory density and bandwidth, while preserving flexibility through a hybrid ROM-SRAM architecture designed for QLoRA-based tunability. Specifically, we introduce: (1) a sparsity-aware ROM architecture that synthesizes ternary weights as standard-cell logic, eliminating area…
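The hybrid idea described in the abstract can be sketched in software terms as a frozen ternary base layer plus a small trainable low-rank adapter, in the spirit of QLoRA. The mapping used here (ternary base weights playing the role of ROM contents, adapter matrices `A` and `B` playing the role of SRAM contents) is an assumption for illustration, not the paper's confirmed dataflow. All function names and values are hypothetical.

```python
def ternary_matvec(T, scale, x):
    """Frozen base path: weights are in {-1, 0, +1}, so no multiplies are
    needed per weight, only adds, subtracts, and skips."""
    out = []
    for row in T:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
            # t == 0 contributes nothing and is skipped.
        out.append(scale * acc)
    return out


def matvec(M, x):
    """Plain dense matrix-vector product for the small adapter matrices."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def hybrid_forward(T, scale, A, B, x):
    """y = scale * (T @ x) + B @ (A @ x): fixed ternary base plus tunable low-rank update."""
    base = ternary_matvec(T, scale, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


# Illustrative values only.
T = [[1, 0, -1], [0, 1, 1]]   # frozen ternary weights (2 x 3)
A = [[0.5, 0.0, 0.0]]         # adapter down-projection (rank 1, 1 x 3)
B = [[0.0], [2.0]]            # adapter up-projection (2 x 1)
x = [2.0, 4.0, 6.0]

y = hybrid_forward(T, 0.5, A, B, x)
print(y)  # [-2.0, 7.0]
```

Setting `B` to all zeros recovers the base model's output exactly, so on-device tuning only has to update the small adapter matrices while the dense base weights stay fixed.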
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Parallel Computing and Optimization Techniques
