31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

Pingcheng Dong; Yonghao Tan; Xuejiao Liu; Peng Luo; Yu Liu; Di Pang; Songchen Ma; Xijie Huang; Shih-Yang Liu; Dong Zhang; Zhichao Lu; Luhong Liang; Chi-Ying Tsui; Fengbin Tu; Liang Zhao; Kwang-Ting Cheng

arXiv:2605.09375 · cs.AR · May 12, 2026

TL;DR

This paper presents a 55nm LLM accelerator built on face-to-face ReRAM-on-logic stacking. It combines rotation-based outlier-free low-bit quantization, blockwise vector quantization of weights, and adaptive parallel speculative decoding to reach 14.08 to 135.69 token/s.

Contribution

It presents a 55nm speculative-decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking, a local rotation unit for outlier-free low-bit quantization, and an adaptive parallel speculative decoding scheme driven by an out-of-order scheduler.

Findings

01

Achieves a throughput of 14.08 to 135.69 token/s.

02

Delivers a 4.46-to-7.17x speedup over vanilla speculative decoding.

03

Reaches high resource and bandwidth utilization through adaptive parallel speculative decoding with an out-of-order scheduler.
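
For readers unfamiliar with the baseline named in the second finding, the following is a minimal sketch of vanilla speculative decoding in its greedy form: a small draft model proposes k tokens, the target model checks them, and the longest agreeing prefix is kept plus one token from the target. It does not reproduce the paper's adaptive parallel scheme or its out-of-order scheduler. All names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) and the toy models are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function (illustrative assumption).
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target model verifies each proposed position.
    #    (Real systems do this in one batched forward pass; the loop is for clarity.)
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_new tokens; returns (sequence, number of verify rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[: len(prompt) + n_new], rounds


# Toy models: the target counts modulo 10; the draft agrees except after a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy verification the output is identical to decoding with the target model alone; the gain comes from needing fewer target rounds than generated tokens whenever the draft is often right.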

Abstract

This work presents a 55nm speculative decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware processing-near-memory (PNM) architecture co-designed with blockwise vector quantization to reduce weight external-memory-access (EMA) overheads, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler for high resource and bandwidth utilization. Our chip achieves 14.08-to-135.69 token/s and a 4.46-to-7.17x speedup over vanilla speculative decoding.
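
To illustrate why a rotation can make low-bit quantization "outlier-free", here is a minimal sketch that rotates a block with an orthonormal Hadamard transform before quantizing it. This is an assumption for illustration only: the abstract does not say which rotation the local rotation unit applies, and the function names, block size, and 4-bit setting below are hypothetical.

```python
import math
from typing import List, Tuple


def hadamard_rotate(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    The normalized transform is its own inverse, so the same function undoes it.
    """
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]


def quantize(x: List[float], bits: int = 4) -> List[float]:
    """Symmetric uniform quantization with one scale per block, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in x)
    scale = peak / qmax if peak > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def compare(x: List[float], bits: int = 4) -> Tuple[float, float]:
    """Return (error of direct quantization, error of rotate-quantize-unrotate)."""
    direct = quantize(x, bits)
    rotated = hadamard_rotate(quantize(hadamard_rotate(x), bits))
    return mse(x, direct), mse(x, rotated)


# A block of moderate values with one large outlier at index 3.
block = [0.5, -0.4, 0.45, -0.5] * 4
block[3] = 8.0
direct_err, rotated_err = compare(block)
```

In this example the outlier sets the quantization scale, so direct 4-bit quantization rounds every other value to zero. The rotation spreads the outlier's energy across all 16 coordinates, which shrinks the scale and lowers the reconstruction error.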
