DECA: A Near-Core LLM Decompression Accelerator Grounded on a 3D Roofline Model

Gerasimos Gerogiannis (Intel Corporation; University of Illinois at Urbana-Champaign); Stijn Eyerman (Intel Corporation); Evangelos Georganas (Intel Labs); Wim Heirman (Intel Corporation); Josep Torrellas (University of Illinois at Urbana-Champaign)

arXiv:2505.19349 · cs.AR · August 11, 2025

TL;DR

This paper introduces DECA, a near-core accelerator that takes over the decompression of quantized and sparsified LLM weight tiles, work currently done in software with vector operations. Its design is guided by a 3D Roofline performance model.

Contribution

The paper presents a novel analytical 3D Roofline model for understanding GeMM performance, and a dedicated hardware accelerator with ISA extensions for efficient decompression.

Findings

1. DECA accelerates compressed GeMMs by up to 4x.

2. DECA reduces next-token generation time for Llama2-70B and OPT-66B by 1.6x-2.6x.

3. The paper provides insight into how memory resources, vector units, and hardware matrix engines interact during LLM inference.

Abstract

To alleviate the memory bandwidth bottleneck in Large Language Model (LLM) inference workloads, weight matrices are stored in memory in quantized and sparsified formats. Hence, before tiles of these matrices can be processed by in-core generalized matrix multiplication (GeMM) hardware engines, they need to be dequantized and de-sparsified. This is currently performed in software with vector operations. Unfortunately, this approach delivers only modest performance. Moreover, it is hard to understand how to improve the system, as the overall GeMM performance depends on the interaction between memory resources, vector units, and hardware matrix engines. To improve the performance of LLM inference in advanced platforms equipped with in-core GeMM engines and HBM, this paper makes three main contributions. First, it develops an analytical performance model with a 3D visual representation…
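
The decompression step the abstract describes (de-sparsifying and dequantizing a weight tile before it reaches the matrix engine) can be illustrated with a minimal Python sketch. The storage layout here is an assumption made for illustration, not the format used in the paper: nonzero weights are kept as small integers, a bitmask marks their positions in the dense tile, and a single hypothetical `scale` and `zero_point` map integers back to real values.

```python
def decompress_tile(values, bitmask, scale, zero_point=0):
    """Expand a sparse, quantized tile into a dense list of floats.

    Assumed (hypothetical) layout, not taken from the paper:
      values   - quantized integers for the nonzero positions only
      bitmask  - one flag per dense position, 1 where a value is stored
      scale, zero_point - affine dequantization parameters for the tile
    """
    dense = []
    stored = iter(values)
    for bit in bitmask:
        if bit:
            # De-sparsify: consume the next stored value, then dequantize it.
            q = next(stored)
            dense.append((q - zero_point) * scale)
        else:
            # Pruned position: the dense tile holds an explicit zero.
            dense.append(0.0)
    return dense


tile = decompress_tile(values=[3, -2, 7], bitmask=[1, 0, 0, 1, 1, 0], scale=0.5)
print(tile)  # [1.5, 0.0, 0.0, -1.0, 3.5, 0.0]
```

In this sketch only three integers and a six-bit mask are read from memory, while the GeMM engine receives six dense values. That is the memory-traffic saving the abstract refers to, and the per-element loop stands in for the work currently done in software with vector operations.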

Taxonomy

Topics: Medical Imaging Techniques and Applications · Radiation Detection and Scintillator Technologies · Parallel Computing and Optimization Techniques