Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM
Zehao Fan, Garrett Gagnon, Zhenyu Liu, Liu Liu

TL;DR
This paper introduces STARC, a clustering-based data mapping scheme that enhances the efficiency of large language model decoding on processing-in-memory architectures by reducing latency and energy consumption.
Contribution
STARC is a novel sparsity-aware data mapping approach that clusters KV pairs for PIM, improving workload balance and decoding efficiency in LLMs.
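The idea of clustering KV pairs so that related entries sit together in memory can be illustrated with a small sketch. This is not the authors' implementation: the use of plain k-means on normalized keys, the fixed `tokens_per_row` row size, and all function names below are assumptions made for illustration. It groups keys by cosine similarity, lays each cluster out in a contiguous run of slots, and counts how many hypothetical memory rows a cluster-level selection touches before and after remapping.

```python
import numpy as np

def kmeans_cosine(keys, n_clusters, n_iters=10, seed=0):
    """Plain k-means on L2-normalized keys (cosine similarity).

    Returns (labels, centroids). Illustrative only; the clustering
    procedure used in the paper may differ.
    """
    rng = np.random.default_rng(seed)
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    centroids = unit[rng.choice(len(unit), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        labels = np.argmax(unit @ centroids.T, axis=1)
        for c in range(n_clusters):
            members = unit[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                mean = members.mean(axis=0)
                centroids[c] = mean / (np.linalg.norm(mean) + 1e-12)
    labels = np.argmax(unit @ centroids.T, axis=1)
    return labels, centroids

def clustered_order(labels):
    """Token indices sorted so each cluster occupies a contiguous run of slots."""
    return np.argsort(labels, kind="stable")

def rows_touched(slots, tokens_per_row):
    """Number of distinct (hypothetical) memory rows covering the given slots."""
    return len(np.unique(np.asarray(slots) // tokens_per_row))

# Toy data: 256 cached keys of dimension 32 and one query (random, assumed).
rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 32))
query = rng.standard_normal(32)

labels, centroids = kmeans_cosine(keys, n_clusters=16)
order = clustered_order(labels)

# Inverse permutation: slot_of_token[t] is where token t lives after remapping.
slot_of_token = np.empty_like(order)
slot_of_token[order] = np.arange(len(order))

# Select whole clusters by query-centroid similarity (top 4 of 16, assumed).
chosen = np.argsort(centroids @ query)[-4:]
selected_tokens = np.flatnonzero(np.isin(labels, chosen))

# Rows touched with tokens in arrival order vs. in cluster-contiguous order.
baseline_rows = rows_touched(selected_tokens, tokens_per_row=8)
clustered_rows = rows_touched(slot_of_token[selected_tokens], tokens_per_row=8)
print(baseline_rows, clustered_rows)
```

In the cluster-contiguous layout, each selected cluster spans a bounded number of rows, whereas in arrival order the same tokens can be scattered across many rows. That scattering is the kind of irregular access pattern the abstract identifies as a source of workload imbalance on PIM.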
Findings
STARC reduces attention-layer latency by up to 31%.
Energy consumption during decoding is decreased by up to 27%.
STARC achieves a 54%-74% latency reduction while maintaining comparable accuracy.
Abstract
Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure on memory systems due to frequent memory accesses and growing key-value (KV) caches. This creates a bottleneck in memory bandwidth, especially as context lengths increase. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques. Consequently, they suffer from workload imbalance, reducing throughput and resource utilization. In this work, we propose STARC, a novel sparsity-optimized data mapping scheme tailored specifically for efficient…
