End-to-End Transformer Acceleration Through Processing-in-Memory Architectures
Xiaoxuan Yang, Peilin Chen, Tergel Molom-Ochir, and Yiran Chen

TL;DR
This paper proposes processing-in-memory architectures that accelerate Transformer models by reducing data movement, containing KV cache memory growth, and lowering the computational complexity of attention, which improves energy efficiency and reduces latency.
Contribution
The paper introduces processing-in-memory techniques for Transformers that restructure attention and feed-forward computation to cut off-chip data transfers, keep KV cache memory growth in check, and reduce the complexity of attention.
Findings
Significant energy efficiency improvements over GPUs and accelerators
Reduced latency in Transformer inference tasks
Effective management of key-value cache growth
Abstract
Transformers have become central to natural language processing and large language models, but their deployment at scale faces three major challenges. First, the attention mechanism requires massive matrix multiplications and frequent movement of intermediate results between memory and compute units, leading to high latency and energy costs. Second, in long-context inference, the key-value cache (KV cache) can grow unpredictably and even surpass the model's weight size, creating severe memory and bandwidth bottlenecks. Third, the quadratic complexity of attention with respect to sequence length amplifies both data movement and compute overhead, making large-scale inference inefficient. To address these issues, this work introduces processing-in-memory solutions that restructure attention and feed-forward computation to minimize off-chip data transfers, dynamically compress and prune the…
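To make the second and third challenges concrete, the sketch below estimates KV cache size and attention matmul cost for a hypothetical decoder. The model shape (32 layers, 32 heads, head dimension 128, about 7 billion fp16 parameters) is an assumption chosen for illustration and is not taken from the paper. The formulas are common back-of-envelope estimates, not the authors' cost model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and every token."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def attention_matmul_flops(n_layers, n_heads, head_dim, seq_len):
    """Approximate FLOPs for QK^T plus scores-times-V over a full sequence."""
    # Each of the two matmuls costs about 2 * seq_len^2 * head_dim FLOPs per head.
    per_head = 2 * (2 * seq_len * seq_len * head_dim)
    return n_layers * n_heads * per_head


# Hypothetical model shape (assumption, not from the paper).
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128
WEIGHT_BYTES = 7_000_000_000 * 2  # about 7B parameters stored in fp16

for seq_len in (4_096, 32_768):
    cache = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, seq_len)
    flops = attention_matmul_flops(N_LAYERS, N_HEADS, HEAD_DIM, seq_len)
    print(f"seq_len={seq_len}: cache/weights={cache / WEIGHT_BYTES:.2f}, "
          f"attention matmul FLOPs={flops:.3e}")
```

Under these assumptions each cached token costs 512 KiB, so a 32,768-token context needs 16 GiB of cache, more than the roughly 14 GB of fp16 weights. Doubling the sequence length doubles the cache but quadruples the attention matmul cost.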
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
