End-to-End Transformer Acceleration Through Processing-in-Memory Architectures

Xiaoxuan Yang, Peilin Chen, Tergel Molom-Ochir, and Yiran Chen

arXiv:2601.14260 · cs.AR · January 22, 2026

Open Access

TL;DR

This paper proposes processing-in-memory architectures that accelerate Transformer models by reducing data movement, managing KV cache growth, and lowering the computational complexity of attention, improving energy efficiency and reducing latency.

Contribution

The paper introduces processing-in-memory techniques for Transformers that restructure attention and feed-forward computation to minimize off-chip data transfers, and that address both KV cache growth and the quadratic complexity of attention.

Findings

01 Significant energy efficiency improvements over GPUs and accelerators

02 Reduced latency in Transformer inference tasks

03 Effective management of key-value cache growth

Abstract

Transformers have become central to natural language processing and large language models, but their deployment at scale faces three major challenges. First, the attention mechanism requires massive matrix multiplications and frequent movement of intermediate results between memory and compute units, leading to high latency and energy costs. Second, in long-context inference, the key-value cache (KV cache) can grow unpredictably and even surpass the model's weight size, creating severe memory and bandwidth bottlenecks. Third, the quadratic complexity of attention with respect to sequence length amplifies both data movement and compute overhead, making large-scale inference inefficient. To address these issues, this work introduces processing-in-memory solutions that restructure attention and feed-forward computation to minimize off-chip data transfers, dynamically compress and prune the…
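To make the second and third challenges in the abstract concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the model shape (32 layers, 32 KV heads, head dimension 128), the 7e9-parameter weight count, and fp16 storage are all hypothetical values chosen for illustration, and the function names are made up for this example. It estimates how the KV cache grows linearly with sequence length and how the number of attention scores grows quadratically.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Leading factor of 2: one tensor for keys, one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def attention_score_count(seq_len, num_heads, num_layers):
    """Number of query-key scores for full (unmasked) self-attention."""
    return num_layers * num_heads * seq_len * seq_len


# Hypothetical model shape, chosen only for illustration (not from the paper).
LAYERS, HEADS, HEAD_DIM = 32, 32, 128
WEIGHT_BYTES = 7_000_000_000 * 2  # assumed 7e9 parameters stored in fp16

per_token = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len=1)
break_even_tokens = WEIGHT_BYTES // per_token

if __name__ == "__main__":
    print(f"KV cache per token: {per_token} bytes")
    print(f"Cache matches weight size after about {break_even_tokens} tokens")
```

Under these assumed numbers the cache costs 512 KiB per token, so a single sequence reaches the size of the weights at roughly 26.7 thousand tokens, and batching several sequences reaches it proportionally sooner. Doubling the sequence length doubles the cache but quadruples the attention score count.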

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy