AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

Kosuke Matsushima; Yasuyuki Okoshi; Masato Motomura; Daichi Fujiki

arXiv:2604.18137 · cs.AR · April 21, 2026

TL;DR

AQPIM introduces a PIM-aware activation quantization method using Product Quantization to reduce memory and computation bottlenecks in large language models, enabling efficient in-memory processing.

Contribution

This work presents AQPIM, a novel activation quantization framework tailored for PIM architectures, improving efficiency and accuracy for large language models.

Findings

01. AQPIM reduces GPU-CPU communication by up to 98.5%.

02. AQPIM achieves a 3.4× speedup over state-of-the-art PIM methods.

03. AQPIM significantly cuts memory footprint and computational overhead.

Abstract

Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM's limited memory capacity, while techniques like sparse attention can conflict with PIM's need for data locality. Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM's internal bandwidth capabilities.…
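
To make the idea of clustering-based vector quantization concrete, here is a minimal NumPy sketch of generic Product Quantization applied to a synthetic stand-in for a key cache. It is not the authors' implementation: the helper names (`kmeans`, `pq_train`, `pq_encode`, `pq_decode`), the random data, and the parameters (8 subspaces, 16 centroids each) are all assumptions chosen for illustration. Each vector is split into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and only the indices plus the small codebooks need to be stored.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns centroids of shape (k, d)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def pq_train(x, num_subspaces, num_centroids):
    """Learn one codebook per subspace; returns an array of shape (M, K, d/M)."""
    n, d = x.shape
    assert d % num_subspaces == 0, "dimension must split evenly into subspaces"
    sub = d // num_subspaces
    return np.stack([
        kmeans(x[:, m * sub:(m + 1) * sub], num_centroids, seed=m)
        for m in range(num_subspaces)
    ])


def pq_encode(x, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    num_subspaces, num_centroids, sub = codebooks.shape
    assert num_centroids <= 256, "codes are stored as uint8"
    codes = np.empty((x.shape[0], num_subspaces), dtype=np.uint8)
    for m in range(num_subspaces):
        chunk = x[:, m * sub:(m + 1) * sub]
        d2 = ((chunk[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d2.argmin(axis=1)
    return codes


def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    return np.concatenate(
        [codebooks[m][codes[:, m]] for m in range(codebooks.shape[0])], axis=1
    )


# Hypothetical stand-in for a key cache: 512 tokens, head dimension 64.
rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 64)).astype(np.float32)

codebooks = pq_train(keys, num_subspaces=8, num_centroids=16)
codes = pq_encode(keys, codebooks)
approx = pq_decode(codes, codebooks)

original_bytes = keys.nbytes
compressed_bytes = codes.nbytes + codebooks.nbytes
mse = float(np.mean((keys - approx) ** 2))
print(f"bytes: {original_bytes} -> {compressed_bytes}, reconstruction MSE: {mse:.4f}")
```

In this toy setting each 64-dimensional float32 vector shrinks to eight one-byte codes, at the cost of a reconstruction error that depends on how well the data clusters. Random Gaussian data is close to a worst case for clustering, so this sketch says nothing about the accuracy or speedups reported for AQPIM.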
