AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
Kosuke Matsushima, Yasuyuki Okoshi, Masato Motomura, Daichi Fujiki

TL;DR
AQPIM introduces a PIM-aware activation quantization method using Product Quantization to reduce memory and computation bottlenecks in large language models, enabling efficient in-memory processing.
Contribution
This work presents AQPIM, a novel activation quantization framework tailored for PIM architectures, improving efficiency and accuracy for large language models.
Findings
AQPIM reduces GPU-CPU communication by up to 98.5%.
AQPIM achieves a 3.4× speedup over state-of-the-art PIM methods.
AQPIM significantly cuts the memory footprint and computational overhead of activations.
Abstract
Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM's limited memory capacity, while techniques like sparse attention can conflict with PIM's need for data locality. Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM's internal bandwidth capabilities.…
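To make the idea of clustering-based vector quantization concrete, here is a minimal sketch of generic Product Quantization applied to a block of key vectors. It is not AQPIM's actual algorithm or PIM mapping, which the truncated abstract does not describe. The function names (`kmeans`, `pq_encode`, `pq_decode`, `pq_scores`), the sub-vector count, the codebook size, and the random data standing in for a KV cache are all assumptions chosen for illustration.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(0)
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d.argmin(1)

def pq_encode(x, n_sub, k):
    """Split each vector into n_sub sub-vectors and cluster each slice separately."""
    n, dim = x.shape
    assert dim % n_sub == 0 and k <= 256  # codes are stored as uint8
    sub = dim // n_sub
    codebooks = np.empty((n_sub, k, sub), dtype=x.dtype)
    codes = np.empty((n, n_sub), dtype=np.uint8)
    for s in range(n_sub):
        c, a = kmeans(x[:, s * sub:(s + 1) * sub], k, seed=s)
        codebooks[s] = c
        codes[:, s] = a
    return codebooks, codes

def pq_decode(codebooks, codes):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    parts = [codebooks[s][codes[:, s]] for s in range(codebooks.shape[0])]
    return np.concatenate(parts, axis=1)

def pq_scores(query, codebooks, codes):
    """Approximate query-key dot products using per-sub-space lookup tables."""
    n_sub, k, sub = codebooks.shape
    q = query.reshape(n_sub, sub)
    # lut[s, j] = dot(query slice s, centroid j of sub-space s)
    lut = np.einsum("sd,skd->sk", q, codebooks)
    # Gather one table entry per (key, sub-space) and sum over sub-spaces.
    return lut[np.arange(n_sub), codes].sum(1)

# Hypothetical stand-in for cached keys: 512 tokens, head dimension 64.
rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

codebooks, codes = pq_encode(keys, n_sub=8, k=16)
approx_scores = pq_scores(query, codebooks, codes)
ratio = keys.nbytes / (codes.nbytes + codebooks.nbytes)
print(f"compression ratio vs. float32: {ratio:.1f}x")
```

In this sketch the attention scores are computed from small lookup tables and one-byte codes rather than from the full-precision keys, which is the general property that makes such schemes attractive when capacity and data movement are the bottleneck. How well the codes preserve accuracy depends on the data; the random vectors used here are not representative of real activations.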
