QKV Projections Require a Fraction of Their Memory

Malik Khalaf; Yara Shamshoum; Nitzan Hodos; Yuval Sieradzki; Assaf Schuster

arXiv:2506.02939·cs.LG·March 3, 2026

TL;DR

This paper introduces PAMM (Point-Approximate Matrix Multiplication), a tensor compression method that shrinks the stored activations of the Q, K, V projections in multi-head attention by up to 512x, enabling more memory-efficient training of large language models at similar or better final perplexity.

Contribution

The paper presents PAMM, a tensor compression technique that reduces the memory held for the Q, K, V projection activations by up to 512x during LLM training.

Findings

01 PAMM compresses Q, K, V activations by up to 512x.

02 PAMM achieves similar or better perplexity with reduced memory.

03 PAMM is compatible with efficient attention methods like FlashAttention.
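
To give a sense of scale for the 512x figure, the arithmetic below estimates how much memory the saved input of the Q, K, V projections occupies with and without compression. The batch size, sequence length, model width, layer count, and bf16 storage are hypothetical values chosen for illustration, not settings from the paper, and the compressed figure is idealized: it ignores any bookkeeping a real scheme would store alongside the compressed tensor.

```python
def activation_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to keep one (tokens x hidden) activation for the backward pass."""
    return tokens * hidden * bytes_per_elem


# Hypothetical training configuration (assumed for illustration, not from the paper).
tokens = 8 * 2048   # 8 sequences of 2048 tokens each
hidden = 4096       # model width
layers = 32         # attention layers, each saving the shared input x of Q, K, V

full = activation_bytes(tokens, hidden) * layers
compressed = full // 512  # idealized 512x compression, no index/scale overhead

print(f"uncompressed: {full / 2**30:.2f} GiB")   # uncompressed: 4.00 GiB
print(f"at 512x:      {compressed / 2**20:.2f} MiB")  # at 512x:      8.00 MiB
```

Under these assumed sizes, the stored projection inputs drop from gigabytes to a few megabytes, which is the sense in which the abstract speaks of the footprint being effectively erased.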

Abstract

The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training. While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the $Q$, $K$, and $V$ tensors from the input $x$ is often overlooked. To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that compresses the activations of the $Q, K, V$ projections in attention layers by a factor of up to $512\times$, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.
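
The abstract does not spell out the algorithm, so the following is only a minimal NumPy sketch of one plausible reading of "point-approximate" multiplication, not the authors' method. It assumes that the backward pass of a projection needs the product `x.T @ grad_out`, that a few randomly sampled rows of `x` are kept as representatives, and that every other row is replaced by a scalar multiple of its best-matching representative. The function names `compress_rows` and `approx_weight_grad` and the toy data are hypothetical.

```python
import numpy as np


def compress_rows(x, k, rng):
    """Keep k sampled rows of x; describe each row as scale * one kept row."""
    n = x.shape[0]
    keep = rng.choice(n, size=k, replace=False)   # indices of representative rows
    reps = x[keep]                                # (k, d)
    rep_norm2 = np.sum(reps * reps, axis=1) + 1e-12
    proj = x @ reps.T                             # (n, k) inner products
    # Projecting row i on representative j removes proj^2 / |rep_j|^2 of its energy.
    assign = np.argmax(proj**2 / rep_norm2, axis=1)
    scale = proj[np.arange(n), assign] / rep_norm2[assign]  # least-squares scale
    return reps, assign, scale


def approx_weight_grad(reps, assign, scale, grad_out):
    """Approximate x.T @ grad_out from the compressed description of x."""
    pooled = np.zeros((reps.shape[0], grad_out.shape[1]), dtype=grad_out.dtype)
    # Sum the scaled output gradients of all rows sharing a representative.
    np.add.at(pooled, assign, scale[:, None] * grad_out)
    return reps.T @ pooled


rng = np.random.default_rng(0)
n, d, m, k = 512, 32, 24, 8
# Toy redundant activations: each row is a scaled copy of one of 4 directions.
directions = rng.normal(size=(4, d))
x = rng.uniform(0.5, 2.0, size=(n, 1)) * directions[rng.integers(0, 4, size=n)]
grad_out = rng.normal(size=(n, m))

reps, assign, scale = compress_rows(x, k, rng)
exact = x.T @ grad_out
approx = approx_weight_grad(reps, assign, scale, grad_out)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"stored rows: {k} of {n}, relative error: {rel_err:.3e}")
```

In this sketch only `reps`, `assign`, and `scale` would need to be saved for the backward pass, so the saving comes from storing `k` full rows instead of `n`, plus one index and one scalar per row. The approximation is exact when every row is a multiple of some kept row and degrades as the rows become less redundant.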

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 3

Strengths

The paper presents a clear motivation and strong practical relevance — training large Transformers is often constrained by activation memory, and this work directly addresses a crucial scalability challenge. The proposed method introduces learnable projection mappings that preserve gradient reconstructability, an elegant and theoretically grounded idea that distinguishes itself from prior techniques such as activation checkpointing or reversible layers. The empirical results are convincing, demo

Weaknesses

While the proposed PAMM framework is conceptually sound and practically motivated, the experimental evaluation is somewhat limited. The current experiments are mainly conducted on mid-scale Transformer models, and it remains unclear how the method scales to very large architectures (e.g., >1B parameters) or longer sequence settings. Moreover, the ablation analysis on the projection dimension ratio and per-layer compression strategies is rather sparse—more systematic studies could strengthen the

Reviewer 02·Rating 4·Confidence 5

Strengths

* The paper is well written, clearly structured, and easy to follow.
* The proposed method is supported by solid theoretical analysis.
* The approach effectively reduces training-time memory consumption while maintaining, or even slightly improving, model accuracy with minimal degradation.

Weaknesses

* **Limited profiling on sequence redundancy:** The paper offers initial empirical evidence of sequence-axis redundancy (Appendix F: PCA clustering; relative error and coverage), but the analysis is confined to a narrow slice (one layer/model/step). A broader study would better ground the motivation.
* **Missing complexity/scaling analysis:** While the runtime breakdown is helpful (Table 2), the paper lacks an explicit complexity analysis of the compression and approximate multiply and a scalin

Reviewer 03·Rating 6·Confidence 3

Strengths

- The proposed method identifies the redundancy in the sequence dimension, which is typically enormous in modern LLM training.
- Outperforms other methods like CompAct and Uniform-CRS in memory-performance tradeoffs.

Weaknesses

- The experiments are limited to small and medium-sized LLMs. Scaling up models can introduce performance degradation of the approximation method.
- Activations are dynamic during training. Random sampling is slow, particularly on GPUs. Note that random sampling requires a pseudo-random number generator, such as an LFSR. Most algorithms in this family are sequential and model generation from an irreducible polynomial over a Galois field. The authors should check the intrinsic details of cuRANDDx.

