QKV Projections Require a Fraction of Their Memory
Malik Khalaf, Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster

TL;DR
This paper introduces PAMM, a tensor compression method that significantly reduces memory usage of Q, K, V projections in multi-head attention, enabling more memory-efficient training of large language models without sacrificing performance.
Contribution
The paper presents PAMM, a tensor compression technique that compresses the stored activations of the Q, K, V projections by up to 512x, lowering memory use in LLM training.
Findings
PAMM compresses Q, K, V activations by up to 512x.
PAMM achieves similar or better perplexity with reduced memory.
PAMM is compatible with efficient attention methods like FlashAttention.
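To give a sense of scale for the 512x figure, here is a minimal back-of-the-envelope sketch. The batch size, sequence length, hidden size, and the 2-byte element width are hypothetical values chosen for illustration, not settings from the paper, and the calculation ignores any bookkeeping overhead a real compression scheme would add.

```python
def activation_bytes(batch, seq_len, hidden, bytes_per_elem=2):
    """Bytes needed to keep one (batch, seq_len, hidden) activation for backward."""
    return batch * seq_len * hidden * bytes_per_elem


def compressed_bytes(full_bytes, ratio):
    """Bytes left after compressing by `ratio` (index/scale overhead ignored)."""
    return full_bytes / ratio


# Hypothetical configuration, chosen only for illustration.
full = activation_bytes(batch=8, seq_len=4096, hidden=4096)
small = compressed_bytes(full, ratio=512)

print(f"stored projection input per layer: {full / 2**20:.1f} MiB")   # 256.0 MiB
print(f"after 512x compression:            {small / 2**20:.2f} MiB")  # 0.50 MiB
```

Under these assumed sizes the saved input drops from 256 MiB to 0.5 MiB per attention layer, which is the sense in which the footprint is nearly erased.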
Abstract
The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training. While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the Q, K, and V tensors from the input is often overlooked. To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that compresses the activations of the Q, K, V projections in attention layers by a factor of up to 512x, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.
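The memory in question exists because a linear projection Y = XW must keep its input X until the backward pass, where the weight gradient is X^T G (G being the gradient with respect to Y). The sketch below illustrates one generic way to compute that gradient from a compressed stand-in for X. It is not taken from the paper: the function names, the random choice of representative rows, and the least-squares scaling rule are all assumptions made for illustration.

```python
import random


def compress(X, k, rng):
    """Keep k randomly sampled rows of X plus, per row, an index and a scale.

    Assumed scheme (not the paper's): each row is replaced by the scaled
    representative that minimizes its squared reconstruction error.
    """
    reps = [X[i] for i in rng.sample(range(len(X)), k)]
    index, scale = [], []
    for row in X:
        best_j, best_a, best_err = 0, 0.0, float("inf")
        for j, c in enumerate(reps):
            cc = sum(v * v for v in c)
            a = sum(u * v for u, v in zip(row, c)) / cc if cc else 0.0
            err = sum((u - a * v) ** 2 for u, v in zip(row, c))
            if err < best_err:
                best_j, best_a, best_err = j, a, err
        index.append(best_j)
        scale.append(best_a)
    return reps, index, scale


def approx_weight_grad(reps, index, scale, G):
    """Approximate X^T G using only the compressed form of X."""
    k, d_in, d_out = len(reps), len(reps[0]), len(G[0])
    # S[j] accumulates scale[i] * G[i] over all rows i assigned to representative j.
    S = [[0.0] * d_out for _ in range(k)]
    for i, g in enumerate(G):
        for o in range(d_out):
            S[index[i]][o] += scale[i] * g[o]
    # grad_W is approximated by reps^T S.
    return [[sum(reps[j][p] * S[j][o] for j in range(k)) for o in range(d_out)]
            for p in range(d_in)]


def exact_weight_grad(X, G):
    """Reference X^T G computed from the full activation."""
    return [[sum(X[i][p] * G[i][o] for i in range(len(X)))
             for o in range(len(G[0]))] for p in range(len(X[0]))]


rng = random.Random(0)
base = [1.0, -2.0, 0.5]
# Four token rows that are all scalar multiples of one direction (fully redundant).
X = [[a * v for v in base] for a in (1.0, 2.0, -3.0, 0.5)]
G = [[rng.uniform(-1, 1) for _ in range(2)] for _ in X]

reps, index, scale = compress(X, k=1, rng=rng)
approx = approx_weight_grad(reps, index, scale, G)
exact = exact_weight_grad(X, G)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print(max_err)  # close to zero here, because the rows are perfectly redundant
```

In this toy case one stored row plus four indices and four scales reproduces the gradient up to rounding. On real activations the rows are only approximately redundant, so the result would be an approximation whose quality depends on k and on how the representatives are chosen.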
Peer Reviews
Decision: ICLR 2026 Poster
The paper presents a clear motivation and strong practical relevance — training large Transformers is often constrained by activation memory, and this work directly addresses a crucial scalability challenge. The proposed method introduces learnable projection mappings that preserve gradient reconstructability, an elegant and theoretically grounded idea that distinguishes itself from prior techniques such as activation checkpointing or reversible layers. The empirical results are convincing, demo
While the proposed PAMM framework is conceptually sound and practically motivated, the experimental evaluation is somewhat limited. The current experiments are mainly conducted on mid-scale Transformer models, and it remains unclear how the method scales to very large architectures (e.g., >1B parameters) or longer sequence settings. Moreover, the ablation analysis on the projection dimension ratio and per-layer compression strategies is rather sparse—more systematic studies could strengthen the
* The paper is well written, clearly structured, and easy to follow.
* The proposed method is supported by solid theoretical analysis.
* The approach effectively reduces training-time memory consumption while maintaining, or even slightly improving, model accuracy with minimal degradation.

* **Limited profiling on sequence redundancy:** The paper offers initial empirical evidence of sequence-axis redundancy (Appendix F: PCA clustering; relative error and coverage), but the analysis is confined to a narrow slice (one layer/model/step). A broader study would better ground the motivation.
* **Missing complexity/scaling analysis:** While the runtime breakdown is helpful (Table 2), the paper lacks an explicit complexity analysis of the compression and approximate multiply and a scalin
- The proposed method identifies the redundancy in the sequence dimension, which is typically enormous in modern LLM training.
- Outperforms other methods like CompAct and Uniform-CRS in memory-performance tradeoffs.

- The experiments are limited to small and medium-sized LLMs. Scaling up models can introduce performance degradation of the approximation method.
- Activations are dynamic during training. Random sampling is slow, particularly on GPUs. Note that random sampling requires a pseudo-random number generator, such as an LFSR or others. Most algorithms in this family are sequential and model generation from an irreducible polynomial over a Galois field. The authors should check the intrinsic details of cuRANDDx.
