TL;DR
This paper introduces Tensor Product Attention (TPA), a memory-efficient attention mechanism using tensor decompositions, enabling longer sequence processing in language models without sacrificing performance.
Contribution
The paper proposes TPA, a novel attention method that reduces KV cache memory overhead and integrates with rotary position embeddings (RoPE). Building on TPA, it introduces a new architecture, T6, that matches or outperforms standard Transformer baselines on language modeling tasks.
Findings
T6 matches or surpasses standard Transformer baselines in language modeling tasks.
TPA significantly reduces KV cache size, enabling longer sequence processing.
T6 maintains competitive performance while improving memory and computational efficiency.
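To make the KV cache claim concrete, here is a rough back-of-the-envelope sketch. It assumes, and this is not stated in the summary above, that a factorized cache stores a small number of rank-1 terms per token for keys and for values, each term being one vector over heads plus one vector over the head dimension. The function name, head count, head size, and ranks are illustrative choices, not values taken from the paper.

```python
def kv_cache_values_per_token(num_heads, head_dim, rank_k=None, rank_v=None):
    """Count cached numbers per token per layer (hypothetical helper).

    With no ranks given, this is a standard multi-head cache that stores a
    full key and a full value vector for every head. With ranks given, it
    assumes each rank-1 term stores one factor of length num_heads and one
    factor of length head_dim.
    """
    if rank_k is None or rank_v is None:
        return 2 * num_heads * head_dim
    return (rank_k + rank_v) * (num_heads + head_dim)


# Illustrative sizes only (assumed, not from the paper).
full = kv_cache_values_per_token(num_heads=32, head_dim=128)
factored = kv_cache_values_per_token(num_heads=32, head_dim=128, rank_k=2, rank_v=2)
ratio = full / factored

print(full, factored, ratio)
```

Under these assumed sizes the full cache holds 8192 values per token and the factorized one holds 640, so the saving grows with the number of heads and the head dimension as long as the ranks stay small.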
Abstract
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines including…
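The factorization idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each token's key matrix (heads by head dimension) is an average of `rank` outer products whose factors are linear functions of that token's hidden state, and it omits RoPE entirely. The names `factorized_keys`, `w_a`, and `w_b`, the 1/rank scaling, and all sizes are hypothetical.

```python
import numpy as np


def factorized_keys(x, w_a, w_b, num_heads, head_dim, rank):
    """Build per-token keys from contextual low-rank factors (sketch only).

    x:   (seq, d_model) token hidden states
    w_a: (d_model, rank * num_heads) projection for the head-side factors
    w_b: (d_model, rank * head_dim)  projection for the dimension-side factors
    Returns the factors a, b and the reconstructed keys k.
    """
    seq = x.shape[0]
    a = (x @ w_a).reshape(seq, rank, num_heads)  # depends on the token: "contextual"
    b = (x @ w_b).reshape(seq, rank, head_dim)
    # Sum of outer products a_r (x) b_r over the rank axis, averaged (assumed scaling).
    k = np.einsum("trh,trd->thd", a, b) / rank
    return a, b, k


rng = np.random.default_rng(0)
seq, d_model, num_heads, head_dim, rank = 5, 16, 4, 8, 2  # toy sizes
x = rng.standard_normal((seq, d_model))
w_a = rng.standard_normal((d_model, rank * num_heads))
w_b = rng.standard_normal((d_model, rank * head_dim))

a, b, k = factorized_keys(x, w_a, w_b, num_heads, head_dim, rank)

# Only the factors would need to be cached; the full keys can be rebuilt on demand.
cached = a.size + b.size
full = k.size
```

In this toy setup the cached factors are smaller than the full key tensor, and each token's key matrix has rank at most `rank` by construction. The same pattern would apply to values.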
Taxonomy
Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer
