Multi-head Temporal Latent Attention
Keqi Deng, Philip C. Woodland

TL;DR
This paper introduces Multi-head Temporal Latent Attention (MTLA), a method that compresses the Key-Value (KV) cache of Transformer models along the temporal dimension, improving inference speed and memory efficiency across speech and text tasks.
Contribution
MTLA uses a hyper-network to dynamically merge temporally adjacent KV cache vectors, together with a stride-aware causal mask that keeps parallel training efficient and consistent with inference behaviour while reducing the memory footprint.
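One way to picture the stride-aware causal mask is the minimal sketch below. It is an illustration under a stated assumption, not the paper's definition: column j is assumed to hold the running merge of the tokens in j's block up to and including j, so a query at position i may see the last column of every completed block plus its own column. The function name `stride_aware_causal_mask` and the parameter `stride` are hypothetical.

```python
import math


def stride_aware_causal_mask(seq_len, stride):
    """Return a seq_len x seq_len matrix of booleans.

    mask[i][j] is True when query i may attend to column j.

    Assumption (illustrative, not taken from the paper's equations):
    column j represents the merged state of its block up to position j,
    so query i sees completed blocks via their final column and the
    current, possibly partial, block via column i itself.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            is_block_end = (j + 1) % stride == 0
            row.append(j <= i and (is_block_end or j == i))
        mask.append(row)
    return mask


mask = stride_aware_causal_mask(seq_len=5, stride=2)

# Under this assumption, each query sees as many columns as the
# compressed cache would hold at that step: ceil((i + 1) / stride).
visible = [sum(row) for row in mask]
```

With a stride of 2, the number of visible columns per query grows as 1, 1, 2, 2, 3, which matches the length a temporally compressed cache would have at each decoding step under this reading.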
Findings
Achieves 5.3x faster inference in speech translation
Reduces GPU memory usage by a factor of 8.3
Maintains competitive performance across multiple tasks
Abstract
While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation,…
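The effect of merging temporally adjacent cache vectors can be sketched as an incremental cache update. This is a hedged illustration only: `update_cache`, `weight_fn`, and the weighted-sum merge rule are assumptions made for the example, and `weight_fn` is a stand-in for the hyper-network, whose actual form is not given in the text above.

```python
import math


def update_cache(cache, latent, step, stride, weight_fn):
    """Add one token's latent vector to a temporally compressed cache.

    Assumption: tokens in the same block of `stride` steps are merged
    by a weighted sum, with the weight supplied by `weight_fn`
    (a hypothetical stand-in for the hyper-network).
    """
    weight = weight_fn(latent, step)
    scaled = [weight * v for v in latent]
    if step % stride == 0:
        # First token of a new block: open a new cache slot.
        cache.append(scaled)
    else:
        # Later token of the block: merge into the most recent slot.
        cache[-1] = [a + b for a, b in zip(cache[-1], scaled)]
    return cache


def constant_weight(latent, step):
    """Placeholder weight; a real hyper-network would be learned."""
    return 1.0


cache = []
for step in range(7):
    update_cache(cache, [float(step), 1.0], step, stride=3,
                 weight_fn=constant_weight)

# Seven tokens with stride 3 occupy ceil(7 / 3) = 3 cache slots.
num_slots = len(cache)
```

In this sketch the cache length grows with ceil(t / stride) rather than t, which is the sense in which temporal compression reduces the memory that grows with sequence length.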
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax
