Multi-head Temporal Latent Attention

Keqi Deng; Philip C. Woodland

arXiv:2505.13544 · cs.LG · November 4, 2025

TL;DR

This paper introduces Multi-head Temporal Latent Attention (MTLA), a method that compresses the Key-Value cache in Transformer models along the temporal dimension, significantly improving inference speed and memory efficiency across various speech and text tasks.

Contribution

MTLA uses a hyper-network to dynamically merge temporally adjacent KV cache vectors, together with a stride-aware causal mask that keeps parallel training consistent with inference behaviour. The result is a smaller memory footprint for self-attention inference.
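The mask is not defined on this page, so the following is only a minimal sketch of one construction that matches the description. It assumes, hypothetically, that during training column j holds the key merged from the start of its block up to position j, and that a query at step i may see every completed block plus its own in-progress block. The function name and the 0-indexed block convention are illustrative, not taken from the paper or its repository.

```python
import math


def stride_aware_causal_mask(seq_len: int, stride: int) -> list[list[bool]]:
    """Return mask[i][j], True when query i may attend to position j (0-indexed).

    Assumption (not from the source): position j is visible to query i when
    j <= i and j either closes a block of `stride` tokens or is the current step.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            closes_block = (j + 1) % stride == 0  # last token of a full block
            current_step = j == i                 # in-progress block seen at step i
            row.append(causal and (closes_block or current_step))
        mask.append(row)
    return mask


mask = stride_aware_causal_mask(seq_len=5, stride=2)
visible = [sum(row) for row in mask]

# Each query sees as many positions as the compressed cache would hold
# at the same decoding step, i.e. ceil((i + 1) / stride).
assert visible == [math.ceil((i + 1) / 2) for i in range(5)]
```

Under these assumptions the number of visible positions in row i equals the compressed cache length at decoding step i, which is the sense in which a full-length training pass can mirror inference.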

Findings

01  Achieves 5.3x faster inference in speech translation

02  Reduces GPU memory usage by a factor of 8.3

03  Maintains competitive performance across multiple tasks

Abstract

While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation,…
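To make the temporal compression concrete, here is a toy sketch of merging every `stride` adjacent cache vectors into one, so that the cache holds ceil(T / stride) entries instead of T. The per-position scalar weights stand in for the hyper-network output; their form, the function name `merge_kv_cache`, and the weighted-sum merge rule are assumptions made for illustration, not the paper's definition.

```python
import math


def merge_kv_cache(vectors, weights, stride):
    """Merge each run of `stride` adjacent vectors into one weighted sum.

    `weights` is a hypothetical stand-in for hyper-network output:
    one scalar per time step (an assumption, not the paper's formulation).
    """
    merged = []
    for start in range(0, len(vectors), stride):
        block = vectors[start:start + stride]
        block_weights = weights[start:start + stride]
        dim = len(block[0])
        merged.append(
            [sum(w * v[d] for w, v in zip(block_weights, block)) for d in range(dim)]
        )
    return merged


seq_len, stride = 10, 4
vectors = [[float(t), 1.0] for t in range(seq_len)]  # toy 2-dimensional cache vectors
weights = [0.5] * seq_len                            # fixed weights, for illustration only

cache = merge_kv_cache(vectors, weights, stride)

# The compressed cache grows with ceil(T / stride) rather than T.
assert len(cache) == math.ceil(seq_len / stride)
```

The last block may be shorter than `stride`; in this sketch it is merged from whatever vectors it contains, which is how a partially filled block would look midway through decoding.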

Code & Models

Repositories

d-keqi/mlta
PyTorch · Official

Videos

Multi-head Temporal Latent Attention · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax