Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si

TL;DR
Whisper-MLA replaces Multi-Head Attention (MHA) in Whisper with Multi-Head Latent Attention (MLA), cutting KV cache size by up to 87.5% while maintaining competitive accuracy on LibriSpeech.
Contribution
The paper proposes Whisper-MLA, which adapts MLA to Whisper's absolute positional embeddings, evaluates it across encoder self-attention, decoder self-attention, and cross-attention, and allows a pretrained Whisper model to be converted with minimal fine-tuning.
Findings
KV cache size reduced by up to 87.5%
Maintains competitive accuracy on LibriSpeech
Allows efficient conversion of pretrained models
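To make the 87.5% figure concrete, here is a minimal sketch of the cache arithmetic. It assumes the MHA cache stores one key and one value vector of width `d_model` per token per layer, and that the MLA cache stores a single latent vector per token per layer. The layer count, model width, and latent width below are hypothetical values chosen so the ratio comes out to 87.5%; they are not taken from the paper.

```python
def kv_cache_bytes(n_layers, seq_len, per_token_width, bytes_per_value=2):
    """Bytes cached for one sequence: grows linearly with seq_len."""
    return n_layers * seq_len * per_token_width * bytes_per_value


def cache_saving(mha_width, mla_width):
    """Fraction of the per-token cache removed by storing a latent instead."""
    return 1.0 - mla_width / mha_width


# Hypothetical sizes for illustration only (not the paper's configuration).
n_layers = 12
d_model = 768
latent_dim = d_model // 4      # assumed latent width

mha_width = 2 * d_model        # one key + one value vector per token
mla_width = latent_dim         # one shared latent vector per token

for seq_len in (100, 200, 400):
    mha = kv_cache_bytes(n_layers, seq_len, mha_width)
    mla = kv_cache_bytes(n_layers, seq_len, mla_width)
    print(f"seq_len={seq_len}: MHA={mha} B, latent={mla} B")

print(f"saving = {cache_saving(mha_width, mla_width):.1%}")
```

Under these assumptions the saving depends only on the ratio of the latent width to `2 * d_model`, so it holds at every sequence length even though both caches keep growing linearly.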
Abstract
The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache, which is problematic for many applications, especially those involving long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained…
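For readers unfamiliar with MLA, the core idea can be sketched with the low-rank key-value compression used in the original MLA formulation (DeepSeek-V2). The symbols below are assumptions for illustration and may differ from the paper's notation, and the sketch leaves out positional handling, which the abstract says is adapted for Whisper's absolute positional embeddings.

```latex
% h_t: hidden state of token t (width d); c_t: cached latent (width d_c, with d_c << 2 n_h d_h)
\begin{aligned}
c_t &= W^{DKV} h_t, & W^{DKV} &\in \mathbb{R}^{d_c \times d} \\
k_t &= W^{UK} c_t,  & W^{UK}  &\in \mathbb{R}^{n_h d_h \times d_c} \\
v_t &= W^{UV} c_t,  & W^{UV}  &\in \mathbb{R}^{n_h d_h \times d_c}
\end{aligned}
% Per-token cache: d_c values instead of 2 n_h d_h, a ratio of d_c / (2 n_h d_h).
```

Only $c_t$ is cached; keys and values are reconstructed from it when attention is computed, which is where the memory saving comes from.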
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
