Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

Sen Zhang; Jianguo Wei; Wenhuan Lu; Xianghu Yue; Wei Li; Qiang Li; Pengcheng Zhao; Ming Cai; Luo Si

arXiv:2603.00563 · cs.SD · March 3, 2026

TL;DR

Whisper-MLA reduces GPU memory usage in Whisper-based ASR by replacing Multi-Head Attention (MHA) with Multi-Head Latent Attention (MLA), cutting KV cache size by up to 87.5% while keeping accuracy competitive on LibriSpeech.

Contribution

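The paper proposes Whisper-MLA, which adapts MLA to Whisper's absolute positional embeddings and evaluates it across encoder self-attention, decoder self-attention, and cross-attention modules. A pretrained Whisper model can be converted with minimal fine-tuning, and applying MLA only to decoder self-attention gives the best balance between accuracy and memory use.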

Findings

01. KV cache size reduced by up to 87.5%

02. Maintains competitive accuracy on LibriSpeech

03. Allows efficient conversion of pretrained models

Abstract

The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism consumes significant GPU memory because the Key-Value (KV) cache grows linearly, which is problematic for many applications, especially those involving long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained…
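The linear KV cache growth described in the abstract, and the reported reduction of up to 87.5%, can be illustrated with a short arithmetic sketch. Everything here is an assumption made for illustration rather than the paper's configuration: the `DecoderConfig` class, the helper names, the layer count and model width (chosen to resemble a large Whisper decoder), the fp16 storage, and in particular the latent width `d_latent`, which is picked so that the ratio comes out to 87.5%. The sketch also assumes the latent cache holds a single vector per token per layer and ignores any extra positional components a real MLA variant might store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical decoder shape; values are illustrative assumptions."""
    n_layers: int
    d_model: int
    bytes_per_value: int = 2  # assumes fp16 storage


def mha_cache_bytes(cfg: DecoderConfig, n_tokens: int, batch: int = 1) -> int:
    # Standard MHA caches one key and one value vector of width d_model
    # per token per layer, so the size is linear in n_tokens.
    return batch * cfg.n_layers * n_tokens * 2 * cfg.d_model * cfg.bytes_per_value


def latent_cache_bytes(cfg: DecoderConfig, n_tokens: int, d_latent: int,
                       batch: int = 1) -> int:
    # Assumed latent-attention cache: one compressed vector of width
    # d_latent per token per layer, replacing the separate K and V.
    return batch * cfg.n_layers * n_tokens * d_latent * cfg.bytes_per_value


def cache_reduction(cfg: DecoderConfig, n_tokens: int, d_latent: int) -> float:
    # Fraction of decoder self-attention cache memory saved.
    full = mha_cache_bytes(cfg, n_tokens)
    return 1 - latent_cache_bytes(cfg, n_tokens, d_latent) / full


# Illustrative values only (not taken from the paper).
cfg = DecoderConfig(n_layers=32, d_model=1280)
n_tokens = 448
d_latent = 320  # hypothetical: one eighth of the 2 * d_model stored by MHA

print("MHA cache bytes:   ", mha_cache_bytes(cfg, n_tokens))
print("Latent cache bytes:", latent_cache_bytes(cfg, n_tokens, d_latent))
print("Reduction:         ", cache_reduction(cfg, n_tokens, d_latent))
```

Under these assumptions the saving depends only on the ratio `d_latent / (2 * d_model)`, not on sequence length or layer count, so the same fraction applies however long the decoded transcript becomes.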

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing