On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR
Ganesh Pavan Kartikeya Bharadwaj Kolluri, Michael Kampouridis, Ravi Shekhar

TL;DR
This paper investigates how pruning layers from the Whisper encoder affects SLAM-ASR performance and how far LoRA fine-tuning can recover or improve accuracy. Pruning two encoder layers raises WER by only 2-4% while reducing the parameter count.
Contribution
It provides a detailed analysis of encoder layer pruning effects in SLAM-ASR and shows that LoRA fine-tuning can effectively compensate for pruning-induced performance loss.
Findings
Pruning two encoder layers causes only a 2-4% WER increase.
Combining pruning with LoRA outperforms the unpruned baseline.
LoRA reduces word errors by 11-21% in high-resource languages.
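The findings above are stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the conventional definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The helper names `wer` and `relative_change` and the example sentences are illustrative assumptions, not taken from the paper, and this summary does not say whether the reported percentages are relative or absolute.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline` (hypothetical helper)."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Made-up sentences, purely for illustration.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a change in WER can be reported either as an absolute difference in percentage points or as a relative change against the baseline, which is what `relative_change` computes.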
Abstract
Automatic speech recognition (ASR) has advanced rapidly in recent years, driven by large-scale pretrained models and end-to-end architectures such as SLAM-ASR. A key component of SLAM-ASR systems is the Whisper speech encoder, which provides robust acoustic representations. While model pruning has been explored for the full Whisper encoder-decoder architecture, its impact within the SLAM-ASR setting remains under-investigated. In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. We further examine the extent to which LoRA-based fine-tuning can recover performance degradation caused by pruning. Experiments conducted across three Whisper variants (Small, Medium, Large-v2), three languages representing distinct resource levels (Danish, Dutch, English), and over 200 training runs demonstrate that pruning two encoder…
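The two techniques named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say which encoder layers are pruned (the sketch drops the last ones) or which modules receive LoRA adapters, and the names `prune_top_layers` and `LoRALinear` are hypothetical. The LoRA part follows the standard formulation, in which a frozen weight `W` is augmented with a low-rank update scaled by `alpha / rank` and the `B` factor starts at zero.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def prune_top_layers(layers, num_pruned):
    """Return the layer stack without its last `num_pruned` layers.

    Assumption: pruning removes the final layers; the summary does not
    specify which layers the paper actually drops.
    """
    if not 0 <= num_pruned < len(layers):
        raise ValueError("num_pruned must leave at least one layer")
    return layers[:len(layers) - num_pruned]


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in = len(weight), len(weight[0])
        self.rank = rank
        self.weight = weight       # frozen pretrained weight
        self.scale = alpha / rank
        # A gets a small random init; B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A (rank x d_in) and B (d_out x rank) are trained.
        return self.rank * (self.d_in + self.d_out)


if __name__ == "__main__":
    encoder_layers = [f"layer_{i}" for i in range(12)]  # placeholder layer stack
    print(len(prune_top_layers(encoder_layers, 2)))

    layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1, alpha=2.0)
    print(layer.forward([3.0, 4.0]), layer.trainable_params())
```

The low-rank factors add `rank * (d_in + d_out)` trainable parameters per adapted matrix, far fewer than the `d_in * d_out` of the frozen weight when the rank is small.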
