ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation

Yuezhang Peng; Yuxin Liu; Yao Li; Sheng Wang; Fei Wen; Xie Chen

arXiv:2512.01267 · cs.MM · December 2, 2025

TL;DR

ZO-ASR introduces a memory-efficient zeroth-order fine-tuning method for speech models that eliminates the need for back-propagation, enabling effective adaptation in resource-limited settings.

Contribution

The paper presents ZO-ASR, a novel zeroth-order fine-tuning approach that reduces memory usage and bypasses back-propagation for speech foundation models.

Findings

01. Achieves up to an 18.9% relative WER reduction over zero-shot baselines in supervised domain adaptation on Whisper-Large-V3.

02. Its multiple-query mechanism improves robustness, and it outperforms existing zeroth-order methods.

03. Performs moderately worse than the first-order optimizer Adam in unsupervised test-time adaptation on Wav2Vec2-Base.
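
For reference, a relative WER reduction is conventionally measured against the baseline error rate. The notation below is assumed here for illustration and is not taken from the paper:

```latex
\[
\text{Relative WER reduction}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{adapted}}}
         {\mathrm{WER}_{\text{baseline}}}
\]
```

As a purely hypothetical illustration, a zero-shot WER of 10.0% reduced by 18.9% relative would land at 8.11%. The paper's absolute WER values are not given in this summary.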

Abstract

Fine-tuning pre-trained speech foundation models for Automatic Speech Recognition (ASR) is prevalent, yet constrained by substantial GPU memory requirements. We introduce ZO-ASR, a memory-efficient Zeroth-Order (ZO) method that avoids Back-Propagation (BP) and activation memory by estimating gradients via forward passes. When combined with the SGD optimizer, ZO-ASR-SGD fine-tunes ASR models using only inference memory. Our evaluation spans supervised and unsupervised tasks. For Supervised Domain Adaptation on Whisper-Large-V3, ZO-ASR's multiple-query mechanism enhances robustness and achieves up to an 18.9% relative Word Error Rate reduction over zero-shot baselines, outperforming existing ZO methods. For unsupervised Test-Time Adaptation on Wav2Vec2-Base, ZO-ASR exhibits moderately lower performance than the first-order optimizer Adam. Our BP-free approach provides a viable solution…
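
The idea of estimating gradients from forward passes alone, averaged over several queries and fed to plain SGD, can be sketched on a toy problem. This is not the authors' implementation: the two-point, SPSA-style estimator with Gaussian perturbations, the quadratic stand-in loss, and every name and hyperparameter below (`zo_gradient`, `zo_sgd`, `eps`, `num_queries`, `lr`) are assumptions chosen for illustration.

```python
import random


def loss(theta):
    # Toy quadratic loss standing in for an ASR training loss (assumption).
    return sum((t - 1.0) ** 2 for t in theta)


def zo_gradient(loss_fn, theta, eps=1e-3, num_queries=4, rng=random):
    """Average num_queries two-point estimates; only loss evaluations are used."""
    grad = [0.0] * len(theta)
    for _ in range(num_queries):
        # Random Gaussian perturbation direction, one entry per parameter.
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        loss_plus = loss_fn([t + eps * zi for t, zi in zip(theta, z)])
        loss_minus = loss_fn([t - eps * zi for t, zi in zip(theta, z)])
        # Finite-difference slope along z; no back-propagation involved.
        slope = (loss_plus - loss_minus) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += slope * zi / num_queries
    return grad


def zo_sgd(loss_fn, theta, lr=0.05, steps=300, seed=0, **estimator_kwargs):
    """Plain SGD driven by the zeroth-order gradient estimate."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, theta, rng=rng, **estimator_kwargs)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    start = [0.0] * 5
    final = zo_sgd(loss, start, num_queries=4)
    print("initial loss:", loss(start))
    print("final loss:", loss(final))
```

Each step in this sketch costs two loss evaluations per query and stores no activations for a backward pass, which is the property the abstract relies on. Averaging over more queries lowers the variance of the estimate at the price of more forward passes.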

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques