When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper

Akif Islam; Raufun Nahar; Md. Ekramul Hamid

arXiv:2603.04710·cs.SD·March 6, 2026

TL;DR

This paper investigates the impact of speech enhancement using SAM-Audio on zero-shot ASR with Whisper, revealing that improved perceptual audio quality can actually harm recognition accuracy, especially with larger models.

Contribution

The study provides a systematic empirical analysis showing that denoising with SAM-Audio degrades zero-shot ASR performance, challenging the assumption that better perceptual quality yields better recognition accuracy.

Findings

01

SAM-Audio preprocessing increases WER and CER despite better signal quality.

02

The degradation from SAM-Audio preprocessing is more pronounced with larger Whisper models.

03

Perceptually cleaner audio does not necessarily improve ASR performance.

Abstract

Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of SAM-Audio (Segment Anything Model Audio), a recent foundation-scale speech enhancement model from Meta AI, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available noisy English dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate (WER)…
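
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, at word level for WER and character level for CER. The function names (`edit_distance`, `wer`, `cer`) and the whitespace tokenization are assumptions made for illustration; this listing does not state which text normalization or scoring tool the authors used, so treat it as a generic definition rather than the paper's evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions, insertions."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate, assuming simple whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over raw characters, spaces included."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the paper's datasets.
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

Because both metrics normalize by the reference length, they can exceed 1.0 when the hypothesis contains many insertions. Under this definition, comparing raw-audio and enhanced-audio transcripts against the same reference would show a degradation as a higher WER or CER for the enhanced condition.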

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation