Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass

TL;DR
Whisper-AT extends the Whisper speech recognition model to also tag audio events at less than 1% extra computational cost, reusing Whisper's audio representations, which encode background sounds rather than discarding them.
Contribution
This paper introduces Whisper-AT, a unified model that combines speech recognition and audio event tagging by freezing Whisper's backbone and training a lightweight tagger.
Findings
Although Whisper is robust to background sounds, its audio representations are not noise-invariant; they are highly correlated with non-speech sounds.
Whisper-AT achieves audio event recognition with less than 1% extra computation.
Whisper-AT performs both speech recognition and audio tagging in a single pass.
Abstract
In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
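The frozen-backbone setup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `frozen_backbone` is a hypothetical stand-in for the Whisper encoder, and the head (temporal mean pooling, one linear layer, sigmoid outputs) is an assumed, deliberately simple tagger, as are `FEAT_DIM`, `NUM_TAGS`, and the training settings. The only point it shows is that the backbone's features are computed once and only the small head is trained.

```python
import math
import random

random.seed(0)

FEAT_DIM = 8   # hypothetical feature width (a real encoder is far wider)
NUM_TAGS = 3   # hypothetical number of audio event classes


def frozen_backbone(audio):
    """Stand-in for a frozen speech encoder: maps samples to per-frame features.

    It holds no trainable state, mirroring a backbone whose weights are never updated.
    """
    frames = [audio[i:i + 4] for i in range(0, len(audio) - 3, 4)]
    return [[math.sin((k + 1) * sum(f)) for k in range(FEAT_DIM)] for f in frames]


class LinearTagger:
    """Lightweight head: temporal mean pooling -> linear layer -> sigmoid."""

    def __init__(self):
        self.w = [[random.uniform(-0.1, 0.1) for _ in range(FEAT_DIM)]
                  for _ in range(NUM_TAGS)]
        self.b = [0.0] * NUM_TAGS

    @staticmethod
    def pool(features):
        n = len(features)
        return [sum(f[k] for f in features) / n for k in range(FEAT_DIM)]

    def predict(self, features):
        x = self.pool(features)
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + b
                  for row, b in zip(self.w, self.b)]
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]

    def train_step(self, features, targets, lr=0.5):
        """One SGD step on binary cross-entropy; only the head's weights change."""
        x = self.pool(features)
        probs = self.predict(features)
        for c, (p, t) in enumerate(zip(probs, targets)):
            grad = p - t  # derivative of BCE w.r.t. the logit of a sigmoid output
            for k in range(FEAT_DIM):
                self.w[c][k] -= lr * grad * x[k]
            self.b[c] -= lr * grad
        # Loss measured before this step's update
        return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                    for p, t in zip(probs, targets))


audio = [random.uniform(-1, 1) for _ in range(64)]
features = frozen_backbone(audio)  # computed once; the backbone is never updated
tagger = LinearTagger()
targets = [1.0, 0.0, 1.0]          # multi-label: several events may be active at once
losses = [tagger.train_step(features, targets) for _ in range(200)]
tag_probs = tagger.predict(features)
```

In the setting the abstract describes, the same backbone features would also feed speech recognition, so the tagger adds only the cost of the small head on top of a single forward pass.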
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
