Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass

arXiv:2307.03183·cs.SD·October 10, 2023

TL;DR

Whisper-AT extends the Whisper speech recognition model to also tag audio events at minimal additional computational cost, exploiting the fact that Whisper's audio representations encode background-sound information rather than discarding it.

Contribution

This paper introduces Whisper-AT, a unified model that combines speech recognition and audio event tagging by freezing Whisper's backbone and training a lightweight tagger.

Findings

1. Whisper's audio representations are highly correlated with non-speech sounds.
2. Whisper-AT achieves audio event recognition with less than 1% extra computation.
3. Whisper-AT performs both speech recognition and audio tagging in a single pass.

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
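
The frozen-backbone recipe in the abstract can be illustrated with a small, framework-free sketch. This is not the Whisper-AT implementation: `FrozenBackbone`, `LinearTagger`, `run_demo`, the dimensions, and the synthetic data are all hypothetical stand-ins chosen for illustration. It assumes only the general pattern the abstract states, namely that the pretrained encoder's weights are never updated and that only a small multi-label head is trained on its features.

```python
import copy
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class FrozenBackbone:
    """Hypothetical stand-in for a pretrained encoder; its weights are never updated."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        self.weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
                        for _ in range(feat_dim)]

    def encode(self, x):
        # Fixed nonlinear projection of the input into feature space.
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in self.weights]


class LinearTagger:
    """Lightweight multi-label head: one independent sigmoid per event class."""

    def __init__(self, feat_dim, n_classes):
        self.w = [[0.0] * feat_dim for _ in range(n_classes)]
        self.b = [0.0] * n_classes

    def predict(self, feats):
        return [sigmoid(sum(w * f for w, f in zip(row, feats)) + b)
                for row, b in zip(self.w, self.b)]

    def sgd_step(self, feats, targets, lr=0.1):
        # For binary cross-entropy, d(loss)/d(logit) = p - t.
        for k, (p, t) in enumerate(zip(self.predict(feats), targets)):
            grad = p - t
            for j, f in enumerate(feats):
                self.w[k][j] -= lr * grad * f
            self.b[k] -= lr * grad


def mean_bce(tagger, feats_list, targets_list):
    """Mean binary cross-entropy over all clips and classes."""
    eps = 1e-12
    total, count = 0.0, 0
    for feats, targets in zip(feats_list, targets_list):
        for p, t in zip(tagger.predict(feats), targets):
            total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
            count += 1
    return total / count


def run_demo(seed=0, n_clips=200, in_dim=6, feat_dim=12, epochs=20):
    rng = random.Random(seed)
    backbone = FrozenBackbone(in_dim, feat_dim, seed=seed + 1)
    weights_before = copy.deepcopy(backbone.weights)

    # Synthetic "clips" with two made-up event labels (illustration only).
    clips = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(n_clips)]
    labels = [[1.0 if c[0] > 0 else 0.0, 1.0 if c[1] > 0 else 0.0] for c in clips]

    # The backbone is frozen, so its features can be computed once and reused.
    feats = [backbone.encode(c) for c in clips]

    tagger = LinearTagger(feat_dim, n_classes=2)
    loss_before = mean_bce(tagger, feats, labels)
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            tagger.sgd_step(f, y)  # only the head's parameters change
    loss_after = mean_bce(tagger, feats, labels)

    return loss_before, loss_after, backbone.weights == weights_before
```

Calling `run_demo()` returns the head's loss before and after training, plus a flag confirming the backbone weights were left untouched. Because the encoder is fixed, its features are shared between the original task and the new head, which is the property that lets a model of this kind emit both outputs from one forward pass.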

Code & Models

Repositories

YuanGongND/whisper-at (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing