On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications

Vamsikrishna Chemudupati; Marzieh Tahaei; Heitor Guimaraes; Arthur Pimentel; Anderson Avila; Mehdi Rezagholizadeh; Boxing Chen; Tiago Falk

arXiv:2305.14546 · eess.AS · May 25, 2023 · 1 cites

TL;DR

This paper investigates the transferability and robustness of Whisper's speech representations across various real-world tasks and noisy environments, demonstrating its potential for practical, cross-task speech applications.

Contribution

It is the first comprehensive study evaluating Whisper's representations beyond ASR, including robustness in noisy and reverberant conditions across multiple speech tasks.

Findings

01. Whisper achieves promising results across multiple tasks.
02. Whisper's representations are robust in noisy and reverberant environments.
03. Potential for real-world deployment of Whisper-based systems.

Abstract

Large self-supervised pre-trained speech models have achieved remarkable success across various speech-processing tasks. The self-supervised training of these models leads to universal speech representations that can be used for different downstream tasks, ranging from automatic speech recognition (ASR) to speaker identification. Recently, Whisper, a transformer-based model, was proposed and trained on a large amount of weakly supervised data for ASR; it outperformed several state-of-the-art self-supervised models. Given the superiority of Whisper for ASR, in this paper we explore the transferability of its representations to four other speech tasks in the SUPERB benchmark. Moreover, we explore the robustness of Whisper's representations for "in the wild" tasks where speech is corrupted by environmental noise and room reverberation. Experimental results show Whisper achieves promising results…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing