Can Whisper perform speech-based in-context learning?

Siyin Wang; Chao-Han Huck Yang; Ji Wu; Chao Zhang

arXiv:2309.07081 · eess.AS · March 21, 2024 · 2 cites

Open Access

TL;DR

This paper explores the in-context learning capabilities of Whisper ASR models, introducing a speech-based in-context learning method that improves recognition accuracy without gradient updates, especially for dialects.

Contribution

It proposes a novel speech-based in-context learning approach for test-time adaptation of Whisper models, enhancing speech recognition across dialects without gradient descent.

Findings

01 Achieved an average 32.3% relative WER reduction on Chinese dialects.

02 Raised the average relative WER reduction to 36.4% with k-NN-based in-context example selection.

03 Verified the gains on speaker adaptation and continuous speech recognition tasks.
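
The findings above are stated as relative WER reductions. As a reminder of what that metric means, here is a minimal Python sketch, assuming the standard definition of WER (word-level edit distance divided by the number of reference words) and of relative reduction (the drop in WER divided by the baseline WER). The function names are hypothetical and the paper's own scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer
```

Under this definition, a hypothetical system that goes from 40.0% WER to 27.08% WER would show a relative reduction of about 32.3%. Those two WER values are illustrative only and are not taken from the paper.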

Abstract

This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce word error rates (WERs) using only a small number of labelled speech samples and no gradient descent. Language-level adaptation experiments using Chinese dialects showed that, when applying SICL to isolated word ASR, consistent and considerable relative WER reductions can be achieved with Whisper models of any size on two dialects, averaging 32.3%. A k-nearest-neighbours-based in-context example selection technique can be applied to further improve the efficiency of SICL, which increases the average relative WER reduction to 36.4%. The findings are verified using speaker adaptation or continuous speech recognition tasks, and…
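
The k-nearest-neighbours example selection mentioned in the abstract can be pictured with a small sketch. It assumes that every labelled utterance in the candidate pool and the test utterance have already been mapped to fixed-size embedding vectors, and that closeness is measured by cosine similarity. Both choices are assumptions made for illustration: this page does not say which representation or distance the authors use, and the names `cosine_similarity` and `select_in_context_examples` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One labelled candidate: (embedding vector, transcript).
Example = Tuple[Sequence[float], str]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_context_examples(
    query_embedding: Sequence[float],
    pool: List[Example],
    k: int,
) -> List[Example]:
    """Return the k pool entries most similar to the query, best first."""
    ranked = sorted(
        pool,
        key=lambda example: cosine_similarity(query_embedding, example[0]),
        reverse=True,
    )
    return ranked[:k]
```

The selected examples would then serve as the labelled speech samples supplied to the model at test time, in place of examples drawn without regard to the test utterance.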

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research