Adapting OpenAI's Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets
Yuhang Yang, Yizhou Peng, Xionghu Zhong, Hao Huang, Eng Siong Chng

TL;DR
This study explores adapting OpenAI's Whisper model for Mandarin-English code-switch speech recognition, demonstrating that minimal adaptation data can significantly improve performance across different datasets and prompting strategies.
Contribution
It provides empirical evidence on effective adaptation of Whisper with limited data and various prompts for code-switch speech recognition.
Findings
As little as 1-10 hours of adaptation data can saturate performance on SEAME.
More than 100 hours of data improve results on ASRU2019.
Adapting Whisper with code-switch data consistently enhances recognition accuracy.
Abstract
This paper details the experimental results of adapting OpenAI's Whisper model for code-switch Mandarin-English automatic speech recognition (ASR) on the SEAME and ASRU2019 corpora. We conducted two experiments: a) using adaptation data from 1 to 100/200 hours to demonstrate the effectiveness of adaptation, and b) examining different language ID setups in the Whisper prompt. The Mixed Error Rate results show that as little as 1 to 10 hours of adaptation data may be enough to reach saturation in performance gain (SEAME), while the ASRU task continued to show performance gains with more adaptation data (more than 100 hours). For the language prompt, the results show that although various prompting strategies initially produce different outcomes, adapting the Whisper model with code-switch data uniformly improves its performance. These results may also be relevant to the community when applying Whisper for related…
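The abstract reports results as Mixed Error Rate (MER) but does not spell out how it is scored. For Mandarin-English code-switch speech, MER is commonly computed as an edit-distance error rate in which Mandarin is counted per character and English per word. The sketch below illustrates that convention only; the tokenization regex and the function names (`tokenize_mixed`, `edit_distance`, `mixed_error_rate`) are assumptions made for illustration, not the authors' scoring script.

```python
import re

# Assumption: each CJK ideograph is one token; each run of Latin letters,
# digits, or apostrophes is one token. Punctuation and spaces are dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text: str) -> list[str]:
    """Split mixed text into Mandarin characters and lowercased English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def mixed_error_rate(refs: list[str], hyps: list[str]) -> float:
    """Total edit errors divided by total reference tokens over a corpus."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize_mixed(ref)
        errors += edit_distance(ref_tokens, tokenize_mixed(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference contains no scorable tokens")
    return errors / total
```

For example, scoring the hypothesis "我想要 a cap of coffee" against the reference "我想要 a cup of coffee" gives one substitution over seven reference tokens (three characters plus four words).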
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
