Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling
Shao-Syuan Huang, Kuan-Po Huang, Andy T. Liu, Hung-yi Lee

TL;DR
This paper proposes methods that improve multilingual ASR on unseen languages by combining the embeddings of known language tags, weighted by the model's predicted language probabilities. The authors report significant gains in their experiments.
Contribution
It introduces a weighted-sum method and a predictor-based approach that use language embeddings for unseen languages, improving ASR accuracy over existing models.
Findings
Significant improvements in zero-shot ASR performance on unseen languages.
The proposed methods outperform baseline approaches in experiments.
Enhanced ASR accuracy in both zero-shot and fine-tuning scenarios.
Abstract
Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum…
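The weighted sum described in the abstract can be illustrated with a minimal sketch. It assumes the language-identification output is available as a vector of logits over the known language tags and that the tag embeddings are rows of a matrix. The function names, array shapes, and toy values below are hypothetical and are not taken from the paper or from Whisper's code.

```python
import numpy as np


def softmax(logits):
    """Turn raw language-ID logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def weighted_language_embedding(lang_logits, tag_embeddings):
    """Mix known language-tag embeddings by predicted language probability.

    lang_logits:    shape (num_languages,), hypothetical language-ID logits
    tag_embeddings: shape (num_languages, dim), one embedding row per known tag
    Returns a single vector of shape (dim,).
    """
    probs = softmax(np.asarray(lang_logits, dtype=float))
    return probs @ np.asarray(tag_embeddings, dtype=float)


# Toy example: three known languages with 2-dimensional tag embeddings.
tag_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
# The first two languages are equally likely; the third is effectively ruled out.
lang_logits = np.array([2.0, 2.0, -1e9])

mixed = weighted_language_embedding(lang_logits, tag_embeddings)
```

In this toy case the mixed vector lies halfway between the first two tag embeddings. In the setting the abstract describes, such a vector would stand in for the language-tag embedding of a language the model has no tag for.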
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
