Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling

Shao-Syuan Huang; Kuan-Po Huang; Andy T. Liu; Hung-yi Lee

arXiv:2412.16474 · eess.AS · December 24, 2024

TL;DR

This paper proposes methods that improve the performance of multilingual ASR systems on unseen languages by combining the embeddings of known language tags with the model's predicted language probabilities, and reports significant gains in experiments.

Contribution

The paper introduces a weighted sum method and a predictor-based approach that use the embeddings of known language tags to handle unseen languages, improving ASR accuracy over baseline approaches.

Findings

01

Significant improvements in zero-shot ASR performance on unseen languages.

02

The proposed methods outperform baseline approaches in experiments.

03

Enhanced ASR accuracy in both zero-shot and fine-tuning scenarios.

Abstract

Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum…
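The weighted sum idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the toy logits below are all assumptions made for illustration, and no Whisper API is called. The sketch only shows the arithmetic of mixing language-tag embeddings by predicted language probabilities. The predictor-based refinement is not sketched, because the abstract is truncated before describing it.

```python
import math


def softmax(logits):
    """Turn raw language-identification scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_language_embedding(lang_probs, tag_embeddings):
    """Return sum_i p_i * E_i, where p_i is the predicted probability of
    language i and E_i is the embedding of that language's tag."""
    if len(lang_probs) != len(tag_embeddings):
        raise ValueError("need one probability per language tag embedding")
    dim = len(tag_embeddings[0])
    mixed = [0.0] * dim
    for p, emb in zip(lang_probs, tag_embeddings):
        for j in range(dim):
            mixed[j] += p * emb[j]
    return mixed


# Hypothetical toy data: three known language tags with 2-d embeddings.
toy_tag_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical language-identification scores for an unseen-language utterance.
toy_logits = [2.0, 1.0, 0.0]

toy_probs = softmax(toy_logits)
mixed_embedding = weighted_language_embedding(toy_probs, toy_tag_embeddings)
print(mixed_embedding)
```

In this sketch, the mixed embedding would take the place of a single language tag's embedding in the decoder prefix, so an utterance in an unseen language is represented as a blend of the known languages it most resembles.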

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems