Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Yosuke Kashiwagi; Hayato Futami; Emiru Tsunoo; Siddhant Arora; Shinji Watanabe

arXiv:2406.12611·cs.SD·June 19, 2024

TL;DR

This paper introduces an encoder prompting method within the self-conditioned CTC framework. The method enables rapid language adaptation in multilingual speech recognition models and significantly reduces errors, especially for low-resource languages.

Contribution

The paper proposes an encoder prompting technique for CTC models that allows zero-shot language adaptation in multilingual speech recognition systems.

Findings

01. Achieved a 28% average error reduction across languages.

02. Reduced errors by 41% on low-resource languages.

03. Demonstrated the effectiveness of encoder prompting in zero-shot scenarios.

Abstract

End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts due to its conditionally independent output tokens. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has shown to significantly…
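As a rough illustration of the mechanism the abstract describes, the sketch below shows one self-conditioning step in plain Python: an intermediate CTC posterior is computed per frame, projected back to the hidden dimension, and added to the encoder state. The prompting rule shown here (suppressing every language token except the known one and renormalizing) is only an assumption about how a language prompt could enter the encoder; it is not taken from the paper. All names (`prompt_posterior`, `self_condition`, `lang_token_ids`, `known_lang_id`) and the toy weights are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def prompt_posterior(posterior, lang_token_ids, known_lang_id):
    """Hypothetical prompting rule (an assumption, not the paper's method):
    zero out every language token except the known one, then renormalize."""
    masked = [
        0.0 if (i in lang_token_ids and i != known_lang_id) else p
        for i, p in enumerate(posterior)
    ]
    total = sum(masked)
    if total == 0.0:
        # Degenerate frame: fall back to a one-hot on the known language token.
        return [1.0 if i == known_lang_id else 0.0 for i in range(len(posterior))]
    return [p / total for p in masked]


def self_condition(hidden, to_vocab, to_hidden, lang_token_ids=(), known_lang_id=None):
    """One self-conditioning step over a sequence of encoder frames.

    hidden:    T frames, each a list of D floats
    to_vocab:  V x D projection producing intermediate CTC logits
    to_hidden: D x V projection feeding the posterior back into the encoder
    """
    conditioned = []
    for frame in hidden:
        posterior = softmax(matvec(to_vocab, frame))
        if known_lang_id is not None:
            posterior = prompt_posterior(posterior, lang_token_ids, known_lang_id)
        feedback = matvec(to_hidden, posterior)
        # Residual-style conditioning: add the projected posterior to the frame.
        conditioned.append([h + f for h, f in zip(frame, feedback)])
    return conditioned


# Toy example: D = 2 hidden units, V = 4 tokens, tokens 2 and 3 are language tags.
hidden = [[0.5, -0.2], [0.1, 0.3]]
to_vocab = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.5, 0.5]]
to_hidden = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]

unprompted = self_condition(hidden, to_vocab, to_hidden)
adapted = self_condition(hidden, to_vocab, to_hidden,
                         lang_token_ids={2, 3}, known_lang_id=2)
```

Because the prompt only alters the posterior that is fed back, no parameters change, which is consistent with the zero-shot framing in the abstract. A real system would operate on batched tensors inside a trained encoder rather than on Python lists.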

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing