From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

Chao-Han Huck Yang; Bo Li; Yu Zhang; Nanxin Chen; Rohit Prabhavalkar; Tara N. Sainath; Trevor Strohman

arXiv:2301.07851·cs.SD·June 30, 2023

Open Access

TL;DR

This paper introduces a parameter-efficient neural reprogramming framework that adapts English ASR models for multilingual speech recognition, achieving competitive results with significantly fewer trainable parameters.

Contribution

It presents a novel reprogramming approach with auxiliary architectures for cross-lingual ASR, reducing training costs and outperforming existing tuning methods.

Findings

01. Achieves 8.1%-11.9% WER on multilingual LibriSpeech with only 4.2%-6.8% of parameters trained.

02. Outperforms existing ASR tuning architectures and self-supervised extension methods.

03. Enables effective monolingual and multilingual speech recognition with large-scale pre-trained models.

Abstract

In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empower model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer used as a frozen pre-trained backbone. Experiments on a seven-language Multilingual LibriSpeech (MLS) task show that model reprogramming requires only 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of the original trainable parameters of a full ASR model to achieve competitive results, ranging from 11.9% to 8.1% WER averaged across…
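The general idea of reprogramming a frozen backbone can be illustrated with a toy sketch. This is not the paper's architecture: the abstract does not specify the auxiliary modules, so the example assumes the simplest possible form, a trainable additive offset `delta` on the input features, with a small fixed linear map `W` standing in for the pre-trained encoder. All names and sizes (`DIM`, `W`, `delta`, `lr`) are hypothetical.

```python
import random

random.seed(0)
DIM = 4  # toy feature dimension (hypothetical)

# Frozen "backbone": a fixed linear map standing in for a pre-trained encoder.
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
W_before = [row[:] for row in W]  # copy, to confirm later that W never changes

# Trainable reprogramming parameters: an additive offset on the input features.
delta = [0.0] * DIM

def backbone(x):
    """Apply the frozen linear map."""
    return [sum(W[i][j] * x[j] for j in range(DIM)) for i in range(DIM)]

def forward(x):
    """Reprogrammed forward pass: perturb the input, then run the frozen backbone."""
    return backbone([x[j] + delta[j] for j in range(DIM)])

def loss(x, target):
    """Squared error between the reprogrammed output and the target."""
    y = forward(x)
    return sum((y[i] - target[i]) ** 2 for i in range(DIM))

x = [random.uniform(-1, 1) for _ in range(DIM)]
target = [random.uniform(-1, 1) for _ in range(DIM)]

initial_loss = loss(x, target)
lr = 0.05  # small enough for stable descent with entries of W in [-1, 1]
for _ in range(200):
    y = forward(x)
    err = [y[i] - target[i] for i in range(DIM)]
    # Gradient w.r.t. delta only: d(loss)/d(delta_j) = 2 * sum_i err_i * W[i][j]
    grad = [2 * sum(err[i] * W[i][j] for i in range(DIM)) for j in range(DIM)]
    for j in range(DIM):
        delta[j] -= lr * grad[j]  # update delta; W is never touched
final_loss = loss(x, target)

trainable = len(delta)          # 4 trainable parameters
total = trainable + DIM * DIM   # plus 16 frozen backbone parameters
trainable_fraction = trainable / total  # 0.2 in this toy setup
```

In this toy setup one fifth of the parameters are trainable; the figures quoted in the abstract (4.2% to 6.8%) are the analogous ratio for the real models, where the frozen backbone is far larger relative to the added modules.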

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques