Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

Abhishek Gupta; Amruta Parulekar; Sameep Chattopadhyay; Preethi Jyothi

arXiv:2410.13445·cs.CL·October 18, 2024

TL;DR

This paper combines parameter-efficient fine-tuning with text-only adaptation in a multilingual multimodal model (SeamlessM4T) to improve low-resource ASR. Cross-lingual transfer from a high-resource language yields up to a 17% relative WER reduction over a baseline in a zero-shot setting, without any labeled speech.

Contribution

The paper demonstrates that text-only adaptation can be combined effectively with parameter-efficient fine-tuning in a multilingual multimodal model for low-resource ASR, including zero-shot cross-lingual transfer.

Findings

01 Up to 17% relative WER reduction over a baseline in zero-shot transfer, without any labeled speech

02 Combining text-only adaptation with parameter-efficient ASR fine-tuning boosts low-resource ASR performance

03 Cross-lingual transfer from a high-resource language is effective

Abstract

Automatic speech recognition (ASR) for low-resource languages remains a challenge due to the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. Multimodal models are able to leverage unlabeled text via text-only adaptation with further parameter-efficient ASR fine-tuning, thus boosting ASR performance. We also show cross-lingual transfer from a high-resource language, achieving up to a relative 17% WER reduction over a baseline in a zero-shot setting without any labeled speech.
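The headline result is stated as a relative WER reduction, which is easy to confuse with an absolute one. The sketch below shows how word error rate and a relative reduction are conventionally computed. It is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the text is split on whitespace with no normalization, and the baseline and adapted WER values in the usage lines are made-up numbers chosen only to reproduce a 17% relative figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by the adapted system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# One deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical WERs (not from the paper): 30.0% baseline, 24.9% adapted.
example_reduction = relative_wer_reduction(0.30, 0.249)
```

With these illustrative numbers the absolute drop is 5.1 WER points, while the relative reduction is 0.051 / 0.30 = 17%, which is the kind of figure the abstract reports.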

Taxonomy

Topics: Speech Recognition and Synthesis