Configurable Multilingual ASR with Speech Summary Representations
Harrison Zhu, Ivan Fung, Yingke Zhu, Lahiru Samarakoon

TL;DR
This paper introduces csvMASR, a configurable multilingual speech recognition model that uses speech summary vectors and adapters to improve recognition accuracy and language classification across multiple languages.
Contribution
The paper proposes a novel architecture, csvMASR, which enhances configurability in multilingual ASR by integrating speech summary vectors and auxiliary language classification.
Findings
csvMASR reduces WER from 10.33% to 9.95% on the MLS dataset.
csvMASR outperforms existing MASR models in recognition accuracy.
csvMASR shows superior performance in language classification and prompting tasks.
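To put the WER figures above in context, the short sketch below computes the absolute and relative reduction implied by the two reported numbers. Only 10.33 and 9.95 come from this page; the function and variable names are illustrative.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# WER values (in percent) as reported in the Findings above.
baseline_wer = 10.33
csvmasr_wer = 9.95

absolute = baseline_wer - csvmasr_wer            # percentage points
relative = relative_reduction(baseline_wer, csvmasr_wer)

print(f"absolute: {absolute:.2f} points, relative: {relative:.1%}")
# prints "absolute: 0.38 points, relative: 3.7%"
```

The improvement is therefore 0.38 percentage points absolute, or roughly 3.7% relative to the baseline.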
Abstract
Approximately half of the world's population is multilingual, making multilingual ASR (MASR) essential. Deploying multiple monolingual models is challenging when the ground-truth language is unknown in advance. This motivates research efforts on configurable MASR models that can be prompted manually or adapted automatically to recognise specific languages. In this paper, we present the Configurable MASR model with Summary Vector (csvMASR), a novel architecture designed to enhance configurability. Our approach leverages adapters and introduces speech summary vector representations, inspired by conversational summary representations in speech diarization, to combine outputs from language-specific components at the utterance level. We also incorporate an auxiliary language classification loss to enhance configurability. Using data from 7 languages in the Multilingual…
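The abstract's idea of combining language-specific components at the utterance level can be illustrated with a small toy sketch. This is not the authors' implementation: mean pooling as the summary vector, linear adapters, a softmax language classifier, and one-hot weights for manual prompting are all assumptions made here for illustration, and every name (summary_vector, combine_adapters, language_loss, W_lang) is hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def summary_vector(frames):
    # Assumption: one vector per utterance, here the mean over time.
    return [sum(col) / len(frames) for col in zip(*frames)]


def combine_adapters(frames, adapters, W_lang, prompt=None):
    """Mix per-language adapter outputs with one weight set per utterance.

    frames: T x D features, adapters: one D x D matrix per language,
    W_lang: L x D classifier weights, prompt: optional language index.
    """
    probs = softmax(matvec(W_lang, summary_vector(frames)))
    if prompt is None:
        weights = probs  # automatic: classifier posteriors
    else:
        # Assumption: manual prompting selects a single language.
        weights = [1.0 if i == prompt else 0.0 for i in range(len(adapters))]
    mixed = []
    for frame in frames:
        outs = [matvec(A, frame) for A in adapters]
        mixed.append([sum(w * o[d] for w, o in zip(weights, outs))
                      for d in range(len(frame))])
    return mixed, probs


def language_loss(probs, true_lang):
    # Auxiliary cross-entropy on the utterance-level language posterior.
    return -math.log(probs[true_lang])


random.seed(0)
T, D, L = 5, 4, 3  # frames, feature size, number of languages (toy values)
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
adapters = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
            for _ in range(L)]
W_lang = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L)]

auto_out, probs = combine_adapters(frames, adapters, W_lang)
prompted_out, _ = combine_adapters(frames, adapters, W_lang, prompt=1)
aux_loss = language_loss(probs, true_lang=1)
```

In this sketch the same weights apply to every frame of an utterance, which is what "utterance level" is taken to mean here. Passing prompt=1 bypasses the classifier, so the output reduces to the second adapter alone.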
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
