Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion
Muhammad Umar Farooq, Darshan Adiga Haniya Narayana, Thomas Hain

TL;DR
This paper introduces a hybrid DNN-HMM acoustic model fusion method for low-resource multilingual speech recognition. Neural networks transform posteriors between languages, and fusing the transformed posteriors improves recognition accuracy, with reported gains of 14.65% and 6.5%.
Contribution
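The paper proposes an approach to multilingual acoustic model fusion in which a regression neural network, trained per source-target language pair, transforms posteriors between languages. The aim is to improve low-resource speech recognition without requiring large amounts of data.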
Findings
Posterior fusion improves recognition accuracy by 14.65% and 6.5%.
Neural network transformations require limited data.
Cross-lingual fusion achieves comparable results without language-specific posteriors.
Abstract
Multilingual speech recognition has drawn significant attention as an effective way to compensate for data scarcity in low-resource languages. End-to-end (e2e) modelling is preferred over conventional hybrid systems, mainly because it requires no lexicon. However, hybrid DNN-HMMs still outperform e2e models in limited-data scenarios. Furthermore, the problem of manual lexicon creation has been alleviated by publicly available trained models for grapheme-to-phoneme (G2P) conversion and text-to-IPA transliteration in many languages. In this paper, a novel approach to hybrid DNN-HMM acoustic model fusion is proposed in a multilingual setup for low-resource languages. Posterior distributions from different monolingual acoustic models, computed against a target-language speech signal, are fused together. A separate regression neural network is trained for each source-target language pair to transform…
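The fusion idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated, so the network shape (one hidden tanh layer), the softmax output, the untrained random weights, and the weighted-average fusion rule are all assumptions made for illustration. The names PairwiseMapper and fuse, and the language labels lang_a and lang_b, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class PairwiseMapper:
    """Hypothetical regression net for ONE source-target language pair.

    Maps frame-level posteriors over the source model's output classes to a
    distribution over the target model's output classes. Weights are random
    here; in practice they would be trained on target-language data.
    """

    def __init__(self, src_dim, tgt_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (src_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, tgt_dim))
        self.b2 = np.zeros(tgt_dim)

    def __call__(self, src_post):
        h = np.tanh(src_post @ self.w1 + self.b1)
        return softmax(h @ self.w2 + self.b2)


def fuse(posteriors, weights=None):
    """Assumed fusion rule: weighted average of per-model posteriors."""
    stacked = np.stack(posteriors)                  # (n_models, frames, tgt_dim)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights, stacked, axes=1)  # (frames, tgt_dim)
    return fused / fused.sum(axis=-1, keepdims=True)


# Toy data: 5 frames, 8 target classes, two source models of different sizes.
rng = np.random.default_rng(1)
frames, tgt_dim = 5, 8
src_dims = {"lang_a": 10, "lang_b": 12}

target_post = softmax(rng.normal(size=(frames, tgt_dim)))
mapped = []
for i, (name, dim) in enumerate(src_dims.items()):
    src_post = softmax(rng.normal(size=(frames, dim)))
    mapped.append(PairwiseMapper(dim, tgt_dim, seed=i)(src_post))

fused = fuse([target_post] + mapped)   # target posteriors plus mapped sources
cross_lingual_only = fuse(mapped)      # variant without the target model
```

Calling fuse(mapped) without target_post corresponds loosely to the cross-lingual setting mentioned under Findings, where no language-specific posteriors are used. How the paper actually weights or combines the streams is not stated in the excerpt above.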
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
