TL;DR
This paper investigates how phonotactic differences affect multilingual and zero-shot speech recognition. It finds that modeling crosslingual phonotactics offers limited benefit and that language-specific phonotactic data improves zero-shot transfer.
Contribution
It provides an extensive evaluation of phonotactic effects on zero-shot ASR using hybrid models, highlighting the importance of language-specific phonotactic data for transfer performance.
Findings
Limited gain from modeling crosslingual phonotactics.
Imposing an overly strong phonotactic language model can impair zero-shot transfer.
Using target-language phonotactic data in LM training improves performance.
Abstract
The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual,…
Taxonomy
Methods: Attention Model · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention
