How Phonotactics Affect Multilingual and Zero-shot ASR Performance

Siyuan Feng; Piotr Żelasko; Laureano Moro-Velázquez; Ali Abavisani; Mark Hasegawa-Johnson; Odette Scharenborg; Najim Dehak

arXiv:2010.12104 · eess.AS · June 8, 2021


TL;DR

This paper investigates how phonotactic differences affect multilingual and zero-shot speech recognition. It finds that modeling crosslingual phonotactics offers limited benefit and that training the language model on target-language phonotactic data improves performance.

Contribution

The paper provides an extensive evaluation of phonotactic effects on zero-shot ASR using hybrid models with a separate acoustic model (AM) and language model (LM), highlighting the importance of language-specific phonotactic data for transfer performance.

Findings

01. Limited gain from modeling crosslingual phonotactics.

02. Overly strong models can impair zero-shot transfer.

03. Using target-language phonotactic data in LM training improves performance.

Abstract

The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual,…
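To make the notion of a phonotactic mismatch concrete, the following is a minimal sketch of a phone-level bigram LM with add-one smoothing. It is not the authors' Kaldi setup: the phone strings, the four-phone inventory, and the helper names `train_bigram` and `perplexity` are all hypothetical and chosen only for illustration. The toy "target" language allows only consonant-vowel syllables, while the "other" language allows consonant clusters.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram(sequences):
    """Count phone bigrams and their left contexts, with boundary markers."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def perplexity(sequences, model, num_outcomes):
    """Per-token perplexity under an add-one smoothed bigram model."""
    bigrams, contexts = model
    log_prob, n_tokens = 0.0, 0
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + num_outcomes)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Hypothetical toy data: a shared inventory of four phones.
phones = ["k", "a", "t", "s"]
num_outcomes = len(phones) + 1  # any phone or the end-of-sequence marker

# Target language: strictly consonant-vowel syllables (assumption).
target_train = [list("kata"), list("taka"), list("sata")]
# Other language: consonant clusters and final consonants (assumption).
other_train = [list("stak"), list("ksat"), list("task")]
# Held-out target-language phone strings.
target_test = [list("tasa"), list("kaka")]

target_lm = train_bigram(target_train)
other_lm = train_bigram(other_train)
pooled_lm = train_bigram(target_train + other_train)

ppl_target = perplexity(target_test, target_lm, num_outcomes)
ppl_pooled = perplexity(target_test, pooled_lm, num_outcomes)
ppl_other = perplexity(target_test, other_lm, num_outcomes)

print(f"target-only LM: {ppl_target:.2f}")
print(f"pooled LM:      {ppl_pooled:.2f}")
print(f"other-only LM:  {ppl_other:.2f}")
```

On this toy data the target-only LM gives the lowest perplexity on the held-out target strings, the pooled LM is slightly worse, and the LM trained only on the other language is worst. This only illustrates what a phonotactic mismatch looks like to a phone-level LM; it says nothing about the magnitude of the effects reported in the paper.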


Code & Models

Repositories

pzelasko/kaldi (official)


Taxonomy

Methods: Attention Model · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention