Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages

Shiyu Zhou; Shuang Xu; Bo Xu

arXiv:1806.05059 · eess.AS · June 15, 2018 · 62 cites

Open Access

TL;DR

This paper demonstrates that a single Transformer model can effectively perform multilingual low-resource speech recognition using sub-words without pronunciation lexicons, and that incorporating language information improves accuracy.

Contribution

It introduces a single multilingual Transformer-based ASR model for low-resource languages that uses sub-words as the modeling unit and incorporates language information to improve recognition accuracy.

Findings

01

A single multilingual Transformer performs well on low-resource languages despite some language confusion.

02

Inserting the language symbol at the end of the sub-word sequence yields a larger WER reduction than inserting it at the beginning.

03

Incorporating language information yields a relative reduction in average WER of approximately 10.5% to 12.4%.

Abstract

Sequence-to-sequence attention-based models integrate an acoustic, pronunciation and language model into a single neural network, which makes them very suitable for multilingual automatic speech recognition (ASR). In this paper, we are concerned with multilingual speech recognition on low-resource languages by a single Transformer, one of the sequence-to-sequence attention-based models. Sub-words are employed as the multilingual modeling unit without using any pronunciation lexicon. First, we show that a single multilingual ASR Transformer performs well on low-resource languages despite some language confusion. We then look at incorporating language information into the model by inserting the language symbol at the beginning or at the end of the original sub-word sequence, under the condition that the language information is known during training. Experiments on CALLHOME datasets demonstrate…
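
As a rough illustration of the language-symbol idea described in the abstract, the sketch below attaches a language token to the start or the end of a target sub-word sequence. The function name `add_language_symbol`, the `<EN>`-style symbol format, and the `@@` sub-word marker are assumptions made for this example only; the paper's actual symbol inventory and preprocessing may differ.

```python
from typing import List


def add_language_symbol(subwords: List[str], lang: str, position: str = "end") -> List[str]:
    """Return a new sub-word sequence with a language symbol attached.

    The "<LANG>" symbol format is a hypothetical choice for this sketch.
    """
    symbol = f"<{lang}>"
    if position == "begin":
        # Language symbol precedes the original sub-word sequence.
        return [symbol] + list(subwords)
    if position == "end":
        # Language symbol follows the original sub-word sequence.
        return list(subwords) + [symbol]
    raise ValueError("position must be 'begin' or 'end'")


# Example target sequence (made-up sub-word split, not taken from the paper).
target = ["hel@@", "lo", "wor@@", "ld"]
with_symbol_at_start = add_language_symbol(target, "EN", position="begin")
with_symbol_at_end = add_language_symbol(target, "EN", position="end")
```

With the symbol placed at the end, a model trained on such targets would have to emit the language token itself after decoding the utterance, so the language need not be supplied at test time; placing it at the beginning leaves open the option of feeding it as a known start token when the language is available.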

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax