Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Zhiping Zeng; Van Tung Pham; Haihua Xu; Yerbolat Khassanov; Eng Siong Chng; Chongjia Ni; Bin Ma

arXiv:2005.10407 · eess.AS · May 29, 2020

Open Access

TL;DR

This paper introduces a hybrid Transformer-LSTM architecture for low-resource end-to-end speech recognition that uses extra text data and cross-lingual transfer learning to reduce word error rate and speed up inference.

Contribution

The paper proposes a hybrid Transformer-LSTM model that combines the encoding capacity of the Transformer network with an LSTM-based independent language model network, so that extra text data and transfer learning can improve low-resource ASR performance.

Findings

01 · 24.2% relative WER reduction over previous LSTM-based architecture

02 · 25.4% relative WER reduction via transfer learning from a resource-rich language

03 · 11.9% relative WER improvement over vanilla Transformer ASR

Abstract

In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under a cross-lingual transfer learning setting. To this end, we extend our prior work [1] and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data thanks to its LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain…
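The relative WER figures in the abstract can be read as follows; this is a minimal arithmetic sketch, not the authors' evaluation code. It assumes that the 25.4% reduction is measured against the hybrid model trained on limited labeled data (as "Starting from this" suggests), so the two reductions compound. The function names and the 40.0% baseline WER are hypothetical and chosen only for illustration; the paper's absolute WERs are not given here.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def compound_reductions(*reductions: float) -> float:
    """Overall relative reduction when each reduction applies to the previous result."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# Hypothetical baseline WER (percent); NOT a number from the paper.
baseline = 40.0
after_hybrid = baseline * (1 - 0.242)        # 24.2% relative reduction
after_transfer = after_hybrid * (1 - 0.254)  # further 25.4% relative reduction

# Overall reduction relative to the original baseline, under the compounding assumption.
overall = compound_reductions(0.242, 0.254)
print(f"overall relative reduction: {overall:.1%}")
```

Under that assumption the two steps together amount to roughly a 43.5% relative WER reduction over the previous LSTM-based architecture, independent of the baseline value chosen.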

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Sigmoid Activation · Dropout