Independent language modeling architecture for end-to-end ASR

Van Tung Pham; Haihua Xu; Yerbolat Khassanov; Zhiping Zeng; Eng Siong Chng; Chongjia Ni; Bin Ma; Haizhou Li

arXiv:1912.00863 · cs.CL · December 3, 2019 · 6 citations

TL;DR

This paper introduces an end-to-end ASR architecture that decouples the language model from the encoder, so the language model can be trained independently on external text data. It reports relative error rate reductions of 9.3% (character error rate, Mandarin HKUST) and 22.8% (word error rate, English NSC).

Contribution

The paper proposes a new architecture that separates the language model from the encoder in end-to-end ASR, allowing independent training and external data integration.

Findings

01. Achieved 9.3% relative character error rate reduction on Mandarin HKUST.

02. Achieved 22.8% relative word error rate reduction on English NSC.

03. Effective use of external text data improves ASR performance.
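
The findings above are stated as relative reductions, which are measured against the baseline error rate rather than as absolute percentage points. The short sketch below shows that arithmetic. The function name and the example error rates are illustrative assumptions, not figures from the paper.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error rate reduction in percent.

    Both arguments are error rates in the same unit (e.g. percent CER or WER).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical numbers, chosen only to illustrate the formula:
# a baseline of 30.0% error that drops to 27.0% is a 3-point absolute
# reduction, but a 10% relative reduction.
example = relative_reduction(30.0, 27.0)
print(f"{example:.1f}% relative reduction")
```

A relative figure such as 22.8% therefore says nothing about the absolute error rates on its own; the baseline is needed to recover them.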

Abstract

The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language model are entangled, which prevents the language model from being trained separately on external text data. To address this problem, we propose a new architecture that separates the decoder subnet from the encoder output. In this way, the decoupled subnet becomes an independently trainable LM subnet, which can easily be updated using external text data. We study two strategies for updating the new architecture. Experimental results show that, 1) the independent LM architecture…
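
The decoupling idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: a bigram logit table stands in for the decoder subnet, a fixed vector stands in for the encoder/attention branch, and adding the two logit vectors is an assumed fusion rule. All class and function names are hypothetical. The point is only that the LM part never reads the acoustic input, so text-only updates are possible.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyLMSubnet:
    """Hypothetical LM subnet: next-token logits depend on the previous token only.

    A bigram logit table stands in for the recurrent decoder. It never sees
    the encoder output, so it can be trained on text alone.
    """

    def __init__(self, vocab_size, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)]
                      for _ in range(vocab_size)]

    def logits(self, prev_token):
        return list(self.table[prev_token])

    def train_on_text(self, sequences, lr=0.5, epochs=20):
        """Plain SGD on next-token cross-entropy, using text-only data."""
        for _ in range(epochs):
            for seq in sequences:
                for prev, nxt in zip(seq, seq[1:]):
                    probs = softmax(self.table[prev])
                    for k in range(self.vocab_size):
                        target = 1.0 if k == nxt else 0.0
                        self.table[prev][k] -= lr * (probs[k] - target)


def combine(lm_logits, acoustic_logits):
    """Assumed fusion rule: add the two logit vectors, then normalise."""
    return softmax([l + a for l, a in zip(lm_logits, acoustic_logits)])


lm = ToyLMSubnet(vocab_size=3)
acoustic_logits = [0.2, 0.1, 0.0]  # made-up stand-in for the acoustic branch
before = combine(lm.logits(0), acoustic_logits)
lm.train_on_text([[0, 1, 2, 0, 1, 2]])  # external text only, no audio involved
after = combine(lm.logits(0), acoustic_logits)
print(after[1] > before[1])  # token 1 after token 0 becomes more likely
```

In this sketch the text-only update changes the combined prediction while the acoustic stand-in stays untouched, which mirrors the separation the abstract describes. In a vanilla attention decoder the same update would not be possible, because the decoder state depends on the encoder output.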

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing