Independent language modeling architecture for end-to-end ASR
Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

TL;DR
This paper introduces an end-to-end ASR architecture that decouples the language model from the encoder, so the language model can be trained independently on external text data. The reported relative error rate reductions are 9.3% (character error rate, Mandarin HKUST) and 22.8% (word error rate, English NSC).
Contribution
The paper proposes a new architecture that separates the language model from the encoder in end-to-end ASR, allowing independent training and external data integration.
Findings
Achieved 9.3% relative character error rate reduction on Mandarin HKUST.
Achieved 22.8% relative word error rate reduction on English NSC.
Effective use of external text data improves ASR performance.
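The findings above are stated as relative reductions, which measure the drop in error as a fraction of the baseline error rather than in absolute points. The sketch below shows that arithmetic. The function name `relative_reduction` and the example error rates are hypothetical and chosen only for illustration; they are not the baseline or final error rates from the paper, which this summary does not give.

```python
def relative_reduction(baseline_error, new_error):
    """Relative error rate reduction, in percent.

    Both arguments are error rates in the same unit (for example, percent CER
    or percent WER). The result says how much of the baseline error was removed.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical numbers, NOT taken from the paper:
# a baseline of 20.0% error that drops to 18.0% is a 2.0-point absolute
# reduction, which corresponds to a 10.0% relative reduction.
example = relative_reduction(20.0, 18.0)
print(f"{example:.1f}% relative reduction")
```

A relative figure can look large even when the absolute change is small, so the two should be read together when the baseline is known.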
Abstract
The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language model are entangled, which prevents the language model from being trained separately on external text data. To address this problem, in this work, we propose a new architecture that separates the decoder subnet from the encoder output. In this way, the decoupled subnet becomes an independently trainable LM subnet, which can easily be updated using the external text data. We study two strategies for updating the new architecture. Experimental results show that 1) the independent LM architecture…
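The decoupling idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyLMSubnet`, the function `fuse`, the recurrent update rule, and the element-wise fusion are all assumptions made for illustration. The only property it is meant to show is that the LM subnet's state depends on the token history alone, so it could be updated from text without any audio.

```python
import math
import random


class ToyLMSubnet:
    """Hypothetical stand-in for the decoupled LM subnet.

    Its recurrent state is a function of the previous tokens only,
    so it never sees the encoder output.
    """

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One embedding vector per token (randomly initialised for the demo).
        self.embed = [
            [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
            for _ in range(vocab_size)
        ]

    def run(self, tokens):
        """Return the LM state after consuming a sequence of token ids."""
        state = [0.0] * self.hidden_size
        for tok in tokens:
            emb = self.embed[tok]
            # Assumed toy recurrence; a real system would use an LSTM or similar.
            state = [math.tanh(0.5 * s + e) for s, e in zip(state, emb)]
        return state


def fuse(lm_state, acoustic_context):
    """Assumed fusion step: combine LM state and acoustic context element-wise."""
    return [math.tanh(l + a) for l, a in zip(lm_state, acoustic_context)]


lm = ToyLMSubnet(vocab_size=5, hidden_size=3)
history = [1, 4, 2]

# The LM state is computed from text alone, with no audio involved.
lm_state = lm.run(history)

# Two different (made-up) acoustic contexts for the same token history.
fused_a = fuse(lm_state, [0.2, -0.1, 0.4])
fused_b = fuse(lm_state, [-0.3, 0.5, 0.0])
```

In this sketch the acoustic information enters only in `fuse`, so `lm_state` is identical for both utterances while the fused outputs differ. In a vanilla attention-based decoder, by contrast, the recurrent state itself would take the encoder-derived context as input.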
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
