On Modular Training of Neural Acoustics-to-Word Model for LVCSR
Zhehuai Chen, Qi Liu, Hao Li, Kai Yu

TL;DR
This paper introduces a modular training framework for neural acoustics-to-word speech recognition models, enabling separate training of acoustic and language components while maintaining end-to-end inference, leading to improved performance and efficiency.
Contribution
It proposes a novel modular training approach with separate acoustic and language models, integrated via a phone synchronous decoding module, enhancing training efficiency and recognition accuracy.
Findings
Significant performance improvement over direct A2W models on Switchboard.
Enhanced training and decoding efficiency.
Effective integration of separate acoustic and language models.
Abstract
End-to-end (E2E) automatic speech recognition (ASR) systems directly map acoustics to words using a unified model. Previous works mostly focus on E2E training of a single model that integrates the acoustic and language models into a whole. Although E2E training benefits from sequence modeling and simplified decoding pipelines, a large amount of transcribed acoustic data is usually required, and traditional acoustic and language modelling techniques cannot be utilized. In this paper, a novel modular training framework for E2E ASR is proposed to separately train neural acoustic and language models during the training stage, while still performing end-to-end inference in the decoding stage. Here, an acoustics-to-phoneme model (A2P) and a phoneme-to-word model (P2W) are trained using acoustic data and text data respectively. A phone synchronous decoding (PSD) module is inserted between A2P and P2W to reduce…
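The abstract is cut off before it explains what the PSD module does, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the A2P model emits CTC-style per-frame posteriors with a blank symbol at index 0, and that the module between A2P and P2W forwards only frames where the blank posterior is low. The names `BLANK`, `phone_synchronous_filter`, and `blank_threshold` are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index of the CTC blank symbol in each posterior vector


def phone_synchronous_filter(
    posteriors: Sequence[Sequence[float]],
    blank_threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[List[float]]:
    """Keep only frames whose blank posterior is below the threshold.

    Frames dominated by blank carry little phone information, so dropping
    them shortens the sequence passed on to the next (P2W-like) stage.
    """
    return [list(frame) for frame in posteriors if frame[BLANK] < blank_threshold]


# Toy A2P output: 4 frames over the symbols (blank, phone_1, phone_2).
posteriors = [
    [0.90, 0.05, 0.05],  # blank-dominated, dropped
    [0.10, 0.80, 0.10],  # phone_1, kept
    [0.95, 0.03, 0.02],  # blank-dominated, dropped
    [0.20, 0.10, 0.70],  # phone_2, kept
]

kept = phone_synchronous_filter(posteriors)
print(len(posteriors), "frames in,", len(kept), "frames out")
```

In this toy input, two of the four frames are blank-dominated, so the downstream model would see a sequence half as long. A real system would tune the threshold and operate on batched tensors rather than Python lists.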
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
