Decoupled Structure for Improved Adaptability of End-to-End Models
Keqi Deng, Philip C. Woodland

TL;DR
This paper introduces decoupled end-to-end speech recognition models that enable flexible domain adaptation by replacing the internal language model without re-training, improving performance across different domains.
Contribution
It proposes a novel decoupled structure for E2E ASR models that allows direct internal LM replacement for domain adaptation, enhancing versatility and robustness.
Findings
Achieved 15.1% relative WER reduction on TED-LIUM 2
Achieved 17.2% relative WER reduction on AESRC2020
Maintained intra-domain performance
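The figures above are relative, not absolute, word error rate (WER) reductions. As a minimal sketch of how such a figure is computed, the helper below applies the standard definition (baseline minus adapted, divided by baseline). The function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction, in percent.

    Both arguments are WERs in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers, for illustration only: an absolute drop from
# 20.0% to 17.0% WER is a 15.0% relative reduction.
example = relative_wer_reduction(20.0, 17.0)
print(f"{example:.1f}% relative WER reduction")
```

A relative reduction depends on the baseline, so the same percentage can correspond to very different absolute WER changes on different corpora.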
Abstract
Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are…
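The idea of a replaceable internal LM can be illustrated with a toy sketch. This is not the authors' architecture: the class `DecoupledDecoder`, the additive combination of acoustic and linguistic log-scores, and the two-word vocabulary are all assumptions made purely for illustration. The sketch only shows why keeping the linguistic part separate lets it be swapped for a target-domain LM without touching the acoustic part.

```python
import math
from typing import Callable, Dict, List

LogProbs = Dict[str, float]


def log_softmax(scores: Dict[str, float]) -> LogProbs:
    """Normalise raw scores into log-probabilities."""
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {tok: s - log_z for tok, s in scores.items()}


class DecoupledDecoder:
    """Toy decoder with separate acoustic and linguistic scorers (hypothetical)."""

    def __init__(
        self,
        acoustic_fn: Callable[[object], LogProbs],
        lm_fn: Callable[[List[str]], LogProbs],
    ) -> None:
        self.acoustic_fn = acoustic_fn
        self.lm_fn = lm_fn

    def replace_lm(self, lm_fn: Callable[[List[str]], LogProbs]) -> None:
        # Swap only the linguistic part; the acoustic part is untouched.
        self.lm_fn = lm_fn

    def step(self, history: List[str], frame: object) -> LogProbs:
        acoustic = self.acoustic_fn(frame)
        linguistic = self.lm_fn(history)
        # Assumption: scores are combined by simple addition in the log domain.
        return log_softmax({tok: acoustic[tok] + linguistic[tok] for tok in acoustic})


def toy_acoustic(frame: object) -> LogProbs:
    # Nearly ambiguous acoustic evidence between the two words.
    return log_softmax({"cat": 0.1, "protein": 0.0})


def source_lm(history: List[str]) -> LogProbs:
    # Stand-in for an LM reflecting the source (training) domain.
    return log_softmax({"cat": 2.0, "protein": 0.0})


def target_lm(history: List[str]) -> LogProbs:
    # Stand-in for an LM trained on target-domain text only.
    return log_softmax({"cat": 0.0, "protein": 2.0})


def best(dist: LogProbs) -> str:
    return max(dist, key=lambda tok: dist[tok])


decoder = DecoupledDecoder(toy_acoustic, source_lm)
before = best(decoder.step([], frame=None))
decoder.replace_lm(target_lm)
after = best(decoder.step([], frame=None))
print(before, after)
```

In this toy setting the top hypothesis changes once the LM is replaced, with no re-training of the acoustic scorer, which is the property the abstract attributes to the decoupled structure.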
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
