Decoupled Structure for Improved Adaptability of End-to-End Models

Keqi Deng; Philip C. Woodland

arXiv:2308.13345 · eess.AS · August 28, 2023

TL;DR

This paper introduces decoupled end-to-end speech recognition models that enable flexible domain adaptation by replacing the internal language model without re-training, improving performance across different domains.

Contribution

The paper proposes a novel decoupled structure for E2E ASR models in which the internal LM can be replaced directly for domain adaptation, making the models more versatile and robust under domain shift.

Findings

01. Achieved 15.1% relative WER reduction on TED-LIUM 2

02. Achieved 17.2% relative WER reduction on AESRC2020

03. Maintained intra-domain performance
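
The figures above are relative, not absolute, word error rate (WER) reductions. As a minimal sketch of how such a figure is usually computed, the snippet below divides the WER improvement by the baseline WER. The function name and the example WER values are hypothetical and chosen for illustration only; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values (NOT from the paper): baseline 20.0% WER, adapted 17.0% WER.
example = relative_wer_reduction(20.0, 17.0)
print(f"{example:.1f}% relative WER reduction")
```

With these made-up inputs, a 3-point absolute drop corresponds to a 15% relative reduction, which shows why relative figures should always be read together with the baseline they refer to.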

Abstract

Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling