Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation
Shaoshi Ling, Guoli Ye, Rui Zhao, Yifan Gong

TL;DR
This paper introduces the HAED model, which separates the acoustic and language models in attention-based encoder-decoder speech recognition, enabling efficient text-only adaptation and yielding a 23% relative WER improvement when out-of-domain text is used for adaptation.
Contribution
The proposed HAED model keeps the acoustic and language models modular, which allows effective text-based adaptation in AED systems. Such adaptation is difficult when the two models are jointly optimized end to end.
Findings
23% relative WER reduction with out-of-domain text adaptation
Minor WER degradation on the general test set
Preserves modularity of traditional hybrid systems
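For readers unfamiliar with the metric, a relative WER reduction is the drop in WER divided by the baseline WER. The short sketch below shows only the arithmetic; the function name and the two WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical values chosen only to illustrate the arithmetic (not from the paper).
baseline = 20.0  # WER (%) before text adaptation
adapted = 15.4   # WER (%) after text adaptation

reduction = relative_wer_reduction(baseline, adapted)
print(f"{reduction:.0%}")  # prints 23%
```

Note that a 23% relative reduction is not the same as a 23-point absolute drop: in this toy case the absolute change is 4.6 points.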
Abstract
The attention-based encoder-decoder (AED) speech recognition model has been widely successful in recent years. However, the joint optimization of the acoustic model and language model in an end-to-end manner has created challenges for text adaptation. In particular, effective, quick and inexpensive adaptation with text input has become a primary concern for deploying AED systems in industry. To address this issue, we propose a novel model, the hybrid attention-based encoder-decoder (HAED) speech recognition model, which preserves the modularity of conventional hybrid automatic speech recognition systems. Our HAED model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques. We demonstrate that the proposed HAED model yields 23% relative Word Error Rate (WER) improvements when out-of-domain text data is used for…
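To make "conventional text-based language model adaptation" concrete, the sketch below shows linear interpolation of a general language model with one trained on in-domain text, which is one common technique of this kind. This summary does not say which techniques the authors used, so this is an illustration under that assumption rather than the paper's method. The function name, the toy vocabulary, the probabilities, and the weight `lam` are all hypothetical.

```python
def interpolate_lm_probs(p_general: dict, p_domain: dict, lam: float) -> dict:
    """Linearly interpolate two next-token distributions over the same vocabulary.

    lam is the weight on the domain LM; (1 - lam) goes to the general LM.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if p_general.keys() != p_domain.keys():
        raise ValueError("both distributions must share one vocabulary")
    return {tok: (1.0 - lam) * p_general[tok] + lam * p_domain[tok]
            for tok in p_general}


# Toy next-token distributions (hypothetical, for illustration only).
p_general = {"call": 0.6, "stent": 0.1, "the": 0.3}  # general-purpose LM
p_domain = {"call": 0.2, "stent": 0.6, "the": 0.2}   # LM trained on domain text

p_adapted = interpolate_lm_probs(p_general, p_domain, lam=0.5)
for token, prob in p_adapted.items():
    print(token, round(prob, 3))
```

Because only text is needed to train the domain model, an adaptation of this kind requires no paired audio, which is the practical benefit a separable language model offers.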
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing
