Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition
Guoli Ye, Vadim Mazalov, Jinyu Li, Yifan Gong

TL;DR
This paper introduces a two-pass cascading framework that runs a hybrid model in the first pass and an end-to-end (E2E) model in the second. It achieves an 8-10% relative word error rate reduction over each individual system and has the potential to keep hybrid advantages such as customization.
Contribution
The paper proposes a two-pass hybrid and E2E cascading (HEC) framework, with the hybrid model in the first pass and the E2E model in the second, to combine the strengths of both.
Findings
Achieves 8-10% relative word error rate (WER) reduction with respect to each individual system (hybrid alone or E2E alone)
Has the potential to keep hybrid system advantages such as customization and segmentation
Second pass E2E model is robust to first pass hybrid model variations
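To make the "relative WER reduction" figure above concrete, here is a minimal sketch of how WER and a relative reduction are commonly computed. The function names and the example numbers are illustrative assumptions, not taken from the paper, and the paper's own scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline at 10.0% WER and a combined system at 9.0% WER.
reduction = relative_wer_reduction(10.0, 9.0)
print(f"relative WER reduction: {reduction:.1%}")
```

Note that a relative reduction is measured against the baseline's own error rate, so the same absolute gain counts for more when the baseline WER is already low.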
Abstract
Hybrid and end-to-end (E2E) systems have their individual advantages, with different error patterns in the speech recognition results. By jointly modeling audio and text, the E2E model performs better in matched scenarios and scales well with a large amount of paired audio-text training data. The modularized hybrid model is easier for customization, and better to make use of a massive amount of unpaired text data. This paper proposes a two-pass hybrid and E2E cascading (HEC) framework to combine the hybrid and E2E model in order to take advantage of both sides, with hybrid in the first pass and E2E in the second pass. We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system. More importantly, compared with the pure E2E system, we show the proposed system has the potential to keep the advantages of hybrid system, e.g.,…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
