Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition

Guoli Ye; Vadim Mazalov; Jinyu Li; Yifan Gong

arXiv:2110.04891 · cs.CL · February 23, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a two-pass cascading framework combining hybrid and end-to-end speech recognition models to leverage their respective strengths, achieving significant error rate reductions and maintaining customization capabilities.

Contribution

The paper proposes a novel two-pass hybrid and E2E cascading framework that integrates hybrid and E2E models for improved speech recognition performance.

Findings

01

Achieves 8-10% relative WER reduction with respect to each individual system (hybrid alone or E2E alone)

02

Maintains hybrid system advantages like customization and segmentation

03

Second pass E2E model is robust to first pass hybrid model variations
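
The relative WER reduction quoted in the first finding is a ratio against a baseline, not an absolute difference in percentage points. The short Python sketch below shows how such a figure is usually computed. The function name `relative_wer_reduction` and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values in percent (assumptions, not taken from the paper).
baseline = 10.0   # WER of a single system, hybrid or E2E alone
cascaded = 9.1    # WER of a two-pass cascaded system

reduction = relative_wer_reduction(baseline, cascaded)
print(f"Relative WER reduction: {reduction:.1%}")
```

Under these assumed numbers, an absolute drop of 0.9 WER points from a 10.0 baseline corresponds to a relative reduction of about 9%, which falls inside the 8-10% range the findings report.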

Abstract

Hybrid and end-to-end (E2E) systems each have their own advantages and produce different error patterns in their speech recognition results. By jointly modeling audio and text, the E2E model performs better in matched scenarios and scales well with large amounts of paired audio-text training data. The modularized hybrid model is easier to customize and makes better use of massive amounts of unpaired text data. This paper proposes a two-pass hybrid and E2E cascading (HEC) framework that combines the hybrid and E2E models to take advantage of both, with the hybrid model in the first pass and the E2E model in the second pass. We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system. More importantly, compared with the pure E2E system, we show that the proposed system has the potential to keep the advantages of the hybrid system, e.g.,…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing