Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Mingyu Cui; Jiajun Deng; Shoukang Hu; Xurong Xie; Tianzi Wang; Shujie Hu; Mengzhe Geng; Boyang Xue; Xunying Liu; Helen Meng

arXiv:2206.11596·eess.AS·June 26, 2023

TL;DR

This paper explores system combination techniques for hybrid TDNN and Conformer end-to-end ASR systems, demonstrating significant WER improvements through multi-pass rescoring and cross-adaptation on the Switchboard corpus.

Contribution

It introduces novel multi-pass rescoring and cross-adaptation methods to effectively combine hybrid and end-to-end ASR systems, achieving improved recognition accuracy.

Findings

1. Significant WER reductions of 2.5% to 3.9% absolute over Conformer alone.
2. Combined systems outperform individual systems on multiple evaluation datasets.
3. Multi-pass rescoring yields the best performance among the proposed methods.

Abstract

Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs, which were then rescored by the speaker-adapted Conformer system using a 2-way cross-system score interpolation. In cross-adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system, or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two…
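The multi-pass rescoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the 2-way score interpolation, so the code assumes a simple linear interpolation of log-domain scores, and the names `Hypothesis`, `interpolate`, `rescore_nbest` and `weight`, along with the toy scores, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    hybrid_score: float     # log-domain score from the first-pass hybrid system (assumed)
    conformer_score: float  # log-domain score from the second-pass Conformer (assumed)


def interpolate(hyp: Hypothesis, weight: float) -> float:
    """Linearly interpolate the two system scores (assumed form, not from the paper)."""
    return (1.0 - weight) * hyp.hybrid_score + weight * hyp.conformer_score


def rescore_nbest(nbest: list[Hypothesis], weight: float = 0.5) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score."""
    if not nbest:
        raise ValueError("N-best list is empty")
    return max(nbest, key=lambda h: interpolate(h, weight))


# Toy N-best list with made-up scores, for illustration only.
nbest = [
    Hypothesis("i think so", -12.0, -15.0),
    Hypothesis("i think sew", -11.5, -19.0),
    Hypothesis("eye think so", -14.0, -14.5),
]

best = rescore_nbest(nbest, weight=0.5)
print(best.text)  # i think so
```

With `weight=0.0` the sketch returns the hybrid system's own top hypothesis, and with `weight=1.0` it returns the Conformer's preferred one; intermediate weights let the two systems correct each other. In practice the weight would be tuned on held-out data.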

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
