A Comparative Analysis on ASR System Combination for Attention, CTC, Factored Hybrid, and Transducer Models
Noureldin Bayoumi, Robin Schmitt, Tina Raissi, Albert Zeyer, Ralf Schlüter, Hermann Ney

TL;DR
This paper compares various ASR system combination methods across different architectures, demonstrating how leveraging diverse models can improve speech recognition performance through a consistent two-pass rescoring approach.
Contribution
It introduces a systematic comparison of model combination techniques across multiple ASR architectures using a two-pass rescoring method for consistent evaluation.
Findings
Model combination improves recognition accuracy.
Two-pass rescoring ensures fair comparison across systems.
Different architectures benefit from complementary model integration.
Abstract
Combination approaches for automatic speech recognition (ASR) systems cover structured sentence-level or word-based merging techniques as well as combination of model scores during beam search. In this work, we compare model combination across popular ASR architectures. Our method leverages the complementary strengths of different models in exploring diverse portions of the search space. We rescore a joint hypothesis list of two model candidates. We then identify the best hypothesis through log-linear combination of these sequence-level scores. While model combination during first-pass recognition may yield improved performance, it introduces variability due to differing decoding methods, making direct comparison more challenging. Our two-pass method ensures consistent comparisons across all system combination results presented in this study. We evaluate model pair candidates with varying…
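The two-pass procedure described in the abstract (merge the hypothesis lists of two models, rescore every candidate with both models, pick the best by log-linear combination) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `combine_two_pass`, the weights `weight_a` and `weight_b`, and the toy log-probability tables are all hypothetical, and the scores are made up for the example rather than taken from the paper.

```python
from typing import Callable, List, Tuple

# A scorer maps a hypothesis (word sequence as a string) to a
# sequence-level log-score under one model. Hypothetical interface.
Scorer = Callable[[str], float]


def combine_two_pass(
    nbest_a: List[str],
    nbest_b: List[str],
    score_a: Scorer,
    score_b: Scorer,
    weight_a: float = 1.0,
    weight_b: float = 1.0,
) -> List[Tuple[str, float]]:
    """Rescore the joint hypothesis list and rank it log-linearly."""
    # Joint list: union of both first-pass lists, duplicates removed,
    # original order preserved (dicts keep insertion order).
    joint = list(dict.fromkeys(nbest_a + nbest_b))
    # Second pass: every hypothesis is scored by BOTH models, including
    # hypotheses that only one model proposed in its first pass.
    scored = [
        (hyp, weight_a * score_a(hyp) + weight_b * score_b(hyp))
        for hyp in joint
    ]
    # Highest combined log-score first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy log-scores, invented purely for illustration (not from the paper).
table_a = {"the cat sat": -2.0, "the cat sad": -2.5, "a cat sat": -6.0}
table_b = {"the cat sat": -3.0, "the cat sad": -7.0, "a cat sat": -1.5}

ranked = combine_two_pass(
    nbest_a=["the cat sat", "the cat sad"],  # first-pass list of model A
    nbest_b=["a cat sat", "the cat sat"],    # first-pass list of model B
    score_a=table_a.__getitem__,
    score_b=table_b.__getitem__,
)
best_hypothesis, best_score = ranked[0]
```

In this sketch the weights would normally be tuned on held-out data; they are left at 1.0 here only to keep the arithmetic easy to follow. Because both models score the same joint list, the comparison between model pairs does not depend on how each model's first-pass decoder works.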
