Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition
Qiujia Li, Chao Zhang, Philip C. Woodland

TL;DR
This paper introduces a two-pass speech recognition system that combines frame-synchronous and label-synchronous models, exploiting their complementary strengths to reduce WER by up to 29% relative on AMI and up to 33% relative on Switchboard and RT03 without extra data.
Contribution
It proposes a novel rescoring approach that integrates frame-synchronous and label-synchronous ASR systems, improving accuracy without additional training data.
Findings
Achieves up to 29% relative WER reduction on the AMI dataset.
Attains up to 33% relative WER reduction on the Switchboard and RT03 datasets.
Demonstrates the effectiveness of combining different ASR paradigms for improved performance.
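The findings above are stated as relative, not absolute, WER reductions. As a reminder of what that means, here is a minimal sketch of the arithmetic. The function name and the example WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not taken from the paper:
# a baseline at 20.0% WER improved to 15.0% WER is a 25% relative reduction
# (and a 5.0 point absolute reduction).
example = relative_wer_reduction(20.0, 15.0)
```

A relative figure therefore depends on the baseline: the same absolute gain counts for more when the baseline WER is already low.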
Abstract
Commonly used automatic speech recognition (ASR) systems can be classified into frame-synchronous and label-synchronous categories, based on whether the speech is decoded on a per-frame or per-label basis. Frame-synchronous systems, such as traditional hidden Markov model systems, can easily incorporate existing knowledge and can support streaming ASR applications. Label-synchronous systems, based on attention-based encoder-decoder models, can jointly learn the acoustic and language information with a single model, which can be regarded as audio-grounded language models. In this paper, we propose rescoring the N-best hypotheses or lattices produced by a first-pass frame-synchronous system with a label-synchronous system in a second pass. By exploiting the complementary modelling of the different approaches, the combined two-pass systems achieve competitive performance without using any…
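To make the two-pass idea concrete, the following is a minimal sketch of N-best rescoring. It assumes that both passes produce log-domain scores and that they are combined by a simple weighted sum; the weighting scheme, the `Hypothesis` class, the `rescore_nbest` function, and the `weight` parameter are illustrative assumptions, not the combination method described in the paper. Lattice rescoring is not covered here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    words: List[str]
    first_pass_score: float  # log-domain score from the frame-synchronous system


def rescore_nbest(
    nbest: Sequence[Hypothesis],
    second_pass_scorer: Callable[[List[str]], float],
    weight: float = 0.5,
) -> Hypothesis:
    """Pick the hypothesis with the best combined score.

    second_pass_scorer stands in for a label-synchronous model that returns
    a log-domain score for a word sequence given the audio (assumed here).
    weight is a hypothetical interpolation weight in [0, 1].
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")

    def combined(hyp: Hypothesis) -> float:
        second = second_pass_scorer(hyp.words)
        return (1.0 - weight) * hyp.first_pass_score + weight * second

    return max(nbest, key=combined)


# Toy usage with made-up scores: the second pass prefers "a b c".
toy_scores = {("a", "b", "c"): -8.0, ("a", "b", "d"): -12.0}
nbest_list = [
    Hypothesis(["a", "b", "c"], first_pass_score=-10.0),
    Hypothesis(["a", "b", "d"], first_pass_score=-9.5),
]
best = rescore_nbest(nbest_list, lambda words: toy_scores[tuple(words)])
```

In this toy example the first pass alone would pick the second hypothesis, while the combined score picks the first, which is the kind of complementary behaviour the abstract refers to. In practice the weight would be tuned on held-out data.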
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
