Late fusion ensembles for speech recognition on diverse input audio representations
Marin Jezidžić, Matej Mihelčić

TL;DR
This paper investigates how late fusion ensembles of E-Branchformer models trained on diverse speech audio representations can improve automatic speech recognition performance across multiple benchmark datasets, achieving up to 14% gains.
Contribution
It demonstrates the effectiveness of ensemble methods with diverse input representations in enhancing speech recognition accuracy on standard datasets.
Findings
Late fusion ensembles improve ASR performance by 1-14% over state-of-the-art models trained with comparable techniques.
Diverse audio representations contribute significantly to ensemble gains.
Ensemble benefits persist even when using language models.
Abstract
We explore diverse representations of speech audio and their effect on the performance of a late fusion ensemble of E-Branchformer models applied to the Automatic Speech Recognition (ASR) task. Although ensemble methods are generally known to improve system performance, including in speech recognition, it is interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, cope in this setting when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely used benchmark datasets (Librispeech, Aishell, Gigaspeech, and TEDLIUMv2) and show that improvements of 1-14% can still be achieved over state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such an ensemble offers…
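As a rough illustration of the late fusion idea described in the abstract, the sketch below combines the outputs of several independently trained models only at the scoring stage. The summary does not say how the paper performs fusion, so this is an assumption: each model is taken to assign a log-probability to every candidate transcript, and the ensemble picks the candidate with the highest weighted average. The representation names (`fbank`, `mfcc`, `waveform`), the scores, and the helper functions `fuse_scores` and `best_hypothesis` are hypothetical and exist only for this example.

```python
from typing import Dict, List, Optional

# Hypothesis text -> log-probability assigned by one model.
Scores = Dict[str, float]


def fuse_scores(per_model: List[Scores],
                weights: Optional[List[float]] = None,
                missing: float = -100.0) -> Scores:
    """Weighted sum of per-model log-probabilities for every candidate.

    `missing` is an assumed penalty for a candidate that a model did not score.
    """
    if not per_model:
        raise ValueError("need at least one model")
    if weights is None:
        # Default assumption: every model contributes equally.
        weights = [1.0 / len(per_model)] * len(per_model)
    if len(weights) != len(per_model):
        raise ValueError("one weight per model is required")
    # Union of all candidate transcripts proposed by any model.
    candidates = set().union(*per_model)
    return {
        hyp: sum(w * scores.get(hyp, missing)
                 for w, scores in zip(weights, per_model))
        for hyp in candidates
    }


def best_hypothesis(fused: Scores) -> str:
    """Return the candidate with the highest fused score."""
    return max(fused, key=fused.get)


# Made-up scores from three models trained on different (hypothetical)
# input representations of the same utterance.
fbank = {"the cat sat": -1.2, "the cat sad": -1.0}
mfcc = {"the cat sat": -0.8, "the cat sad": -1.5}
waveform = {"the cat sat": -0.9, "the cat sad": -1.4}

fused = fuse_scores([fbank, mfcc, waveform])
print(best_hypothesis(fused))
```

In this toy case the `fbank` model alone prefers "the cat sad", while the fused score prefers "the cat sat" because the other two models outvote it. That is the kind of error correction an ensemble over diverse representations is meant to provide.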
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
Methods: E-Branchformer · Balanced Selection
