Late fusion ensembles for speech recognition on diverse input audio representations

Marin Jezidžić; Matej Mihelčić

arXiv:2412.01861 · eess.AS · December 4, 2024

TL;DR

This paper investigates whether late fusion ensembles of E-Branchformer models, each trained on a different representation of the speech audio, improve automatic speech recognition on four benchmark datasets, and reports improvements of 1% to 14% over comparable state-of-the-art models.

Contribution

It demonstrates the effectiveness of ensemble methods with diverse input representations in enhancing speech recognition accuracy on standard datasets.

Findings

1. Ensembles improve ASR performance by 1-14% over state-of-the-art models.
2. Diverse audio representations contribute significantly to ensemble gains.
3. Ensemble benefits persist even when using language models.

Abstract

We explore diverse representations of speech audio and their effect on the performance of late fusion ensembles of E-Branchformer models, applied to the Automatic Speech Recognition (ASR) task. Although it is generally known that ensemble methods often improve system performance, including for speech recognition, it is interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, behave in this setting when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely used benchmark datasets (Librispeech, Aishell, Gigaspeech, and TEDLIUMv2) and show that improvements of 1% to 14% can still be achieved over the state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such an ensemble offers…
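The abstract as shown here does not spell out the fusion rule, so the following is only a minimal sketch of one common form of late fusion: each base model scores candidate transcripts independently, and the ensemble picks the transcript with the best weighted average log-score. The function name `late_fusion`, the `floor` penalty for hypotheses a model did not propose, and the toy scores are all assumptions made for illustration, not the authors' method or data.

```python
def late_fusion(nbest_per_model, weights=None, floor=-20.0):
    """Score-level late fusion over N-best lists (illustrative sketch).

    nbest_per_model: list of dicts, one per base model, mapping a
        hypothesis string to that model's log-score for it.
    weights: optional per-model weights; defaults to a uniform average.
    floor: assumed log-score for a hypothesis a model did not propose.
    """
    n_models = len(nbest_per_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models

    # Union of all hypotheses proposed by any base model.
    candidates = set()
    for nbest in nbest_per_model:
        candidates.update(nbest)

    # Weighted sum of per-model log-scores for every candidate.
    fused = {
        hyp: sum(w * nbest.get(hyp, floor)
                 for w, nbest in zip(weights, nbest_per_model))
        for hyp in candidates
    }
    best = max(fused, key=fused.get)
    return best, fused


# Toy scores (made up): three base models, each imagined as trained on a
# different representation of the same utterance.
model_a = {"the cat sat": -1.2, "the cat sad": -1.0}
model_b = {"the cat sat": -0.8, "the cat sad": -2.5}
model_c = {"the cat sat": -1.1, "a cat sat": -1.5}

best, fused = late_fusion([model_a, model_b, model_c])
print(best)
```

In this toy case the first model on its own prefers "the cat sad", but the fused score favors "the cat sat" because the other two models rank it highly. Ensembles built on diverse inputs rely on this effect: errors specific to one representation can be outvoted by the rest.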

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: E-Branchformer · Balanced Selection