BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge

Martin Kocour; Guillermo Cámbara; Jordi Luque; David Bonet; Mireia Farrús; Martin Karafiát; Karel Veselý; Jan "Honza" Černocký

arXiv:2101.12729·eess.AS·February 1, 2021

Open Access

TL;DR

This paper describes the BUT and Telefónica Research ASR systems for the Albayzin 2020 Challenge. It compares hybrid and end-to-end models, examines SpecAugment, an n-gram language model, and Demucs source separation, and reports 23.33% WER for a fusion of the best systems in the official evaluation.

Contribution

It introduces a fusion of hybrid and end-to-end ASR models with source separation and data augmentation for improved speech recognition performance.

Findings

01

A fusion of the best systems achieved 23.33% WER in the official Albayzin 2020 evaluations.

02

Demucs source separation improves recognition in noisy environments.

03

SpecAugment is explored in the hybrid models, and an additional n-gram language model is applied to the end-to-end model to improve word error rates.
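For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only. The function name `word_error_rate` is hypothetical, and the official Albayzin scoring pipeline may apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: no text normalization (casing, punctuation) is
    applied, which real scoring tools typically perform.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 23.33% corresponds to roughly one word-level error for every four to five reference words.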

Abstract

This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems for the Albayzin 2020 Challenge. We compare approaches based on either hybrid or end-to-end models. In hybrid modelling, we explore the impact of a SpecAugment layer on performance. For end-to-end modelling, we used a convolutional neural network with gated linear units (GLUs). The performance of this model is also evaluated with an additional n-gram language model to improve word error rates. We further inspect source separation methods to extract speech from noisy environments (e.g., TV shows). More precisely, we assess the effect of using a neural-based music separator named Demucs. A fusion of our best systems achieved 23.33% WER in the official Albayzin 2020 evaluations. Aside from techniques used in our final submitted systems, we also describe our efforts in retrieving…
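The gated linear unit mentioned in the abstract can be illustrated with a small sketch. In its commonly used form, a GLU splits its input into two halves and multiplies the first half element-wise by the sigmoid of the second. The code below assumes that standard formulation. The helper names `stable_sigmoid` and `glu` are hypothetical, and the sketch does not reproduce the paper's convolutional architecture or its layer sizes.

```python
import math
from typing import List


def stable_sigmoid(z: float) -> float:
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def glu(x: List[float]) -> List[float]:
    """Gated linear unit over a flat vector.

    The first half of x carries the values, the second half the gates:
    output[i] = values[i] * sigmoid(gates[i]).
    """
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    values, gates = x[:half], x[half:]
    return [v * stable_sigmoid(g) for v, g in zip(values, gates)]


if __name__ == "__main__":
    # Gates of 0 give sigmoid(0) = 0.5, so each value is halved.
    print(glu([2.0, 4.0, 0.0, 0.0]))
```

In a convolutional model the split is usually taken along the channel dimension of a layer's output, so the gate learns which features to pass on at each time step.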

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing