FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition

Szu-Jui Chen; Jiamin Xie; John H.L. Hansen

arXiv:2206.15056·cs.SD·July 1, 2022

TL;DR

This paper introduces FeaRLESS, a feature refinement loss that decorrelates diverse self-supervised learning representations so they combine more effectively in end-to-end speech recognition, improving on systems trained without the loss on the WSJ and Fearless Steps Challenge (FSC) corpora.

Contribution

The study proposes a novel feature refinement loss for decorrelation, enhancing the fusion of multiple SSLRs in speech recognition models.

Findings

01

FeaRLESS outperforms systems without feature refinement loss on WSJ and FSC datasets.

02

Features extracted from different SSLRs are correlated with one another, which motivates a decorrelation loss for combining them more efficiently.

03

The proposed method improves robustness and accuracy in end-to-end speech recognition.

Abstract

Self-supervised learning representations (SSLR) have resulted in robust features for downstream tasks in many fields. Recently, several SSLRs have shown promising results on automatic speech recognition (ASR) benchmark corpora. However, previous studies have only shown performance for solitary SSLRs as an input feature for ASR models. In this study, we propose to investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models. In addition, we will show there are correlations between these extracted SSLRs. As such, we further propose a feature refinement loss for decorrelation to efficiently combine the set of input features. For evaluation, we show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
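The idea of penalizing correlation between two feature streams before fusing them can be illustrated with a minimal NumPy sketch. This is not the paper's loss, whose exact form is not given in this summary; it assumes a simple penalty (the mean squared Pearson cross-correlation between feature dimensions) and concatenation as the fusion method. The function names `cross_correlation`, `decorrelation_penalty`, and `concat_fusion` are hypothetical and chosen for illustration only.

```python
import numpy as np


def cross_correlation(a, b, eps=1e-8):
    """Pearson cross-correlation between the columns of a (T, Da) and b (T, Db).

    Rows are time frames, columns are feature dimensions.
    """
    # Standardize each feature dimension over time.
    a = (a - a.mean(axis=0)) / (a.std(axis=0) + eps)
    b = (b - b.mean(axis=0)) / (b.std(axis=0) + eps)
    # Entry (i, j) is the correlation between a[:, i] and b[:, j].
    return a.T @ b / a.shape[0]


def decorrelation_penalty(a, b):
    """Assumed penalty: mean squared cross-correlation (0 when uncorrelated)."""
    c = cross_correlation(a, b)
    return float(np.mean(c ** 2))


def concat_fusion(a, b):
    """One simple fusion method: concatenate along the feature axis."""
    return np.concatenate([a, b], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.standard_normal((2000, 4))   # stand-in for one SSLR
    feats_b = rng.standard_normal((2000, 4))   # stand-in for another SSLR
    print("independent streams:", decorrelation_penalty(feats_a, feats_b))
    print("identical streams:  ", decorrelation_penalty(feats_a, feats_a))
    print("fused shape:", concat_fusion(feats_a, feats_b).shape)
```

In a training setup, a term like this would be added to the ASR objective with a weighting factor, so that the model is encouraged to keep the combined features complementary rather than redundant. The weighting and the point in the network where the penalty is applied are design choices this sketch does not attempt to reproduce.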

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing