FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition
Szu-Jui Chen, Jiamin Xie, John H.L. Hansen

TL;DR
This paper introduces FeaRLESS, a feature refinement loss that improves the combination of diverse self-supervised learning representations in end-to-end speech recognition, leading to better performance on benchmark datasets.
Contribution
The study proposes a feature refinement loss for decorrelation that improves the fusion of multiple self-supervised learning representations (SSLRs) in end-to-end speech recognition models.
Findings
FeaRLESS outperforms systems trained without the feature refinement loss on both the WSJ and Fearless Steps Challenge (FSC) corpora.
Features extracted by different SSLRs are correlated with one another; reducing that correlation with the refinement loss lets the combined feature set be used more efficiently.
The proposed method improves robustness and accuracy in end-to-end speech recognition.
Abstract
Self-supervised learning representations (SSLR) have resulted in robust features for downstream tasks in many fields. Recently, several SSLRs have shown promising results on automatic speech recognition (ASR) benchmark corpora. However, previous studies have only shown performance for solitary SSLRs as an input feature for ASR models. In this study, we propose to investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models. In addition, we will show there are correlations between these extracted SSLRs. As such, we further propose a feature refinement loss for decorrelation to efficiently combine the set of input features. For evaluation, we show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
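The abstract does not give the form of the refinement loss, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes two feature streams with the same number of frames, measures how correlated they are with a mean squared cross-correlation penalty, and fuses them by plain concatenation. The function names, the penalty definition, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def cross_correlation(a, b, eps=1e-8):
    """Pearson cross-correlation between the columns of a (T, Da) and b (T, Db)."""
    a = (a - a.mean(axis=0)) / (a.std(axis=0) + eps)
    b = (b - b.mean(axis=0)) / (b.std(axis=0) + eps)
    return a.T @ b / a.shape[0]


def decorrelation_penalty(a, b):
    """Hypothetical penalty: mean squared cross-correlation between two streams.

    It is close to 0 when the streams are uncorrelated and grows as they
    carry more redundant information. This is an assumed stand-in, not the
    loss defined in the paper.
    """
    c = cross_correlation(a, b)
    return float(np.mean(c ** 2))


def fuse_concat(a, b):
    """One simple fusion method (assumed here): frame-wise concatenation."""
    return np.concatenate([a, b], axis=1)


# Synthetic stand-ins for two SSLR feature streams over 200 frames.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((200, 4))
# A stream that is mostly a linear function of feat_a (highly redundant).
feat_b_corr = feat_a @ rng.standard_normal((4, 3)) + 0.1 * rng.standard_normal((200, 3))
# A stream drawn independently of feat_a.
feat_b_indep = rng.standard_normal((200, 3))

penalty_corr = decorrelation_penalty(feat_a, feat_b_corr)
penalty_indep = decorrelation_penalty(feat_a, feat_b_indep)
fused = fuse_concat(feat_a, feat_b_corr)

print("penalty (redundant streams):  ", penalty_corr)
print("penalty (independent streams):", penalty_indep)
print("fused feature shape:", fused.shape)
```

In a training setup, a term of this kind would typically be added to the ASR objective with a weighting factor so that the model is pushed toward less redundant combined features; how the paper weights or applies its loss is not stated in the abstract.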
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
