Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus
Szu-Jui Chen, John H.L. Hansen

TL;DR
This paper combines self-supervised learning (SSL) features through a novel deep cross-attention fusion method to improve speech recognition on naturalistic datasets, achieving a 1.1% absolute WER reduction on the FSC Phase-4 corpus.
Contribution
It introduces a new deep cross-attention fusion technique for SSL features, enhancing speech recognition performance on the Fearless Steps Apollo corpus.
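As a rough illustration of what cross-attention fusion of two SSL feature streams can look like, here is a minimal single-layer, single-head sketch in NumPy. It is not the authors' architecture, which this summary does not describe; a "deep" variant would presumably stack several such layers, but that is an assumption. The function name `cross_attention_fuse`, the projection matrices, and the feature dimensions (768 and 1024) are all hypothetical placeholders, and the random weights stand in for parameters that would normally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(feat_a, feat_b, w_q, w_k, w_v):
    """Fuse two frame-level feature streams with one cross-attention layer.

    feat_a: (T_a, d_a) features from one SSL model, used as queries.
    feat_b: (T_b, d_b) features from another SSL model, used as keys/values.
    Returns the fused features (T_a, d_model) and the attention map (T_a, T_b).
    """
    q = feat_a @ w_q                      # (T_a, d_model)
    k = feat_b @ w_k                      # (T_b, d_model)
    v = feat_b @ w_v                      # (T_b, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)       # each query frame attends over stream B
    fused = q + attn @ v                  # residual connection keeps stream A's content
    return fused, attn

# Hypothetical shapes: 50 frames, two SSL models with different feature sizes.
rng = np.random.default_rng(0)
T, d_a, d_b, d_model = 50, 768, 1024, 256
feat_a = rng.standard_normal((T, d_a))
feat_b = rng.standard_normal((T, d_b))
# Random projections stand in for learned weights (assumption for illustration).
w_q = rng.standard_normal((d_a, d_model)) / np.sqrt(d_a)
w_k = rng.standard_normal((d_b, d_model)) / np.sqrt(d_b)
w_v = rng.standard_normal((d_b, d_model)) / np.sqrt(d_b)

fused, attn = cross_attention_fuse(feat_a, feat_b, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

The fused sequence keeps the time resolution of the query stream, so it can be passed to a downstream ASR encoder in place of a single model's features.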
Findings
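The proposed fusion method outperforms the previously proposed Feature Refinement Loss and fusion approaches, which were less effective on the FSC Phase-4 corpus.
It achieves a 1.1% absolute improvement in WER on the FSC Phase-4 corpus.
It demonstrates effectiveness across diverse naturalistic speech datasets, with CHiME-6 included in the evaluation alongside FSC Phase-4.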
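Because the headline result is an absolute WER improvement, it may help to see how WER is computed and how absolute and relative reductions differ. The sketch below uses the standard word-level edit distance; the helper names `word_error_rate` and `wer_reduction` and the toy transcripts are illustrative assumptions, not data or code from the paper. The summary does not give the baseline WER, so the relative reduction corresponding to the reported 1.1% absolute gain cannot be derived here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def wer_reduction(baseline_wer, improved_wer):
    """Return (absolute, relative) WER reduction; both are fractions, not percentages."""
    absolute = baseline_wer - improved_wer
    relative = absolute / baseline_wer
    return absolute, relative

# Toy transcripts (invented for illustration only).
reference = "a b c d"
baseline = word_error_rate(reference, "a x c")     # one substitution, one deletion
improved = word_error_rate(reference, "a b c")     # one deletion
absolute, relative = wer_reduction(baseline, improved)
print(baseline, improved, absolute, relative)
```

An absolute reduction is a difference in percentage points, while a relative reduction divides that difference by the baseline, so the same absolute gain is a larger relative gain when the baseline WER is low.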
Abstract
Using self-supervised learning (SSL) models has significantly improved performance for downstream speech tasks, surpassing the capabilities of traditional hand-crafted features. This study investigates the amalgamation of SSL models, with the aim of leveraging their individual strengths and refining the extracted features to achieve improved speech recognition models for naturalistic scenarios. Our research investigates the massive naturalistic Fearless Steps (FS) APOLLO resource, with particular focus on the FS Challenge (FSC) Phase-4 corpus, providing the inaugural analysis of this dataset. Additionally, we incorporate the CHiME-6 dataset to evaluate performance across diverse naturalistic speech scenarios. While exploring previously proposed Feature Refinement Loss and fusion methods, we found these methods to be less effective on the FSC Phase-4 corpus. To address this, we introduce a…
