Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

Yoshiki Masuyama; Xuankai Chang; Wangyou Zhang; Samuele Cornell; Zhong-Qiu Wang; Nobutaka Ono; Yanmin Qian; Shinji Watanabe

arXiv:2307.12231·cs.SD·July 25, 2023·1 cites

Open Access

TL;DR

This paper investigates integrating speech separation with recognition using self-supervised learning representations, demonstrating significant WER improvements in noisy, reverberant multi-speaker scenarios.

Contribution

It introduces a novel training strategy combining TF-GridNet and WavLM SSLR for improved multi-speaker recognition in challenging environments.

Findings

01. Achieved a 2.5% WER on the reverberant WHAMR! test set.

02. Outperformed the previous mask-based MVDR beamforming with filterbank features.

03. Validated the effectiveness of SSLR in multi-speaker ASR tasks.
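
For readers unfamiliar with the metric behind these findings, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, with no
    text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER of 2.5% corresponds to an average of 2.5 word errors (substitutions, deletions, and insertions combined) per 100 reference words.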

Abstract

Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance from the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation