Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition

Aswin Shanmugam Subramanian; Chao Weng; Shinji Watanabe; Meng Yu; Dong Yu

arXiv:2102.07955 · eess.AS · November 30, 2021 · 1 cites

TL;DR

This paper introduces a deep learning method for multi-source localization that improves multi-talker speech recognition by accurately estimating speaker directions and integrating this information into ASR systems.

Contribution

It proposes a source splitting neural network with utterance-level prediction and a novel loss function, significantly enhancing localization accuracy and ASR performance.

Findings

01. Achieved 6.3% WER on simulated mixtures with localization input

02. Outperformed baseline ASR without localization features

03. Validated effectiveness on real overlapping speech data

Abstract

Multi-source localization is an important and challenging technique for multi-talker conversation analysis. This paper proposes a novel supervised learning method using deep neural networks to estimate the direction of arrival (DOA) of all the speakers simultaneously from the audio mixture. At the heart of the proposal is a source splitting mechanism that creates source-specific intermediate representations inside the network. This allows our model to give source-specific posteriors as the output unlike the traditional multi-label classification approach. Existing deep learning methods perform a frame level prediction, whereas our approach performs an utterance level prediction by incorporating temporal selection and averaging inside the network to avoid post-processing. We also experiment with various loss functions and show that a variant of earth mover distance (EMD) is very…
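
To make the loss-function remark concrete, here is a minimal sketch of a one-dimensional earth mover distance between two posteriors over ordered DOA bins. It assumes equally spaced, non-circular bins and uses the standard identity that the 1-D EMD equals the summed absolute difference of the two cumulative distributions. The function name `emd_1d` and the five-bin example are hypothetical, and the specific EMD variant used in the paper is not given in this excerpt, so this is an illustration rather than the authors' loss.

```python
from itertools import accumulate


def emd_1d(p, q):
    """1-D earth mover distance between two discrete distributions.

    Assumes p and q are defined over the same ordered, equally spaced bins
    (e.g. DOA classes) and each sums to the same total mass. The result is
    measured in units of bins. Wrap-around of angles is NOT handled here.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    cdf_p = accumulate(p)  # running sum of p
    cdf_q = accumulate(q)  # running sum of q
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical five-bin example: the true direction falls in bin 2.
target = [0, 0, 1, 0, 0]
near = [0, 0, 0, 1, 0]  # prediction one bin away
far = [1, 0, 0, 0, 0]   # prediction two bins away

print(emd_1d(target, near))  # 1
print(emd_1d(target, far))   # 2
```

Unlike a plain classification loss, which treats every wrong bin the same, this distance grows with how far the predicted mass sits from the target bin, which is one plausible reason a distance-aware loss suits DOA estimation.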

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing