AmbiSep: Ambisonic-to-Ambisonic Reverberant Speech Separation Using Transformer Networks

Adrian Herzog; Srikanth Raj Chetupalli; Emanuël A. P. Habets

arXiv:2206.06184 · eess.AS · June 14, 2022 · IWAENC · 1 citation

Open Access

TL;DR

AmbiSep introduces a transformer-based neural network approach for blind separation of reverberant speech signals in Ambisonic recordings, significantly improving signal quality while maintaining spatial cues.

Contribution

This work presents a novel Ambisonic-to-Ambisonic speech separation method using transformer networks with triple-path processing, advancing blind multichannel speech separation techniques.

Findings

01

Achieves 17.7 dB SI-SDR improvement on blind test set.

02

Effectively preserves spatial characteristics of separated sounds.

03

Demonstrates the effectiveness of transformer-based masking in multichannel speech separation.
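The SI-SDR improvement reported in the findings can be illustrated with a minimal sketch. The single-channel formula below is the standard scale-invariant SDR definition. The multichannel variant is an assumption made for illustration: it stacks all Ambisonic channels into one vector and uses a single scaling factor, which may differ from the exact definition used in the paper. The function names `si_sdr`, `mc_si_sdr`, and `si_sdr_improvement` are hypothetical.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the target that best explains the estimate.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    return 10.0 * np.log10(
        (np.sum(projection**2) + eps) / (np.sum(residual**2) + eps)
    )


def mc_si_sdr(estimate, target, eps=1e-8):
    """ASSUMPTION: multichannel SI-SDR computed on the stacked channels,
    i.e. one common scaling factor for all channels.
    Inputs have shape (channels, samples)."""
    return si_sdr(estimate.reshape(-1), target.reshape(-1), eps)


def si_sdr_improvement(estimate, mixture, target):
    """Gain in dB of the separated estimate over the unprocessed mixture."""
    return mc_si_sdr(estimate, target) - mc_si_sdr(mixture, target)


# Synthetic demo: 4-channel (first-order) signals, not real speech.
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 1000))
interference = rng.standard_normal((4, 1000))
mixture = target + interference
estimate = target + 0.1 * interference
print(f"SI-SDR improvement: {si_sdr_improvement(estimate, mixture, target):.1f} dB")
```

Using one scaling factor across channels keeps the inter-channel level ratios of the estimate in the score, which is one plausible way to make the metric sensitive to spatial distortion.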

Abstract

Consider a multichannel Ambisonic recording containing a mixture of several reverberant speech signals. Retrieving the reverberant Ambisonic signals corresponding to the individual speech sources blindly from the mixture is a challenging task, as it requires estimating multiple signal channels for each source. In this work, we propose AmbiSep, a deep neural network-based plane-wave domain masking approach to solve this task. The masking network uses learned feature representations and transformers in a triple-path processing configuration. We train and evaluate the proposed network architecture on a spatialized WSJ0-2mix dataset, and show that the method achieves a multichannel scale-invariant signal-to-distortion ratio improvement of 17.7 dB on the blind test set, while preserving the spatial characteristics of the separated sounds.
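The idea of plane-wave domain masking mentioned in the abstract can be sketched as follows. This is a minimal illustration under several assumptions, not the AmbiSep network: it uses first-order Ambisonics in ACN channel order with SN3D normalization, a tetrahedral grid of four plane-wave directions, and masks applied directly to time-domain samples. In the paper the masks are estimated by a transformer network in a learned feature domain; here they are simply passed in. The names `foa_encoding_matrix`, `TETRA`, and `separate_in_plane_wave_domain` are hypothetical.

```python
import numpy as np

# ASSUMPTION: four plane-wave directions at the vertices of a tetrahedron,
# given as unit vectors (x, y, z).
TETRA = np.array(
    [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float
) / np.sqrt(3.0)


def foa_encoding_matrix(directions):
    """First-order encoding matrix of shape (4, J).
    Rows are ACN channels (W, Y, Z, X) with SN3D normalization."""
    x, y, z = directions.T
    return np.stack([np.ones_like(x), y, z, x])


def separate_in_plane_wave_domain(ambi_mix, masks, directions=TETRA):
    """Mask an Ambisonic mixture in the plane-wave domain.

    ambi_mix: (4, T) first-order Ambisonic mixture.
    masks:    (S, J, T) one mask per source and plane-wave direction.
    Returns   (S, 4, T) Ambisonic signals, one per source.
    """
    encode = foa_encoding_matrix(directions)   # (4, J)
    decode = np.linalg.pinv(encode)            # (J, 4)
    plane_waves = decode @ ambi_mix            # (J, T)
    masked = masks * plane_waves[None]         # (S, J, T)
    return encode @ masked                     # back to Ambisonics, (S, 4, T)


# Demo with random data and random masks that sum to one over the sources.
rng = np.random.default_rng(0)
mix = rng.standard_normal((4, 16))
m = rng.uniform(size=(1, 4, 16))
sources = separate_in_plane_wave_domain(mix, np.concatenate([m, 1.0 - m]))
print(sources.shape)
```

Because each source output is re-encoded to Ambisonics from its masked plane-wave signals, every source keeps a full set of Ambisonic channels, which is what allows spatial cues to be retained in the output.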

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
