Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments

Renana Opochinsky; Mordehay Moradi; Sharon Gannot

arXiv:2401.03448·eess.AS·January 9, 2024·1 cites

TL;DR

This paper introduces Sep-TFAnet, a single-microphone speech separation network with TF attention, designed for noisy, reverberant environments, incorporating VAD for improved performance in real-world human-robot interactions.

Contribution

The work presents a novel TF attention-based separation network with VAD integration, tailored for online operation and improved generalization to real-world acoustic conditions.

Findings

1. Outperforms competing methods in noisy and reverberant environments
2. Demonstrates better generalization to real recordings from a humanoid robot
3. Supports online processing suitable for human-robot interactions

Abstract

Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, aimed at noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed Sep-TFAnet$^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use short-time Fourier…
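The abstract contrasts a learned encoder and decoder with a fixed short-time Fourier representation. The sketch below illustrates that general idea only and is not the authors' implementation: a NumPy STFT serves as the fixed "encoder", per-speaker masks are applied to the mixture spectrogram, and an inverse STFT serves as the "decoder". All function names, the frame length, the hop size, and the oracle ratio masks (computed from the clean sources as a stand-in for the output of a separation network) are assumptions made for illustration.

```python
import numpy as np

# Hypothetical parameters, not taken from the paper.
N_FFT = 256
HOP = N_FFT // 2  # 50% overlap


def stft(x, n_fft=N_FFT, hop=HOP):
    """Fixed 'encoder': periodic-Hann STFT, shape (frames, bins)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann sums to 1 at 50% overlap
    # Pad so every original sample is covered by exactly two frames.
    pad_right = n_fft // 2 + (-len(x)) % hop
    x = np.pad(x, (n_fft // 2, pad_right))
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, length, n_fft=N_FFT, hop=HOP):
    """Fixed 'decoder': inverse FFT per frame followed by overlap-add."""
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
    return out[n_fft // 2 : n_fft // 2 + length]


def oracle_ratio_masks(s1, s2, eps=1e-8):
    """Stand-in for a network output: magnitude ratio masks from clean sources."""
    mag1, mag2 = np.abs(stft(s1)), np.abs(stft(s2))
    total = mag1 + mag2 + eps
    return mag1 / total, mag2 / total


def separate_with_masks(mixture, masks):
    """Apply each TF mask to the mixture spectrogram and resynthesize."""
    mix_spec = stft(mixture)
    return [istft(mask * mix_spec, len(mixture)) for mask in masks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s1 = rng.standard_normal(1000)
    s2 = rng.standard_normal(1000)
    mixture = s1 + s2
    est1, est2 = separate_with_masks(mixture, oracle_ratio_masks(s1, s2))
    print(est1.shape, est2.shape)
```

In a trained system the masks would come from the network rather than from the clean sources, which are unavailable at inference time. The fixed transform has no trainable parameters, so only the mask estimator would need to be learned in a setup like this one.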

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing