Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments
Renana Opochinsky, Mordehay Moradi, Sharon Gannot

TL;DR
This paper introduces Sep-TFAnet, a single-microphone speech separation network with time-frequency (TF) attention, designed for noisy, reverberant environments. A variant integrates a voice activity detector (VAD) to improve performance in real-world human-robot interaction.
Contribution
The work presents a novel TF attention-based separation network with VAD integration, tailored for online operation and improved generalization to real-world acoustic conditions.
Findings
Outperforms competing methods in noisy and reverberant environments
Demonstrates better generalization to real recordings from a humanoid robot
Supports online processing suitable for human-robot interactions
Abstract
Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, aimed at noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network that incorporates a voice activity detector (VAD). The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use short-time Fourier…
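To make the idea of a fixed short-time Fourier front end concrete, here is a minimal sketch of masking in the STFT domain. It is an illustration under stated assumptions, not the Sep-TFAnet implementation: it assumes a mask-based design in the spirit of Conv-TasNet, and the masks are hand-made placeholders standing in for whatever a separation network would estimate. The helper names (`hann`, `dft`, `stft`, `istft`, `apply_mask`), the frame length of 8, and the 50% overlap are all hypothetical choices made for readability.

```python
import cmath
import math

def hann(n_fft):
    # Periodic Hann window; copies shifted by n_fft // 2 sum to exactly 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

def dft(frame, inverse=False):
    # Naive O(N^2) DFT, adequate for a small illustration.
    n = len(frame)
    sign = 1 if inverse else -1
    out = [sum(frame[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def stft(x, n_fft=8):
    # Fixed (non-learned) encoder: windowed frames with 50% overlap.
    hop = n_fft // 2
    edge = [0.0] * hop
    tail = [0.0] * ((-len(x)) % hop)  # pad up to a multiple of the hop
    padded = edge + list(x) + tail + edge
    win = hann(n_fft)
    return [dft([padded[s + n] * win[n] for n in range(n_fft)])
            for s in range(0, len(padded) - n_fft + 1, hop)]

def istft(spec, length, n_fft=8):
    # Fixed decoder: inverse DFT per frame, then plain overlap-add.
    hop = n_fft // 2
    out = [0.0] * (hop * (len(spec) + 1))
    for i, frame in enumerate(spec):
        for n, v in enumerate(dft(frame, inverse=True)):
            out[i * hop + n] += v.real
    return out[hop:hop + length]

def apply_mask(spec, mask):
    # Element-wise real-valued mask over time-frequency bins.
    return [[m * v for m, v in zip(m_row, s_row)]
            for m_row, s_row in zip(mask, spec)]

# Toy mixture of a slow and a fast sinusoid.
mixture = [math.sin(0.3 * t) + 0.5 * math.sin(2.5 * t) for t in range(32)]
spec = stft(mixture)

# Placeholder masks (NOT network outputs): low bins versus the rest.
low = [[1.0 if min(k, 8 - k) <= 1 else 0.0 for k in range(8)] for _ in spec]
high = [[1.0 - m for m in row] for row in low]

est_low = istft(apply_mask(spec, low), len(mixture))
est_high = istft(apply_mask(spec, high), len(mixture))
recon = istft(spec, len(mixture))
```

Because the two placeholder masks sum to one in every bin and the transform is linear, the two estimates add back up to the mixture. In a trained system the masks would instead come from the separation network, and the transform parameters would be chosen for the target sampling rate and latency.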
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing
