The Cone of Silence: Speech Separation by Localization
Teerapat Jenrungrot, Vivek Jayaram, Steve Seitz, Ira Kemelmacher-Shlizerman

TL;DR
This paper introduces a deep learning method for localizing and separating multiple speakers in multi-microphone recordings, capable of handling moving speakers and unknown counts with high accuracy even in noisy environments.
Contribution
It presents a waveform-domain deep network that localizes and separates sources within angular regions, enabling efficient binary search for multiple speakers, including unseen and moving ones.
Findings
Achieves state-of-the-art separation and localization performance.
Handles an arbitrary number of moving speakers at test time.
Performs well in high background noise conditions.
Abstract
Given a multi-microphone recording of an unknown number of speakers talking concurrently, we simultaneously localize the sources and separate the individual speakers. At the core of our method is a deep network, in the waveform domain, which isolates sources within an angular region θ ± w/2, given an angle of interest θ and angular window size w. By exponentially decreasing w, we can perform a binary search to localize and separate all sources in logarithmic time. Our algorithm allows for an arbitrary number of potentially moving speakers at test time, including more speakers than seen during training. Experiments demonstrate state-of-the-art performance for both source separation and source localization, particularly in high levels of background noise.
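The binary search over angular regions described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the trained network is replaced by a hypothetical toy function, `make_toy_energy`, that simply counts how many known source angles fall inside a region, and the starting quadrants, the energy threshold, and the minimum window size are all assumptions chosen for illustration. Only the localization step is shown; in the paper the network also outputs the separated audio for each region.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def make_toy_energy(source_angles):
    """Hypothetical stand-in for the network (an assumption, not the paper's model).

    Returns a function energy(theta, w) that counts the sources lying
    inside the angular region theta +/- w/2.
    """
    def energy(theta, w):
        return float(sum(angular_distance(theta, s) <= w / 2.0
                         for s in source_angles))
    return energy


def localize(energy, min_window=2.0, threshold=0.5):
    """Coarse-to-fine search: halve the window, keep only regions with energy."""
    # Assumed starting point: four quadrants that together cover the circle.
    window = 90.0
    active = [t for t in (-135.0, -45.0, 45.0, 135.0)
              if energy(t, window) > threshold]
    while window > min_window:
        window /= 2.0
        survivors = []
        for theta in active:
            # Each parent region splits into two child regions of half the width.
            for child in (theta - window / 2.0, theta + window / 2.0):
                if energy(child, window) > threshold:
                    survivors.append(child)
        active = survivors
    return active, window


# Two toy sources at 30 and -100 degrees (arbitrary illustrative values).
regions, final_window = localize(make_toy_energy([30.0, -100.0]))
for theta in regions:
    print(f"source near {theta:.2f} degrees (window {final_window:.2f})")
```

Because empty regions are discarded at every level, the number of queries grows with the number of halvings (logarithmic in the angular resolution) times the number of active sources, rather than with the number of fine-grained angles on the circle. A source lying exactly on a region boundary would be reported twice in this sketch; removing such duplicates is left out here.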
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Blind Source Separation Techniques
