Learning to Separate Voices by Spatial Regions
Zhongweiyang Xu, Romit Roy Choudhury

TL;DR
This paper introduces a self-supervised, region-based voice separation method for binaural audio that supports personalization and handles multiple sources without assuming a fixed number of sources.
Contribution
It proposes a novel two-stage self-supervised framework that learns spatial region properties for personalized voice separation, relaxing fixed source number constraints.
Findings
Region-wise separation improves the handling of multiple sources.
Personalized models outperform generic supervised models.
The method shows promising results in real-world applications such as noise cancellation.
Abstract
We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids. While today's neural networks perform remarkably well (separating sources with 2 microphones), they assume a known or fixed maximum number of sources, K. Moreover, today's models are trained in a supervised manner, using training data synthesized from generic sources, environments, and human head shapes. This paper intends to relax both of these constraints at the expense of a slight alteration in the problem definition. We observe that, when a received mixture contains too many sources, it is still helpful to separate them by region, i.e., isolating signal mixtures from each conical sector around the user's head. This requires learning the fine-grained spatial properties of each region, including the signal distortions imposed by a person's head. We propose a two-stage…
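The region-wise problem definition in the abstract can be illustrated with a small sketch. It is not the paper's method: it only shows what "one mixture per region" could mean as a target, assuming each source has a known azimuth and that regions are equal-width angular sectors in the horizontal plane. The function names `region_index` and `region_mixtures`, the sector layout, and the toy signals are all hypothetical.

```python
def region_index(azimuth_deg, num_regions):
    """Map an azimuth in degrees to one of num_regions equal-width sectors.

    Assumption: sectors are equal-width slices of the horizontal plane,
    with sector 0 starting at 0 degrees. The source does not specify this.
    """
    width = 360.0 / num_regions
    idx = int((azimuth_deg % 360.0) // width)
    # Guard against floating-point wrap-around landing exactly on 360.0.
    return min(idx, num_regions - 1)


def region_mixtures(sources, num_regions):
    """Sum the signals of all sources that fall in the same sector.

    sources: list of (azimuth_deg, samples) pairs with equal-length samples.
    Returns num_regions signals; a region with no sources is all zeros.
    """
    length = len(sources[0][1]) if sources else 0
    out = [[0.0] * length for _ in range(num_regions)]
    for azimuth_deg, samples in sources:
        r = region_index(azimuth_deg, num_regions)
        for i, s in enumerate(samples):
            out[r][i] += s
    return out


# Toy example: three sources, four 90-degree regions.
toy_sources = [
    (10.0, [1.0, 2.0]),
    (50.0, [0.5, 0.5]),
    (200.0, [3.0, 3.0]),
]
toy_mix = region_mixtures(toy_sources, 4)
```

In this toy setup the number of outputs equals the number of regions, not the number of sources, so two sources in the same sector simply stay mixed in that sector's output.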
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
