Exploiting spatial information with the informed complex-valued spatial autoencoder for target speaker extraction
Annika Briegleb, Mhd Modar Halimeh, Walter Kellermann

TL;DR
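This paper extends the complex-valued spatial autoencoder (COSPA), a neural spectro-spatial filter, for target speaker extraction by informing the network of the target speaker's position, which makes the extraction of a target speaker from multichannel mixtures more effective and the learned spatial filtering easier to interpret.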
Contribution
The paper introduces iCOSPA, an informed complex-valued spatial autoencoder that is given the target speaker's position, improving spatial selectivity and extraction performance.
Findings
iCOSPA effectively and flexibly extracts a target speaker from a mixture of speakers.
The architecture learns pronounced spatial selectivity patterns.
Performance depends significantly on the training target and the reference signal.
Abstract
In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, it has been shown that for neural spatial filtering a joint approach of spectro-spatial filtering is more beneficial. In this contribution, we investigate the spatial filtering performed by such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) for the task of target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. We show that the resulting informed COSPA (iCOSPA) effectively and flexibly extracts a target speaker from a mixture of speakers. We also find that the proposed architecture is well capable of learning pronounced spatial selectivity patterns and show that the results depend significantly on the…
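To make the notion of a "spatial selectivity pattern" concrete, the following minimal sketch evaluates the beam pattern of a classical delay-and-sum beamformer steered toward a known target direction. This is not the iCOSPA network, which learns a time-varying spectro-spatial filter; it only illustrates what it means for a multichannel filter to favor one direction. The array geometry, frequency, target direction, and all function names are hypothetical choices made for this illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed propagation speed in air


def steering_vector(doa_deg, freq_hz, mic_positions_m):
    """Far-field steering vector of a linear array at a single frequency."""
    cos_doa = math.cos(math.radians(doa_deg))
    return [
        cmath.exp(-2j * math.pi * freq_hz * pos * cos_doa / SPEED_OF_SOUND)
        for pos in mic_positions_m
    ]


def beam_response(weights, doa_deg, freq_hz, mic_positions_m):
    """Magnitude |w^H d(theta)| of the filter output for a plane wave from doa_deg."""
    d = steering_vector(doa_deg, freq_hz, mic_positions_m)
    return abs(sum(w.conjugate() * x for w, x in zip(weights, d)))


# Hypothetical setup: 4 microphones, 4 cm spacing, target at 60 degrees, 2 kHz.
mic_positions_m = [0.04 * m for m in range(4)]
target_doa_deg = 60.0
freq_hz = 2000.0

# Delay-and-sum weights: the steering vector of the target, normalized by
# the number of microphones so that the target direction has unit gain.
target_vector = steering_vector(target_doa_deg, freq_hz, mic_positions_m)
weights = [x / len(mic_positions_m) for x in target_vector]

# Spatial selectivity pattern: filter gain as a function of arrival direction.
angles_deg = list(range(0, 181, 5))
pattern = [beam_response(weights, a, freq_hz, mic_positions_m) for a in angles_deg]
best_angle = angles_deg[pattern.index(max(pattern))]

for angle, gain in zip(angles_deg, pattern):
    print(f"{angle:3d} deg  gain {gain:.3f}")
```

In this fixed, position-informed filter the gain peaks in the target direction and falls off elsewhere. A learned filter such as the one studied in the paper can in principle form patterns that vary over time and frequency, which a fixed beamformer like this one cannot.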
Taxonomy
Topics: Speech and Audio Processing · Underwater Acoustics Research · Speech Recognition and Synthesis
