End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions
Aviad Eisenberg, Sharon Gannot, Shlomo E. Chazan

TL;DR
This paper presents a multi-microphone speaker extraction method that uses the instantaneous relative transfer function (RTF) as its spatial cue. In reverberant environments, this cue outperforms both a direction of arrival (DOA)-based cue and a spectral embedding.
Contribution
It introduces a novel RTF-based spatial cue for end-to-end multi-microphone speaker extraction, demonstrating superior performance over DOA and spectral embedding methods.
Findings
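The instantaneous RTF-based cue outperforms the DOA-based cue in the reported experiments.
Spatial cues yield better extraction performance than the spectral-based cue.
The RTF-based method is effective in reverberant environments with multiple speakers and directional noise.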
Abstract
This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. In this work, we propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded in the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with direction of arrival (DOA)-based spatial cue and the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that using spatial cues yields better performance than the spectral-based cue and that the instantaneous RTF outperforms the DOA-based spatial cue.
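To make the instantaneous RTF concrete, the sketch below uses the standard definition: the per-bin ratio between each microphone's STFT and that of a reference microphone. It is a minimal illustration and not the authors' implementation. The function name `instantaneous_rtf`, the `(mics, freqs, frames)` array layout, the regularizer `eps`, and the toy data are all assumptions introduced here. This summary does not specify how the paper estimates the RTF or feeds it to the network.

```python
import numpy as np

def instantaneous_rtf(stft, ref_mic=0, eps=1e-8):
    """Per-bin ratio of each microphone's STFT to the reference microphone.

    stft: complex array of shape (mics, freqs, frames) (assumed layout).
    Returns an array of the same shape; the reference channel is ~1.
    """
    ref = stft[ref_mic]  # shape (freqs, frames)
    # X_m / X_ref written as X_m * conj(X_ref) / |X_ref|^2, with a small
    # eps so near-silent reference bins do not divide by zero.
    return stft * np.conj(ref) / (np.abs(ref) ** 2 + eps)

# Toy check (hypothetical data): one source observed through a fixed
# per-frequency transfer function at each of 4 microphones.
rng = np.random.default_rng(0)
n_mics, n_freqs, n_frames = 4, 129, 50

# Magnitudes are kept away from zero so the ratio is well conditioned.
source = (1.0 + rng.random((n_freqs, n_frames))) * np.exp(
    1j * rng.uniform(0, 2 * np.pi, (n_freqs, n_frames)))
transfer = (0.5 + rng.random((n_mics, n_freqs))) * np.exp(
    1j * rng.uniform(0, 2 * np.pi, (n_mics, n_freqs)))

reference_stft = transfer[:, :, None] * source[None, :, :]
rtf = instantaneous_rtf(reference_stft)

# In this noiseless toy model the ratio recovers H_m(f) / H_ref(f)
# in every frame.
true_rtf = transfer / transfer[0]
print(np.max(np.abs(rtf - true_rtf[:, :, None])))
```

In this idealized single-source case the ratio is the same in every frame. With real reverberant recordings it varies from frame to frame, which is why it is called instantaneous.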
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
