End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions

Aviad Eisenberg; Sharon Gannot; Shlomo E. Chazan

arXiv:2502.06285·cs.SD·February 11, 2025

Open Access

TL;DR

This paper presents a multi-microphone speaker extraction method that uses the instantaneous relative transfer function (RTF) as a spatial cue. In reverberant environments, it outperforms both a direction of arrival (DOA)-based spatial cue and a spectral embedding.

Contribution

It introduces a novel RTF-based spatial cue for end-to-end multi-microphone speaker extraction, demonstrating superior performance over DOA and spectral embedding methods.

Findings

01. RTF-based cue outperforms DOA-based cue in experiments.

02. Using spatial cues improves extraction performance.

03. RTF-based method is effective in reverberant environments.

Abstract

This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. In this work, we propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded in the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with direction of arrival (DOA)-based spatial cue and the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that using spatial cues yields better performance than the spectral-based cue and that the instantaneous RTF outperforms the DOA-based spatial cue.
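
The abstract does not say how the instantaneous RTF is estimated, so the following is only a minimal sketch under one common assumption: the RTF is taken per time-frequency bin as the ratio between each microphone's STFT and that of a reference microphone, computed on the reference utterance. The function names (`stft`, `instantaneous_rtf`), the frame settings, and the regularizer `eps` are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def stft(x, frame_len=256, hop=128):
    """Hann-windowed STFT of a 1-D signal, shape (n_frames, n_bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def instantaneous_rtf(mics, ref_mic=0, eps=1e-8):
    """Per-bin ratio of each microphone's STFT to the reference microphone's.

    mics: array of shape (n_mics, n_samples) holding the reference utterance.
    Returns a complex array of shape (n_mics, n_frames, n_bins).
    """
    spec = np.stack([stft(ch) for ch in mics])
    ref = spec[ref_mic]
    # Regularized division X_m / X_ref, written as
    # X_m * conj(X_ref) / (|X_ref|^2 + eps) to avoid dividing by zero.
    return spec * np.conj(ref) / (np.abs(ref) ** 2 + eps)


# Toy check (assumed setup): the second microphone is the first scaled by 0.5,
# so its RTF magnitude should sit near 0.5 in every bin.
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
rtf = instantaneous_rtf(np.stack([s, 0.5 * s]))
```

In this toy case the ratio is a constant gain. With real room recordings the ratio would vary across frequency and frames, since it reflects the relative acoustic paths to each microphone rather than a single angle, which is presumably what makes it a richer cue than a DOA.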

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis