Distance Based Single-Channel Target Speech Extraction
Runwu Shi, Benjamin Yen, Kazuhiro Nakadai

TL;DR
This paper introduces a single-channel target speech extraction method that uses only distance information, with no speaker physiological information. It is shown to work in both single-room and multi-room scenarios and can also estimate the distances of speakers in mixed speech.
Contribution
The authors present it as the first work to use only distance cues for single-channel target speech extraction, with a model that fuses distance information with time-frequency (TF) bins.
Findings
Effective in single-room and multi-room scenarios
Capable of estimating speaker distances in mixed speech
Demonstrates feasibility and effectiveness of the approach
Abstract
This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues without using speaker physiological information for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. This method can also be employed to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.
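The abstract does not spell out how distance information is fused with TF bins, so the following is only an illustrative sketch of one generic way such conditioning could look, not the authors' architecture. It computes a magnitude spectrogram of a mixture and attaches the same distance encoding to every TF bin. The function names, the sinusoidal encoding, the 10 m normalization range, and the STFT settings are all assumptions made for this example.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Magnitude STFT from Hann-windowed frames; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def distance_embedding(distance_m, dim=8, max_distance_m=10.0):
    """Hypothetical sinusoidal encoding of a query distance in metres."""
    d = distance_m / max_distance_m  # assumed normalization range
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(np.pi * freqs * d), np.cos(np.pi * freqs * d)])

def fuse(tf_mag, dist_emb):
    """Attach the distance embedding to every TF bin: (T, F) -> (T, F, 1 + dim)."""
    t, f = tf_mag.shape
    tiled = np.broadcast_to(dist_emb, (t, f, dist_emb.shape[0]))
    return np.concatenate([tf_mag[..., None], tiled], axis=-1)

# Stand-in for a single-channel mixture (random noise, not real speech).
rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)

spec = stft_mag(mixture)
fused = fuse(spec, distance_embedding(1.5))  # query: a source about 1.5 m away
print(spec.shape, fused.shape)
```

In a learned system, a tensor like `fused` would typically be the input to a network that predicts a mask or the target spectrogram for the queried distance. Concatenation is only one option; feature-wise scaling of the TF representation by a projected distance embedding is another common conditioning choice.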
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
