Distance Based Single-Channel Target Speech Extraction

Runwu Shi; Benjamin Yen; Kazuhiro Nakadai

arXiv:2412.20144 · eess.AS · December 31, 2024

TL;DR

This paper introduces a single-channel target speech extraction (TSE) method that uses only distance information, with no speaker physiological information. The method is evaluated in single-room and multi-room scenarios and can also estimate the distances of different speakers in mixed speech.

Contribution

According to the authors, this is the first work to use only distance cues for single-channel TSE. The proposed model fuses distance information with time-frequency (TF) bins to extract the target speech.

Findings

01 Effective in single-room and multi-room scenarios

02 Capable of estimating speaker distances in mixed speech

03 Demonstrates feasibility and effectiveness of the approach

Abstract

This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues without using speaker physiological information for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. This method can also be employed to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.
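The abstract does not spell out how distance information is fused with TF bins. As a rough illustration only, the sketch below shows one generic way a scalar query distance could condition a TF mask: the distance is encoded as a small embedding, projected to a per-frequency scale and shift, and used to compute a mask over a magnitude spectrogram. Everything here is an assumption for illustration and is not taken from the paper: the sinusoidal encoding, the scale-and-shift conditioning, the function names (`distance_embedding`, `extract`, `random_params`), and the untrained random weights.

```python
import math
import random


def distance_embedding(distance_m, dim=8):
    # Hypothetical sinusoidal encoding of the query distance (metres).
    emb = []
    for k in range(dim // 2):
        freq = 2.0 ** k
        emb.append(math.sin(freq * distance_m))
        emb.append(math.cos(freq * distance_m))
    return emb


def linear(vec, weights, bias):
    # Dense layer: one weight row and one bias per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_params(n_freq, dim=8, seed=0):
    # Untrained placeholder weights; a real model would learn these.
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                for _ in range(n_freq)]

    def vector():
        return [rng.uniform(-1.0, 1.0) for _ in range(n_freq)]

    return {"dim": dim,
            "w_scale": matrix(), "b_scale": vector(),
            "w_shift": matrix(), "b_shift": vector()}


def extract(mix_mag, distance_m, params):
    # mix_mag: list of frames, each a list of non-negative magnitudes.
    emb = distance_embedding(distance_m, params["dim"])
    scale = linear(emb, params["w_scale"], params["b_scale"])
    shift = linear(emb, params["w_shift"], params["b_shift"])
    out = []
    for frame in mix_mag:
        # Distance-conditioned mask in (0, 1) for every TF bin.
        mask = [sigmoid(s * x + h) for s, x, h in zip(scale, frame, shift)]
        out.append([m * x for m, x in zip(mask, frame)])
    return out


# Toy mixture: 2 frames x 3 frequency bins.
mix = [[0.5, 1.0, 0.2], [0.1, 0.3, 0.9]]
params = random_params(n_freq=3)
near = extract(mix, 1.0, params)  # query a speaker at about 1 m
far = extract(mix, 3.0, params)   # query a speaker at about 3 m
```

With trained weights, changing only the query distance would select a different speaker from the same mixture; with the random weights used here, the two outputs differ but carry no meaning.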

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis