Wanna hear your voice? A sample is all we need!

The Hieu Pham; Phuong Thanh Tran Nguyen; Xuan Tho Nguyen; Tan Dat Nguyen; Duc Dung Nguyen

arXiv:2410.00527 · eess.AS · June 10, 2025

TL;DR

This paper introduces WHYV, a novel cross-lingual target speaker extraction framework that achieves state-of-the-art zero-shot performance across multiple languages without fine-tuning, addressing low-resource language challenges.

Contribution

The paper presents WHYV, a zero-shot cross-lingual TSE model with a frequency-modulated gating mechanism, enabling effective speaker extraction without language-specific training.

Findings

01 Achieves 13.8 dB on Libri2Mix mix-both

02 Reaches 18.1 dB on Libri2Mix mix-clean

03 Attains 14.8 dB on Vietnamese data

Abstract

Research on audio clue-based target speaker extraction (TSE) has focused on modeling mixtures and reference speech, achieving strong results in English due to abundant datasets. However, cross-lingual properties remain underexplored, as low-resource languages face challenges from limited annotated data and linguistic resources. To bridge this gap, we propose WHYV (Wanna Hear Your Voice), a cross-lingual TSE framework enabling zero-shot adaptation without fine-tuning. WHYV employs a frequency-modulated gating mechanism that dynamically adjusts the acoustic features of the target speaker, minimizing reliance on language-specific cues. Evaluations demonstrate state-of-the-art zero-shot performance: 13.8 dB (Libri2Mix mix-both), 18.1 dB (mix-clean), and 14.8 dB on Vietnamese data.
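The abstract names a frequency-modulated gating mechanism but does not specify it. As a rough illustration of the general idea only, and not the WHYV architecture, the sketch below computes one sigmoid gate per frequency bin from a speaker embedding and uses it to scale the mixture features. The function name `frequency_gate`, the tensor layout `(F, T, C)`, and the parameters `w_gate` and `b_gate` are all hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def frequency_gate(mixture_feats, speaker_emb, w_gate, b_gate):
    """Scale mixture features with a speaker-conditioned gate per frequency bin.

    Hypothetical shapes (assumptions, not taken from the paper):
      mixture_feats: (F, T, C)  frequency bins x time frames x channels
      speaker_emb:   (D,)       embedding of the reference utterance
      w_gate:        (F, D, C)  one projection per frequency bin
      b_gate:        (F, C)     one bias per frequency bin
    """
    # Project the speaker embedding separately for every frequency bin,
    # then squash to (0, 1) so the result acts as a soft gate.
    gate = sigmoid(np.einsum("fdc,d->fc", w_gate, speaker_emb) + b_gate)
    # Broadcast the (F, C) gate across the time axis.
    return mixture_feats * gate[:, None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F, T, C, D = 8, 10, 4, 16
    feats = rng.standard_normal((F, T, C))
    emb = rng.standard_normal(D)
    w = 0.1 * rng.standard_normal((F, D, C))
    b = np.zeros((F, C))
    print(frequency_gate(feats, emb, w, b).shape)
```

Because the gate depends only on the speaker embedding and the frequency index, it carries no explicit linguistic information, which is one plausible reading of "minimizing reliance on language-specific cues". In a trained model the projection weights would be learned rather than drawn at random.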

Taxonomy

Topics: Speech Recognition and Synthesis