Leveraging Language Information for Target Language Extraction
Mehmet Sinan Yıldırım, Ruijie Tao, Wupeng Wang, Junyi Ao, Haizhou Li

TL;DR
This paper introduces an end-to-end framework that uses language knowledge from pre-trained speech models to improve target language extraction from multilingual audio mixtures, reporting an SI-SNR improvement of over 1.2 dB for English and German extraction.
Contribution
It proposes a novel approach that uses language knowledge from speech pre-trained models to guide the extraction model and improve extraction quality, and it provides the first publicly available multilingual dataset for this task.
Findings
Achieves over 1.2 dB SI-SNR improvement for English and German extraction.
Constructs the first publicly available multilingual dataset for Target Language Extraction.
Demonstrates the effectiveness of language knowledge guidance in speech extraction.
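The SI-SNR figure quoted above can be made concrete with a small sketch. It assumes the commonly used definition of scale-invariant signal-to-noise ratio (zero-mean both signals, project the estimate onto the reference, and compare the energy of the projection with the energy of the residual). The summary does not state the exact evaluation protocol, nor whether the reported gain is measured against the input mixture or against a baseline system, so the function name `si_snr` and the toy sine-wave signals below are illustrative assumptions, not the authors' code.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB (common definition; assumed, not taken from the paper)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean of each signal.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Everything that is not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Toy signals (hypothetical): two sine waves stand in for two speakers.
N = 200
reference = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interferer = [math.sin(2 * math.pi * 13 * t / N) for t in range(N)]
mixture = [r + i for r, i in zip(reference, interferer)]
# An imperfect "extraction" that still leaks 10% of the interferer.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

gain = si_snr(estimate, reference) - si_snr(mixture, reference)
print(f"SI-SNR of mixture:  {si_snr(mixture, reference):.2f} dB")
print(f"SI-SNR of estimate: {si_snr(estimate, reference):.2f} dB")
print(f"SI-SNR gain:        {gain:.2f} dB")
```

Because the metric is scale-invariant, multiplying `estimate` by any non-zero constant leaves the score essentially unchanged, which is why it is often preferred over plain SNR for evaluating extraction and separation systems.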
Abstract
Target Language Extraction aims to extract speech in a specific language from a mixture waveform that contains multiple speakers speaking different languages. The human auditory system is adept at performing this task with the knowledge of the particular language. However, the performance of the conventional extraction systems is limited by the lack of this prior knowledge. Speech pre-trained models, which capture rich linguistic and phonetic representations from large-scale in-the-wild corpora, can provide this missing language knowledge to these systems. In this work, we propose a novel end-to-end framework to leverage language knowledge from speech pre-trained models. This knowledge is used to guide the extraction model to better capture the target language characteristics, thereby improving extraction quality. To demonstrate the effectiveness of our proposed approach, we construct…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
