Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech

Youngjae Kim; Yejin Jeon; Gary Geunbae Lee

arXiv:2409.18622 · cs.SD · September 30, 2024

Open Access

TL;DR

This paper introduces a novel audio-based linguistic feature extraction method that improves multi-lingual and low-resource text-to-speech systems by better capturing language representations and enabling unseen language synthesis.

Contribution

It presents a new technique for extracting linguistic features directly from audio, enhancing multi-lingual TTS and low-resource language transfer learning.

Findings

01. Effective in multi-lingual TTS applications

02. Superior performance in low-resource transfer learning

03. Outperforms existing methods in unseen language synthesis

Abstract

The difficulty of acquiring abundant, high-quality data, especially in multi-lingual contexts, has sparked interest in addressing low-resource scenarios. Moreover, the current literature relies on fixed expressions from language IDs, which results in inadequate learning of language representations and a failure to generate speech in unseen languages. To address these challenges, we propose a novel method that directly extracts linguistic features from audio input while effectively filtering out miscellaneous acoustic information, including speaker-specific attributes such as timbre. Subjective and objective evaluations affirm the effectiveness of our approach for multi-lingual text-to-speech and highlight its superiority in low-resource transfer learning for previously unseen languages.
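
The contrast the abstract draws between fixed language-ID conditioning and audio-derived conditioning can be illustrated with a toy sketch. This is not the authors' architecture: every name here (`LANG_ID_TABLE`, `language_id_embedding`, `audio_language_embedding`, `PROJECTION`) is hypothetical, the "encoder" is just a random linear projection of time-averaged frame features, and no speaker filtering is attempted. It only shows why a lookup table has nothing to return for an unseen language, while a function of the audio itself always produces a vector.

```python
import random

EMBED_DIM = 4    # size of the conditioning vector (assumed for illustration)
FEATURE_DIM = 3  # size of each per-frame acoustic feature (assumed)

random.seed(0)

# Fixed language-ID conditioning: one vector per language seen in training.
LANG_ID_TABLE = {
    lang: [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for lang in ("en", "ko", "de")
}


def language_id_embedding(lang):
    """Return the fixed vector for a training language.

    Raises KeyError for a language that was not in the training set,
    which is the limitation the abstract points to.
    """
    return LANG_ID_TABLE[lang]


# Hypothetical stand-in for an audio-based encoder: a fixed linear
# projection applied to the time-averaged frame features.
PROJECTION = [
    [random.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
    for _ in range(EMBED_DIM)
]


def audio_language_embedding(frames):
    """Map a non-empty list of per-frame feature vectors to one vector."""
    n = len(frames)
    mean = [sum(frame[i] for frame in frames) / n for i in range(FEATURE_DIM)]
    return [sum(w * x for w, x in zip(row, mean)) for row in PROJECTION]


if __name__ == "__main__":
    print(language_id_embedding("ko"))   # works: "ko" is in the table
    unseen_audio = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]
    print(audio_language_embedding(unseen_audio))  # works for any audio
    try:
        language_id_embedding("sw")      # "sw" was never seen
    except KeyError:
        print("no fixed embedding for unseen language 'sw'")
```

In a real system the projection would be a trained network, and the training objective would have to discourage the embedding from encoding speaker identity; the sketch leaves both out.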

Taxonomy

Topics: Speech Recognition and Synthesis