Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech
Youngjae Kim, Yejin Jeon, Gary Geunbae Lee

TL;DR
This paper introduces a novel audio-based linguistic feature extraction method that improves multi-lingual and low-resource text-to-speech systems by better capturing language representations and enabling unseen language synthesis.
Contribution
It presents a new technique for extracting linguistic features directly from audio, enhancing multi-lingual TTS and low-resource language transfer learning.
Findings
Effective in multi-lingual TTS applications
Superior performance in low-resource transfer learning
Outperforms existing methods in unseen language synthesis
Abstract
The difficulty of acquiring abundant, high-quality data, especially in multi-lingual contexts, has sparked interest in addressing low-resource scenarios. Moreover, current literature relies on fixed expressions from language IDs, which results in inadequate learning of language representations and a failure to generate speech in unseen languages. To address these challenges, we propose a novel method that directly extracts linguistic features from audio input while effectively filtering out miscellaneous acoustic information, including speaker-specific attributes like timbre. Subjective and objective evaluations affirm the effectiveness of our approach for multi-lingual text-to-speech, and highlight its superiority in low-resource transfer learning for previously unseen languages.
Taxonomy
Topics: Speech Recognition and Synthesis
