The Academia Sinica Systems of Voice Conversion for VCC2020

Yu-Huai Peng; Cheng-Hung Hu; Alexander Kang; Hung-Shin Lee; Pin-Yuan Chen; Yu Tsao; Hsin-Min Wang

arXiv:2010.02669 · eess.AS · October 7, 2020

TL;DR

This paper presents the Academia Sinica voice conversion systems for VCC2020, which follow a cascaded ASR+TTS approach with phonetic tokens as the TTS input. In the listening test, the systems performed well in both the intra-lingual and cross-lingual tasks.

Contribution

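The paper describes a cascaded ASR+TTS framework that feeds phonetic tokens, rather than text or characters, to the TTS model: IPA symbols for intra-lingual conversion (Task 1) and unsupervised VQVAE-derived phonetic symbols for cross-lingual conversion (Task 2).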

Findings

01

The systems performed well in the VCC2020 listening test.

02

IPA (Task 1) and VQVAE-derived phonetic tokens (Task 2) replaced text or characters as the TTS input.

03

Achieved competitive results in both tasks.

Abstract

This paper describes the Academia Sinica systems for the two tasks of the Voice Conversion Challenge 2020 (VCC2020), namely voice conversion within the same language (Task 1) and cross-lingual voice conversion (Task 2). For both tasks, we followed the cascaded ASR+TTS structure, using phonetic tokens as the TTS input instead of text or characters. For Task 1, we used the International Phonetic Alphabet (IPA) as the input of the TTS model. For Task 2, we used unsupervised phonetic symbols extracted by the vector-quantized variational autoencoder (VQVAE). In the evaluation, the listening test showed that our systems performed well in the VCC2020 challenge.
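
To make the Task 2 idea more concrete, the following is a minimal sketch of the quantization step at the heart of a VQVAE: each continuous frame is replaced by the index of its nearest codebook vector, and the resulting index sequence can play the role of unsupervised phonetic symbols. This is an illustration under stated assumptions, not the authors' implementation. The function names, the 2-dimensional frames, and the 3-entry codebook are all hypothetical toy values; a real system would use a learned encoder and a learned codebook.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector.

    The returned indices are the discrete tokens. In a trained VQVAE the
    frames would come from the encoder and the codebook would be learned;
    here both are supplied by the caller.
    """
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy values, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8], [1.1, 0.9]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

In the cascaded structure the abstract describes, a token sequence of this kind would stand in for text or characters as the TTS input.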

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques