Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion

Giuseppe Ruggiero; Matteo Testa; Jurgen Van de Walle; Luigi Di Caro

arXiv:2409.17387·cs.SD·September 27, 2024

TL;DR

This paper presents a cross-lingual any-to-one voice conversion system, built on self-supervised learning and a cross-lingual fine-tuning strategy, that creates native-sounding polyglot voices. It preserves the source accent without requiring multilingual data from the target speaker and improves the intelligibility and overall quality of the converted speech.

Contribution

Introduces a cross-lingual any-to-one voice conversion system together with a cross-lingual fine-tuning strategy that further improves the accent and reduces training data requirements, improving on state-of-the-art methods in objective and subjective evaluations.

Findings

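1. Improved speech intelligibility and overall quality of the converted speech, especially in cross-lingual scenarios

2. Source accent preserved without multilingual data from the target speaker

3. Improvement over state-of-the-art methods confirmed by objective and subjective evaluations in English, Spanish, French and Mandarin Chinese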

Abstract

The creation of artificial polyglot voices remains a challenging task, despite considerable progress in recent years. This paper investigates self-supervised learning for voice conversion to create native-sounding polyglot voices. We introduce a novel cross-lingual any-to-one voice conversion system that is able to preserve the source accent without the need for multilingual data from the target speaker. In addition, we show a novel cross-lingual fine-tuning strategy that further improves the accent and reduces the training data requirements. Objective and subjective evaluations with English, Spanish, French and Mandarin Chinese confirm that our approach improves on state-of-the-art methods, enhancing the speech intelligibility and overall quality of the converted speech, especially in cross-lingual scenarios. Audio samples are available at…

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing