Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

Siyuan Shan; Yang Li; Amartya Banerjee; Junier B. Oliva

arXiv:2308.06382 · cs.SD · January 2, 2024

TL;DR

Phoneme Hallucinator is a one-shot voice conversion model that generates diverse, high-fidelity target-speaker phonemes from only a few seconds of target speech. This enables effective, versatile voice conversion without text transcriptions or large amounts of target-speaker data.

Contribution

It introduces a phoneme hallucination approach for one-shot voice conversion that achieves both high intelligibility and high speaker similarity without text annotations.

Findings

01 Outperforms existing VC methods in both intelligibility and speaker similarity

02 Needs only a short target-speaker recording (e.g., 3 seconds)

03 Supports text-free, any-to-any voice conversion

Abstract

Voice conversion (VC) aims to alter a person's voice so that it sounds like another person's voice while preserving the linguistic content. Existing methods face a dilemma between content intelligibility and speaker similarity: methods with higher intelligibility usually have lower speaker similarity, while methods with higher speaker similarity usually require a large amount of target-speaker voice data to achieve high intelligibility. In this work, we propose a novel method, Phoneme Hallucinator, that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target-speaker phonemes based on just a short target-speaker voice sample (e.g., 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model…
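The abstract only names the conversion step as "neighbor-based". As a rough illustration, the following is a minimal sketch that assumes a kNN-style matcher: each source frame's feature vector is replaced by the mean of its k most cosine-similar frames in the target-speaker set, and that set is first enlarged with hallucinated frames. The names `knn_convert` and `expand_matching_set`, the default `k=4`, and the toy 2-D features are hypothetical choices for illustration, not the authors' code. The hallucination model itself is not shown; `hallucinated` is a placeholder list supplied by the caller, and a vocoder (also not shown) would turn the converted features back into audio.

```python
import math
from typing import List, Sequence

Frame = List[float]  # one feature vector per analysis frame (assumed representation)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_matching_set(real: List[Frame], hallucinated: List[Frame]) -> List[Frame]:
    """Hypothetical helper: pool real target frames with generated ones."""
    return real + hallucinated


def knn_convert(source: List[Frame], matching_set: List[Frame], k: int = 4) -> List[Frame]:
    """Replace each source frame by the mean of its k most similar matching-set frames."""
    if not matching_set:
        raise ValueError("matching_set must contain at least one frame")
    converted = []
    for query in source:
        # Rank candidate target frames by cosine similarity to this source frame.
        ranked = sorted(matching_set, key=lambda t: cosine_similarity(query, t), reverse=True)
        neighbors = ranked[: min(k, len(ranked))]
        dim = len(query)
        # Element-wise mean of the selected neighbors.
        converted.append([sum(n[d] for n in neighbors) / len(neighbors) for d in range(dim)])
    return converted


# Toy usage with 2-D "features" (real systems would use high-dimensional speech features).
target = [[2.0, 0.1], [1.0, 1.0]]   # frames from the short target recording
hallucinated = [[0.1, 2.0]]         # placeholder for the set-expansion model's output
pool = expand_matching_set(target, hallucinated)
print(knn_convert([[1.0, 0.0], [0.0, 1.0]], pool, k=1))  # → [[2.0, 0.1], [0.1, 2.0]]
print(knn_convert([[0.0, 1.0]], target, k=1))            # → [[1.0, 1.0]]
```

The two calls illustrate the intuition under these assumptions: with only the short real recording, the source frame `[0.0, 1.0]` has no close match and is mapped to a poor neighbor, whereas the expanded set supplies a nearer one. This is one plausible reading of why hallucinating extra target-speaker phonemes helps intelligibility when target data is scarce.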

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Voice and Speech Disorders