Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion
Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva

TL;DR
Phoneme Hallucinator is a one-shot voice conversion model that hallucinates diverse, high-fidelity target-speaker phonemes from a short target utterance (e.g., 3 seconds), enabling text-free, any-to-any voice conversion without extensive target speaker data.
Contribution
It introduces a novel phoneme hallucination approach for one-shot voice conversion, balancing intelligibility and speaker similarity without text annotations.
Findings
Outperforms existing VC methods in intelligibility and speaker similarity
Operates with only 3 seconds of target speaker voice data
Supports any-to-any voice conversion without text annotations
Abstract
Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method, Phoneme Hallucinator, that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g., 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model…
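The neighbor-based conversion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that speech is represented as per-frame feature vectors and that each source frame is replaced by the mean of its k nearest target-speaker frames under cosine similarity. The function names, the feature dimension, the frame counts, and the value of k are all illustrative assumptions. In particular, `expand_target_set` is a hypothetical stand-in that only resamples and jitters existing frames; it is not the paper's hallucination model, and it is there only to show where an expanded target set would enter the pipeline.

```python
import numpy as np


def knn_convert(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    Nearness is measured by cosine similarity. Shapes: (n_src, d) and (n_tgt, d).
    """
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sim = src @ tgt.T                      # (n_src, n_tgt) cosine similarities
    k = min(k, target_feats.shape[0])      # cannot use more neighbors than frames
    # Indices of the k most similar target frames for every source frame.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    return target_feats[idx].mean(axis=1)  # (n_src, d)


def expand_target_set(target_feats, n_new, rng, noise_scale=0.05):
    """Hypothetical placeholder for set expansion (NOT the paper's model).

    Resamples existing target frames and adds small Gaussian noise, then
    appends the new frames to the original set.
    """
    picks = rng.integers(0, target_feats.shape[0], size=n_new)
    noise = rng.normal(scale=noise_scale, size=(n_new, target_feats.shape[1]))
    return np.concatenate([target_feats, target_feats[picks] + noise], axis=0)


# Toy data: random vectors stand in for real speech features.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 16))   # frames of the source utterance
target = rng.normal(size=(150, 16))   # a short target utterance (assumed frame count)

expanded = expand_target_set(target, n_new=1000, rng=rng)
converted = knn_convert(source, expanded, k=4)
```

In a full system the converted features would then be passed to a vocoder to produce a waveform; that stage is omitted here. The point of the sketch is the design choice: with only a few seconds of target speech the neighbor pool is small, so enlarging it gives each source frame closer matches to average over.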
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Voice and Speech Disorders
