Personalized Voice Synthesis through Human-in-the-Loop Coordinate Descent
Yusheng Tian, Junbin Liu, Tan Lee

TL;DR
This paper introduces a human-in-the-loop coordinate descent method for personalized voice synthesis that enables vocally disabled individuals to restore their voices without prior recordings by iteratively refining speaker embeddings guided by perception.
Contribution
It presents a novel iterative approach leveraging perceptually meaningful speaker embeddings for personalized voice synthesis without reference speech data.
Findings
Effective in approximating target voices across diverse cases
Embeddings correspond to perceptual voice attributes
User-guided refinement improves synthesis quality
Abstract
This paper describes a human-in-the-loop approach to personalized voice synthesis in the absence of reference speech data from the target speaker. It is intended to help vocally disabled individuals restore their lost voices without requiring any prior recordings. The proposed approach leverages a learned speaker embedding space. Starting from an initial voice, users iteratively refine the speaker embedding parameters through a coordinate descent-like process, guided by auditory perception. Analysis of the latent space shows that the embedding parameters correspond to perceptual voice attributes, including pitch, vocal tension, brightness, and nasality, making the search process intuitive. Computer simulations and real-world user studies demonstrate that the proposed approach is effective in approximating target voices across a diverse range of test cases.
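The coordinate descent-like search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three-candidate comparison per coordinate, the step-halving schedule, and the simulated listener are all assumptions made for illustration. In a real session, the `choose` callback would synthesize audio for each candidate embedding and ask the user which one sounds closest to the voice they remember.

```python
import numpy as np


def simulated_listener(target):
    """Hypothetical stand-in for a human: picks the candidate nearest to a known target."""
    target = np.asarray(target, dtype=float)

    def choose(candidates):
        distances = [np.linalg.norm(c - target) for c in candidates]
        return int(np.argmin(distances))

    return choose


def coordinate_search(start, choose, step=1.0, sweeps=5, shrink=0.5):
    """Refine an embedding one coordinate at a time using only preference feedback.

    For each coordinate, three candidates are offered (decrease, keep, increase)
    and the chooser's pick becomes the new current embedding. The step size is
    reduced after every full sweep (an assumed schedule).
    """
    current = np.asarray(start, dtype=float).copy()
    for _ in range(sweeps):
        for i in range(current.size):
            candidates = []
            for delta in (-step, 0.0, step):
                candidate = current.copy()
                candidate[i] += delta
                candidates.append(candidate)
            current = candidates[choose(candidates)]
        step *= shrink
    return current


if __name__ == "__main__":
    # Toy 4-dimensional embedding; the values are arbitrary illustration data.
    target = np.array([0.7, -1.2, 0.3, 2.0])
    start = np.zeros(4)
    result = coordinate_search(start, simulated_listener(target))
    print("distance before:", np.linalg.norm(start - target))
    print("distance after: ", np.linalg.norm(result - target))
```

Because each query changes a single coordinate, the listener only has to judge one perceptual attribute at a time, which is what makes the search tractable for a human when the coordinates align with attributes such as pitch or nasality.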
Taxonomy
Topics: Speech and dialogue systems
Methods: Sparse Evolutionary Training
