Personalized Voice Synthesis through Human-in-the-Loop Coordinate Descent

Yusheng Tian; Junbin Liu; Tan Lee

arXiv:2408.17068·eess.AS·May 27, 2025

Open Access

TL;DR

This paper introduces a human-in-the-loop coordinate descent method for personalized voice synthesis that enables vocally disabled individuals to restore their voices without prior recordings by iteratively refining speaker embeddings guided by perception.

Contribution

It presents a novel iterative approach leveraging perceptually meaningful speaker embeddings for personalized voice synthesis without reference speech data.

Findings

01 Effective in approximating target voices across diverse cases

02 Embeddings correspond to perceptual voice attributes

03 User-guided refinement improves synthesis quality

Abstract

This paper describes a human-in-the-loop approach to personalized voice synthesis in the absence of reference speech data from the target speaker. It is intended to help vocally disabled individuals restore their lost voices without requiring any prior recordings. The proposed approach leverages a learned speaker embedding space. Starting from an initial voice, users iteratively refine the speaker embedding parameters through a coordinate descent-like process, guided by auditory perception. Analysis of the latent space shows that the embedding parameters correspond to perceptual voice attributes, including pitch, vocal tension, brightness, and nasality, which makes the search process intuitive. Computer simulations and real-world user studies demonstrate that the proposed approach is effective in approximating target voices across a diverse range of test cases.
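
The coordinate descent-like search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the four-dimensional embedding, the value grid, and the simulated listener are all assumptions made for illustration. The listener here is a stand-in that picks the candidate closest (in Euclidean distance) to a hidden target embedding; in the real setting a person would listen to synthesized audio for each candidate and pick the one that sounds closest to the voice they want.

```python
import math


def coordinate_descent_search(initial, choose, grid, n_sweeps=3):
    """Refine an embedding one coordinate at a time (illustrative sketch).

    initial:  list of floats, the starting speaker embedding (hypothetical layout).
    choose:   callable that takes a list of candidate embeddings and returns the
              index of the preferred one; it stands in for the human listener.
    grid:     values tried for whichever coordinate is currently active.
    n_sweeps: number of passes over all coordinates.
    """
    current = list(initial)
    for _ in range(n_sweeps):
        for dim in range(len(current)):
            # Vary only the active coordinate; all others stay fixed.
            candidates = [current[:dim] + [value] + current[dim + 1:] for value in grid]
            # Offer the unchanged embedding too, so the listener can keep it.
            candidates.append(list(current))
            current = candidates[choose(candidates)]
    return current


def make_simulated_listener(target):
    """Hypothetical listener: prefers the candidate nearest a hidden target."""
    def choose(candidates):
        distances = [math.dist(candidate, target) for candidate in candidates]
        return distances.index(min(distances))
    return choose


# Assumed toy setup: a 4-dimensional embedding and a grid with step 0.5.
target = [0.5, -1.0, 1.5, 0.0]
grid = [-2.0 + 0.5 * i for i in range(9)]
start = [0.0, 0.0, 0.0, 0.0]

result = coordinate_descent_search(start, make_simulated_listener(target), grid)
print(result)
```

Because the simulated preference is a separable distance and the target lies on the grid, this toy search reaches the target within a single sweep. A human listener is noisier and less consistent, so a practical system would likely need repeated sweeps and a way to revisit earlier choices.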

Taxonomy

Topics: Speech and dialogue systems

Methods: Sparse Evolutionary Training