P2VA: Converting Persona Descriptions into Voice Attributes for Fair and Controllable Text-to-Speech
Yejin Lee, Jaehoon Kang, Kyuhong Shim

TL;DR
This paper introduces P2VA, a novel framework that automatically converts persona descriptions into voice attributes for fair and controllable text-to-speech, addressing usability gaps in voice personalization.
Contribution
To the authors' knowledge, P2VA is the first framework to link persona descriptions directly to voice synthesis. It uses two strategies: P2VA-C, which extracts structured voice attributes, and P2VA-O, which produces richer style descriptions.
Findings
P2VA-C reduces WER by 5%.
P2VA-C improves MOS by 0.33 points.
Current LLMs embed societal biases in voice attributes.
Abstract
While persona-driven large language models (LLMs) and prompt-based text-to-speech (TTS) systems have advanced significantly, a usability gap arises when users attempt to generate voices matching their desired personas from implicit descriptions. Most users lack specialized knowledge to specify detailed voice attributes, which often leads TTS systems to misinterpret their expectations. To address these gaps, we introduce Persona-to-Voice-Attribute (P2VA), the first framework enabling voice generation automatically from persona descriptions. Our approach employs two strategies: P2VA-C for structured voice attributes, and P2VA-O for richer style descriptions. Evaluation shows our P2VA-C reduces WER by 5% and improves MOS by 0.33 points. To the best of our knowledge, P2VA is the first framework to establish a connection between persona and voice synthesis. In addition, we discover that…
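The abstract reports results in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. This is not the paper's evaluation pipeline, and the summary does not say whether the reported 5% reduction is absolute or relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Illustrative helper only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

In a TTS evaluation, the hypothesis is typically the output of a speech recognizer run on the synthesized audio, so a lower WER indicates more intelligible speech.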
Taxonomy
Topics: Persona Design and Applications · Speech and dialogue systems
