P2VA: Converting Persona Descriptions into Voice Attributes for Fair and Controllable Text-to-Speech

Yejin Lee; Jaehoon Kang; Kyuhong Shim

arXiv:2505.17093 · eess.AS · September 22, 2025

Open Access

TL;DR

This paper introduces P2VA, a novel framework that automatically converts persona descriptions into voice attributes for fair and controllable text-to-speech, addressing usability gaps in voice personalization.

Contribution

P2VA is the first framework to link persona descriptions directly to voice synthesis. It uses two strategies: P2VA-C, which extracts structured voice attributes, and P2VA-O, which produces richer style descriptions.

Findings

01. P2VA-C reduces word error rate (WER) by 5%.

02. P2VA-C improves mean opinion score (MOS) by 0.33 points.

03. Current LLMs embed societal biases in voice attributes.

Abstract

While persona-driven large language models (LLMs) and prompt-based text-to-speech (TTS) systems have advanced significantly, a usability gap arises when users attempt to generate voices matching their desired personas from implicit descriptions. Most users lack specialized knowledge to specify detailed voice attributes, which often leads TTS systems to misinterpret their expectations. To address these gaps, we introduce Persona-to-Voice-Attribute (P2VA), the first framework enabling voice generation automatically from persona descriptions. Our approach employs two strategies: P2VA-C for structured voice attributes, and P2VA-O for richer style descriptions. Evaluation shows our P2VA-C reduces WER by 5% and improves MOS by 0.33 points. To the best of our knowledge, P2VA is the first framework to establish a connection between persona and voice synthesis. In addition, we discover that…
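
The abstract reports results in terms of WER and MOS without saying how they were computed, or whether the 5% WER reduction is absolute or relative. As a rough illustration only, the sketch below uses the standard definitions: WER as word-level edit distance divided by the number of reference words, and MOS as the average of listener ratings. The function names and the 1-to-5 rating scale are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_opinion_score(ratings: list[int]) -> float:
    """Average listener rating (assumed here to be on a 1-to-5 scale)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return float(mean(ratings))
```

In practice, WER for a TTS system is typically measured by transcribing the synthesized audio with a speech recognizer and comparing the transcript against the input text, so the absolute numbers depend on the recognizer chosen.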

Taxonomy

Topics: Persona Design and Applications · Speech and Dialogue Systems

MethodsFocus