In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion
Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu

TL;DR
This paper introduces TES-VC, a text-driven voice conversion framework that controls speaker timbre and environmental acoustics independently, producing speech that matches the described voice and environment while preserving the source content.
Contribution
The paper presents a new framework that enables independent control of speaker and environment attributes in speech conversion using text inputs and latent diffusion modeling.
Findings
Effective generation of speech matching target environment and timbre
High content retention in converted speech
Superior controllability over speech attributes
Abstract
We propose TES-VC (Text-driven Environment and Speaker controllable Voice Conversion), a text-driven voice conversion framework with independent control of speaker timbre and environmental acoustics. TES-VC takes two text inputs at once, one describing the target voice and one describing the target environment, and generates speech that matches the described timbre and environment while preserving the source content. The model is trained with latent diffusion modeling on synthetic data in which vocal and environment features are decoupled, which eliminates interference between the two attributes. The Retrieval-Based Timbre Control (RBTC) module enables precise timbre manipulation from abstract descriptions without paired data. Experiments confirm that TES-VC generates contextually appropriate speech in both timbre and environment, with high content retention and superior controllability, demonstrating its potential for widespread applications.
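The abstract does not say how RBTC performs retrieval, so the following is only a generic sketch of retrieval-based conditioning, not the authors' method. It assumes a hypothetical bank of (description embedding, timbre embedding) pairs, ranks entries by cosine similarity to the embedding of the text description, and blends the top matches with softmax weights. The names `retrieve_timbre`, `BANK`, `top_k`, and `temperature` are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve_timbre(query_emb, bank, top_k=2, temperature=0.1):
    """Blend the timbre embeddings whose descriptions best match the query.

    bank is a list of (description_embedding, timbre_embedding) pairs.
    This is a hypothetical layout; the paper's actual bank is not specified.
    Returns (blended_timbre_embedding, indices_of_retrieved_entries).
    """
    # Score every bank entry against the query, keep the top_k best.
    scored = sorted(
        ((cosine(query_emb, desc), i) for i, (desc, _) in enumerate(bank)),
        reverse=True,
    )[:top_k]

    # Softmax over the retained similarities (shifted for numerical stability).
    best = scored[0][0]
    exps = [math.exp((s - best) / temperature) for s, _ in scored]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the retrieved timbre embeddings.
    dim = len(bank[0][1])
    blended = [
        sum(w * bank[i][1][k] for w, (_, i) in zip(weights, scored))
        for k in range(dim)
    ]
    return blended, [i for _, i in scored]


# Toy bank: 2-D description embeddings paired with 3-D timbre embeddings.
BANK = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 0.0]),
    ([0.7, 0.7], [0.0, 0.0, 1.0]),
]

timbre, picked = retrieve_timbre([0.9, 0.1], BANK, top_k=2)
print(picked, timbre)
```

In a pipeline of this kind, the blended vector would be passed to the converter as the timbre condition. Because the lookup relies only on similarity in the description embedding space, it needs no paired (description, converted speech) examples, which is in the spirit of the "without paired data" claim.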
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Voice and Speech Disorders
