In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion

Jiawei Jin; Zhihan Yang; Yixuan Zhou; Zhiyong Wu

arXiv:2506.07036 · cs.SD · June 16, 2025

TL;DR

This paper introduces TES-VC, a novel text-driven voice conversion framework that independently controls speaker timbre and environmental acoustics, achieving high-quality, contextually appropriate speech synthesis.

Contribution

The paper presents a new framework that enables independent control of speaker and environment attributes in speech conversion using text inputs and latent diffusion modeling.

Findings

01. Effective generation of speech matching the target environment and timbre

02. High content retention in converted speech

03. Superior controllability over speech attributes

Abstract

We propose TES-VC (Text-driven Environment and Speaker controllable Voice Conversion), a text-driven voice conversion framework with independent control of speaker timbre and environmental acoustics. TES-VC takes simultaneous text inputs describing the target voice and the target environment, and accurately generates speech that matches the described timbre and environment while preserving the source content. Trained on synthetic data with decoupled vocal and environment features via latent diffusion modeling, our method eliminates interference between attributes. The Retrieval-Based Timbre Control (RBTC) module enables precise manipulation using abstract descriptions without paired data. Experiments confirm that TES-VC generates contextually appropriate speech in both timbre and environment, with high content retention and superior controllability, demonstrating its potential for widespread applications.
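
The abstract does not describe how RBTC works internally, so the following is only a minimal sketch of what a generic retrieval-based lookup could look like, not the authors' implementation. It assumes a hypothetical bank of (description embedding, timbre embedding) pairs and a text query that has already been embedded; the names `cosine`, `retrieve_timbre`, and `bank`, the top-k selection, and the softmax weighting are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_timbre(query, bank, k=2):
    """Hypothetical retrieval step (an assumption, not the paper's RBTC).

    query: embedding of the text description of the target voice.
    bank:  list of (description_embedding, timbre_embedding) pairs.
    Returns a softmax-weighted average of the timbre embeddings of the
    k entries whose description embeddings are most similar to the query.
    """
    if not bank:
        raise ValueError("bank must contain at least one entry")
    # Rank bank entries by similarity to the query and keep the top k.
    top = sorted(bank, key=lambda e: cosine(query, e[0]), reverse=True)[:k]
    sims = [cosine(query, desc) for desc, _ in top]
    # Numerically stable softmax over the similarities.
    peak = max(sims)
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    dim = len(top[0][1])
    return [
        sum(w / total * timbre[d] for w, (_, timbre) in zip(weights, top))
        for d in range(dim)
    ]


# Toy bank with 2-D description embeddings and 2-D timbre embeddings.
bank = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([-1.0, 0.0], [5.0, 5.0]),
]
nearest = retrieve_timbre([1.0, 0.0], bank, k=1)
blended = retrieve_timbre([1.0, 0.0], bank, k=2)
```

With `k=1` the lookup returns the single closest timbre embedding; with larger `k` it blends neighbours, which is one plausible way to map an abstract description onto a timbre without paired text/speech data.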

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Voice and Speech Disorders