Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature

Kyungguen Byun, Sunkuk Moon, and Erik Visser

arXiv:2309.03364 · cs.SD · September 8, 2023

Open Access

TL;DR

This paper introduces a diffusion-based voice conversion system that enables detailed control over prosody at the frame level, including pitch, energy, and speaking rate, while maintaining high speech quality.

Contribution

The proposed model combines frame-level prosody features (pitch and energy trajectories) with a diffusion-based decoder, so that voice conversion and prosody manipulation happen in one system. Speaking rate is controlled separately, through a post-processing step based on a self-supervised model.

Findings

01 Comparable speech quality to state-of-the-art methods

02 Improved intelligibility in converted speech

03 Effective modulation of pitch, energy, and speed

Abstract

We propose a highly controllable voice manipulation system that can perform any-to-any voice conversion (VC) and prosody modulation simultaneously. State-of-the-art VC systems can transfer sentence-level characteristics such as speaker, emotion, and speaking style. However, manipulating frame-level prosody, such as pitch, energy, and speaking rate, remains challenging. Our proposed model uses a frame-level prosody feature to transfer such properties effectively. Specifically, pitch and energy trajectories are integrated in a prosody conditioning module and then fed, alongside speaker and content embeddings, to a diffusion-based decoder that generates the mel-spectrogram of the converted speech. To adjust the speaking rate, our system includes a post-processing step based on a self-supervised model, which improves controllability. The proposed model showed comparable speech quality and…
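The conditioning path described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper's prosody conditioning module is presumably a learned network, whereas the sketch below simply concatenates features per frame. The function names (`frame_energy`, `modulate_prosody`, `build_conditioning`), the frame and hop sizes, the semitone-based pitch shift, and the log scaling are all assumptions made for illustration. The diffusion decoder and the speaking-rate post-processing step are not shown.

```python
import numpy as np


def frame_energy(wave, frame_len=1024, hop=256):
    """Per-frame RMS energy of a mono waveform (assumed frame/hop sizes)."""
    n_frames = 1 + max(0, len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt(np.mean(frames ** 2, axis=1))


def modulate_prosody(f0_hz, energy, pitch_semitones=0.0, energy_gain=1.0):
    """Shift pitch on voiced frames (f0 > 0) and scale the energy trajectory."""
    voiced = f0_hz > 0
    f0_out = np.where(voiced, f0_hz * 2.0 ** (pitch_semitones / 12.0), 0.0)
    return f0_out, energy * energy_gain


def build_conditioning(content, speaker, f0_hz, energy):
    """Stack per-frame conditioning: content (T, Dc), speaker (Ds,), log-f0, log-energy.

    Hypothetical stand-in for a learned prosody conditioning module.
    """
    n_frames = content.shape[0]
    # Unvoiced frames (f0 == 0) are encoded as 0 in the log-f0 channel.
    log_f0 = np.where(f0_hz > 0, np.log(np.maximum(f0_hz, 1e-8)), 0.0)
    log_energy = np.log(energy + 1e-8)
    # The speaker embedding is sentence-level, so it is repeated on every frame.
    spk = np.broadcast_to(speaker, (n_frames, speaker.shape[0]))
    return np.concatenate(
        [content, spk, log_f0[:, None], log_energy[:, None]], axis=1
    )


# Usage: raise pitch by two semitones and boost energy before conditioning.
rng = np.random.default_rng(0)
f0 = np.array([0.0, 110.0, 120.0, 0.0])          # Hz, 0 marks unvoiced frames
energy = np.array([0.01, 0.20, 0.25, 0.02])
f0_mod, energy_mod = modulate_prosody(f0, energy, pitch_semitones=2.0, energy_gain=1.5)
cond = build_conditioning(rng.normal(size=(4, 8)), rng.normal(size=16), f0_mod, energy_mod)
```

The point of the sketch is the difference in granularity: the speaker embedding is a single vector broadcast across time, while pitch and energy vary per frame, which is what makes frame-level edits to the trajectories possible before decoding.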

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing