Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature
Kyungguen Byun, Sunkuk Moon, and Erik Visser

TL;DR
This paper introduces a diffusion-based voice conversion system that enables detailed control over prosody at the frame level, including pitch, energy, and speaking rate, while maintaining high speech quality.
Contribution
The proposed model combines frame-level prosody features (pitch and energy trajectories) with a diffusion-based decoder to manipulate voice and prosody together, and adds a post-processing step based on a self-supervised model to control speaking rate.
Findings
Comparable speech quality to state-of-the-art methods
Improved intelligibility in converted speech
Effective modulation of pitch, energy, and speaking rate
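To make the frame-level controls concrete, here is a minimal sketch of how pitch, energy, and speaking-rate edits could be applied to per-frame trajectories. It is not the authors' implementation. The function names (`shift_pitch`, `scale_energy`, `resample_frames`), the semitone and decibel parameterizations, and the linear-interpolation time stretch are all assumptions made for illustration. In particular, the paper controls speaking rate with a post-processing step based on a self-supervised model, not with the simple resampling shown here.

```python
import math


def shift_pitch(f0_hz, semitones):
    """Shift voiced frames by `semitones`; a value of 0.0 marks an unvoiced frame."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]


def scale_energy(energy, gain_db):
    """Apply a gain in dB to a per-frame amplitude trajectory."""
    gain = 10.0 ** (gain_db / 20.0)
    return [e * gain for e in energy]


def resample_frames(values, rate):
    """Linearly interpolate a trajectory to about len(values) / rate frames.

    rate > 1 shortens the trajectory (faster speech), rate < 1 lengthens it.
    Hypothetical stand-in for the paper's learned speaking-rate control.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not values:
        return []
    n_in = len(values)
    n_out = max(1, round(n_in / rate))
    if n_in == 1 or n_out == 1:
        return [values[0]] * n_out
    out = []
    for i in range(n_out):
        # Position of output frame i on the input frame axis.
        pos = i * (n_in - 1) / (n_out - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out
```

For example, `shift_pitch(f0, 12)` raises every voiced frame by one octave, and `resample_frames(f0, 2.0)` halves the number of frames. Note that plain interpolation blends voiced and unvoiced pitch values at boundaries, which a real system would need to handle separately.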
Abstract
We propose a highly controllable voice manipulation system that can perform any-to-any voice conversion (VC) and prosody modulation simultaneously. State-of-the-art VC systems can transfer sentence-level characteristics such as speaker, emotion, and speaking style. However, manipulating frame-level prosody, such as pitch, energy, and speaking rate, remains challenging. Our proposed model utilizes a frame-level prosody feature to effectively transfer such properties. Specifically, pitch and energy trajectories are integrated in a prosody conditioning module and then fed alongside speaker and content embeddings to a diffusion-based decoder that generates a converted speech mel-spectrogram. To adjust the speaking rate, our system includes a post-processing step based on a self-supervised model, which allows improved controllability. The proposed model showed comparable speech quality and…
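The conditioning path described in the abstract can be sketched as follows. This is an illustrative assumption, not the paper's architecture. It log-compresses pitch and energy per frame, concatenates them with the content embedding for the same frame, and appends an utterance-level speaker embedding repeated on every frame. The name `build_conditioning`, the `log1p` compression, and the use of plain concatenation are hypothetical; in the paper, the prosody conditioning module and the diffusion-based decoder are learned networks.

```python
import math


def build_conditioning(f0_hz, energy, content, speaker):
    """Build one conditioning vector per frame (hypothetical layout).

    f0_hz   : per-frame pitch in Hz, 0.0 for unvoiced frames
    energy  : per-frame non-negative energy
    content : per-frame content embeddings (list of equal-length lists)
    speaker : a single utterance-level speaker embedding (list of floats)

    Each output frame is [log1p(f0), log1p(energy)] + content[t] + speaker.
    """
    n_frames = len(f0_hz)
    if len(energy) != n_frames or len(content) != n_frames:
        raise ValueError("pitch, energy, and content must have one entry per frame")

    frames = []
    for t in range(n_frames):
        # Frame-level prosody features vary with t.
        prosody = [math.log1p(f0_hz[t]), math.log1p(energy[t])]
        # The speaker embedding is constant across the utterance.
        frames.append(prosody + list(content[t]) + list(speaker))
    return frames
```

A decoder conditioned this way sees prosody that changes frame by frame, while speaker identity stays fixed across the utterance. That separation is what would let pitch or energy be edited without changing the target voice.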
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
