RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Anders R. Bargum; Simon Lajboschitz; Cumhur Erkut

arXiv:2408.16546 · cs.SD · August 30, 2024

TL;DR

This paper introduces RAVE for Speech, a voice conversion method for high sampling rates that emphasizes model simplicity and efficiency, achieving naturalness comparable to state-of-the-art methods while reducing inference time.

Contribution

It proposes a novel time-domain voice conversion approach using speech representation learning and latent space guidance for high-quality, efficient conversion.

Findings

01. Achieves naturalness and quality comparable to state-of-the-art methods.

02. Significantly reduces inference time.

03. Struggles with speaker similarity for unseen speakers.

Abstract

Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the speaker identity of the input to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments, or streamability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of…
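The abstract says that a content-oriented latent is conditioned on external speaker information, but it does not say how the two are combined. As a minimal sketch only, the example below assumes one common choice: concatenating a fixed speaker embedding onto every time frame of the latent. The function name `condition_on_speaker`, the toy dimensions, and all values are hypothetical and are not taken from the paper.

```python
from typing import List


def condition_on_speaker(
    content_latents: List[List[float]],
    speaker_embedding: List[float],
) -> List[List[float]]:
    """Append the same speaker embedding to every time frame.

    Assumption: conditioning by concatenation along the feature axis.
    The paper's actual conditioning mechanism may differ.
    """
    return [frame + speaker_embedding for frame in content_latents]


# Hypothetical toy values: 3 time frames of 2-dimensional content latents.
content = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
source_speaker = [1.0, 0.0]  # illustrative embedding of the input speaker
target_speaker = [0.0, 1.0]  # illustrative embedding of the target speaker

# Same content latents, different speaker embedding: a decoder trained on
# such inputs would be asked to render the content in the chosen voice.
reconstruction_input = condition_on_speaker(content, source_speaker)
conversion_input = condition_on_speaker(content, target_speaker)
```

In this sketch, only the speaker embedding changes between reconstruction and conversion. This is why the latent needs to carry linguistic content rather than speaker identity.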

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing