Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas

TL;DR
This paper enhances style transfer for vocal effects by integrating a Gaussian prior into inference-time optimisation, significantly improving realism and accuracy over previous methods.
Contribution
It introduces a Gaussian prior over the effect parameter space, derived from the DiffVox vocal preset dataset, into the inference-time optimisation process, enabling more realistic and accurate vocal effects transfer.
Findings
Parameter mean squared error reduced by up to 33%
Significant improvements in style matching metrics
Subjective evaluations favour the proposed method
Abstract
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to an audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can result in unrealistic configurations or biased outcomes. We address this pitfall by introducing a Gaussian prior derived from the DiffVox vocal preset dataset over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The…
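To make the maximum-a-posteriori reading concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a hypothetical linear map W standing in for the effects chain plus style encoder, synthetic "presets" in place of DiffVox, and a hypothetical prior weight alpha. Under those assumptions, minimising the embedding distance plus the negative log of a Gaussian prior gives a regularised objective that plain gradient descent can solve.

```python
import numpy as np

def fit_gaussian_prior(presets, eps=1e-6):
    """Estimate the mean and precision matrix from rows of preset parameters."""
    mu = presets.mean(axis=0)
    # A small diagonal term keeps the covariance invertible.
    cov = np.cov(presets, rowvar=False) + eps * np.eye(presets.shape[1])
    return mu, np.linalg.inv(cov)

def map_estimate(W, target_emb, mu, precision, alpha=1.0, steps=500):
    """Gradient descent on
        0.5 * ||W @ theta - target_emb||^2
        + 0.5 * alpha * (theta - mu)^T precision (theta - mu),
    i.e. embedding distance plus a weighted negative log Gaussian prior
    (up to an additive constant). W is a hypothetical linear stand-in for
    the real, non-linear effects chain and encoder.
    """
    hessian = W.T @ W + alpha * precision
    # Step size 1/L, with L the largest Hessian eigenvalue, guarantees descent.
    step = 1.0 / np.linalg.eigvalsh(hessian).max()
    theta = mu.copy()  # start the search at the prior mean
    for _ in range(steps):
        grad = W.T @ (W @ theta - target_emb) + alpha * precision @ (theta - mu)
        theta = theta - step * grad
    return theta

# Synthetic stand-in for a preset collection: 200 presets, 3 parameters each.
rng = np.random.default_rng(0)
presets = rng.normal(size=(200, 3))
mu, precision = fit_gaussian_prior(presets)

# Hypothetical 2-D "style embedding" of a 3-D parameter vector.
W = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, -0.5]])
target_emb = W @ np.array([0.8, -0.3, 0.5])  # embedding of the reference

theta_map = map_estimate(W, target_emb, mu, precision, alpha=1.0)
```

In this toy the embedding has fewer dimensions than the parameter vector, so matching the embedding alone leaves the parameters underdetermined. The prior term selects the configuration closest to the preset distribution, which is the intuition behind calibrating the optimisation with a prior.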
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
