Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior

Chin-Yun Yu; Marco A. Martínez-Ramírez; Junghyun Koo; Wei-Hsiang Liao; Yuki Mitsufuji; György Fazekas

arXiv:2505.11315·cs.SD·October 20, 2025

TL;DR

This paper enhances style transfer for vocal effects by integrating a Gaussian prior into inference-time optimisation, significantly improving realism and accuracy over previous methods.

Contribution

The paper adds a Gaussian prior, derived from the DiffVox vocal preset dataset, to the inference-time optimisation process, so that the estimated effect parameters are more realistic and more accurate.

Findings

01 Parameter mean squared error reduced by up to 33%

02 Significant improvements in style matching metrics

03 Subjective evaluations favor the proposed method

Abstract

Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to an audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can result in unrealistic configurations or biased outcomes. We address this pitfall by introducing a Gaussian prior derived from the DiffVox vocal preset dataset over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The…
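The optimisation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear map `A` is a hypothetical stand-in for "apply effects, then compute a style embedding", the squared Euclidean distance and the weight `alpha` are assumptions, and `mu` and `sigma` are made-up values where the paper would use statistics estimated from DiffVox presets. The sketch only shows why adding a Gaussian negative log-prior to the embedding distance gives a maximum-a-posteriori objective.

```python
import numpy as np

def objective(theta, A, z_ref, mu, precision, alpha):
    """Negative log-posterior up to a constant: embedding distance + Gaussian prior."""
    diff = A @ theta - z_ref   # toy stand-in for the style-embedding mismatch
    dev = theta - mu           # deviation from the prior mean
    return float(diff @ diff + 0.5 * alpha * (dev @ precision @ dev))

def map_estimate(A, z_ref, mu, precision, alpha=1.0, steps=5000):
    """Plain gradient descent on `objective`, started at the prior mean."""
    hessian = 2.0 * A.T @ A + alpha * precision
    # Step size 1/L is safe for this quadratic toy; a real system would use
    # an off-the-shelf optimiser instead.
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    theta = mu.copy()
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ theta - z_ref) + alpha * precision @ (theta - mu)
        theta = theta - lr * grad
    return theta

# Hypothetical setup: 5 effect parameters, 3-dimensional embedding, so the
# embedding alone does not pin down the parameters (A has a null space).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
mu = np.zeros(5)                              # assumed prior mean
sigma = np.diag([1.0, 0.5, 2.0, 1.0, 0.25])   # assumed prior covariance
precision = np.linalg.inv(sigma)
z_ref = A @ rng.normal(size=5)                # embedding of a synthetic reference

theta_map = map_estimate(A, z_ref, mu, precision, alpha=1.0)
```

With `alpha=0` the prior term vanishes and the objective reduces to pure embedding matching, which in this toy leaves the parameter directions in the null space of `A` unconstrained. The prior term is what selects a single, plausible configuration among them.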

Code & Models

Repositories

SonyResearch/diffvox
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis