Analysis and transformations of voice level in singing voice
Frederik Bous, Axel Roebel

TL;DR
This paper presents a neural auto-encoder framework for estimating and transforming voice level in recordings of singing voice. It combines a neural voice level estimator with a per-recording recording factor so that musical dynamics can be controlled and voice level can be disentangled from other spectral features.
Contribution
It introduces a neural voice level estimator and a method for adding voice level control to a singing voice auto-encoder, so that recordings without voice level annotations can still be used.
Findings
The voice level models accurately encode true voice level information.
Perceptual tests show that the model's transformations align with perceived dynamic changes.
The approach effectively disentangles voice level from other spectral features.
Abstract
We introduce a neural auto-encoder that transforms the musical dynamic in recordings of singing voice via changes in voice level. Since most recordings of singing voice are not annotated with voice level, we propose a means to estimate the voice level from the signal's timbre using a neural voice level estimator. We introduce the recording factor that relates the voice level to the recorded signal power as a proportionality constant. This unknown constant depends on the recording conditions and the post-processing and may thus be different for each recording (but is constant within each recording). We provide two approaches to estimate the voice level without knowing the recording factor. The unknown recording factor can either be learned alongside the weights of the voice level estimator, or a special loss function based on the scalar product can be used to only match the contour of the…
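The second approach in the abstract, a loss based on the scalar product, can be illustrated with a minimal sketch. It is not the paper's actual loss function: it assumes that voice level and recorded power are both expressed as linear (not dB) per-frame contours, and it uses one minus the cosine similarity as a stand-in for a scalar-product-based loss. The names `contour_loss`, `v`, `c` and `p` are hypothetical and chosen for illustration only.

```python
import numpy as np


def contour_loss(predicted, observed, eps=1e-12):
    """One minus the cosine similarity of two contours.

    Assumption for illustration: both inputs are non-negative, linear-scale
    per-frame values. The result is (numerically) zero whenever the two
    contours are proportional with a positive factor, so an unknown
    recording factor does not affect it.
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    dot = np.dot(predicted, observed)
    norm = np.linalg.norm(predicted) * np.linalg.norm(observed)
    return 1.0 - dot / (norm + eps)


# Hypothetical voice level contour (linear scale, arbitrary units).
v = np.array([1.0, 2.0, 4.0, 3.0])

# Two recordings of the same contour with different recording factors c.
for c in (0.01, 50.0):
    p = c * v  # recorded power, proportional to the voice level
    print(f"c={c}: loss={contour_loss(v, p):.2e}")

# A different contour (here: time-reversed) yields a clearly non-zero loss.
print(f"mismatch: loss={contour_loss(v, v[::-1]):.3f}")
```

Because the normalisation removes any positive scale, such a loss only compares the shape of the contours over time. This is the property needed when the recording factor is unknown but constant within a recording.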
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
