A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion
Xu Li, Shansong Liu, Ying Shan

TL;DR
This paper introduces a hierarchical speaker representation framework for one-shot singing voice conversion, capturing fine-grained speaker characteristics at multiple levels to improve conversion quality with minimal reference audio.
Contribution
It proposes a novel hierarchical framework that models speaker characteristics at different granularities, outperforming SVC methods that rely on a single embedding vector from a speaker lookup table (LUT) or a speaker recognition network (SRN).
Findings
Outperforms both LUT-based and SRN-based SVC systems.
Supports one-shot SVC with only a few seconds of reference audio.
Effectively captures fine-grained speaker characteristics.
Abstract
Typically, singing voice conversion (SVC) depends on an embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing contains more expressive speaker characteristics than conversational speech. It is suspected that a single embedding vector may only capture averaged and coarse-grained speaker characteristics, which is insufficient for the SVC task. To this end, this work proposes a novel hierarchical speaker representation framework for SVC, which can capture fine-grained speaker characteristics at different granularity. It consists of an up-sampling stream and three down-sampling streams. The up-sampling stream transforms the linguistic features into audio samples, while one down-sampling stream of the three operates in the reverse direction. It is expected that the temporal statistics of each…
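The contrast the abstract draws, between one averaged embedding vector and speaker statistics gathered at several granularities, can be illustrated with a small sketch. The abstract is truncated, so everything below is an assumption made for illustration and not the authors' architecture: average pooling stands in for a learned down-sampling block, per-dimension mean and standard deviation stand in for "temporal statistics", and the names `downsample`, `temporal_stats`, and `hierarchical_speaker_stats` are hypothetical.

```python
from statistics import fmean, pstdev


def downsample(frames, factor=2):
    """Average-pool feature frames along time (assumed stand-in for a learned block)."""
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        window = frames[start:start + factor]
        # Average each feature dimension across the frames in the window.
        pooled.append([fmean(dim) for dim in zip(*window)])
    return pooled


def temporal_stats(frames):
    """Per-dimension mean and population standard deviation over the time axis."""
    dims = list(zip(*frames))
    return [fmean(d) for d in dims], [pstdev(d) for d in dims]


def hierarchical_speaker_stats(frames, num_levels, factor=2):
    """Collect (mean, std) after each successive down-sampling level.

    Each level sees a coarser time resolution, so the returned list holds
    one statistics pair per granularity instead of a single vector.
    """
    levels = []
    current = frames
    for _ in range(num_levels):
        current = downsample(current, factor)
        levels.append(temporal_stats(current))
    return levels


if __name__ == "__main__":
    # Four toy frames with two feature dimensions each (made-up values).
    reference = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
    for depth, (mean, std) in enumerate(hierarchical_speaker_stats(reference, num_levels=2), 1):
        print(f"level {depth}: mean={mean} std={std}")
```

In a real system the pooled features would come from trained network layers and the statistics would condition the up-sampling stream; the sketch only shows how several levels yield several statistic pairs where a LUT or SRN yields one vector.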
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Stable Rank Normalization
