A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Xu Li; Shansong Liu; Ying Shan

arXiv:2206.13762·eess.AS·July 7, 2022

Open Access

TL;DR

This paper introduces a hierarchical speaker representation framework for one-shot singing voice conversion, capturing fine-grained speaker characteristics at multiple levels to improve conversion quality with minimal reference audio.

Contribution

It proposes a novel hierarchical framework that models speaker features at different granularities, outperforming existing embedding-based methods in singing voice conversion.

Findings

01. Outperforms LUT- and SRN-based SVC systems.

02. Supports one-shot SVC with only a few seconds of reference audio.

03. Effectively captures fine-grained speaker characteristics.

Abstract

Typically, singing voice conversion (SVC) depends on an embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing contains more expressive speaker characteristics than conversational speech. It is suspected that a single embedding vector may only capture averaged and coarse-grained speaker characteristics, which is insufficient for the SVC task. To this end, this work proposes a novel hierarchical speaker representation framework for SVC, which can capture fine-grained speaker characteristics at different granularities. It consists of an up-sampling stream and three down-sampling streams. The up-sampling stream transforms the linguistic features into audio samples, while one down-sampling stream of the three operates in the reverse direction. It is expected that the temporal statistics of each…
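The contrast the abstract draws, between one averaged embedding and statistics gathered at several temporal resolutions, can be illustrated with a toy sketch. This is not the paper's model: the streams described above are learned networks, whereas here fixed average pooling stands in for down-sampling, and the choice of per-dimension mean and standard deviation as the "temporal statistics" is an assumption. The function names (`avg_pool`, `temporal_stats`, `hierarchical_representation`), the pooling factors, and the toy feature values are all hypothetical.

```python
from statistics import fmean, pstdev


def avg_pool(frames, factor):
    """Down-sample frames in time by averaging each group of `factor` frames.

    Assumption: fixed averaging replaces the learned down-sampling layers.
    """
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        pooled.append([fmean(col) for col in zip(*group)])
    return pooled


def temporal_stats(frames):
    """Return per-dimension (mean, standard deviation) computed over time."""
    cols = list(zip(*frames))
    return [fmean(c) for c in cols], [pstdev(c) for c in cols]


def hierarchical_representation(frames, factors=(1, 2, 4)):
    """Collect one (mean, std) pair per temporal resolution.

    A single-embedding system would keep only one such summary vector.
    """
    return [temporal_stats(avg_pool(frames, f)) for f in factors]


# Toy reference clip: 8 frames of 2-dimensional features (made-up values).
reference = [[float(t), float(t % 2)] for t in range(8)]
levels = hierarchical_representation(reference)

for factor, (mean, std) in zip((1, 2, 4), levels):
    print(f"factor={factor} mean={mean} std={std}")
```

In this toy clip the mean is identical at every level, while the standard deviation changes with the pooling factor. That is the intuition behind keeping several levels: a single averaged vector would discard the variation that the finer levels still carry.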

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Stable Rank Normalization