Controllable and Interpretable Singing Voice Decomposition via Assem-VC

Kang-wook Kim; Junhyeok Lee

arXiv:2110.12676·eess.AS·October 26, 2021

TL;DR

This paper uses Assem-VC to decompose a singing voice into time-aligned linguistic content, pitch, and speaker identity, giving controllable and interpretable singing voice conversion whose output stays synchronized with the source singing.

Contribution

The paper presents a singing voice decomposition system, built on Assem-VC, that encodes time-aligned linguistic content, pitch, and source speaker identity so that the singing can be resynthesized with a target speaker's embedding.

Findings

01

Produced a perfectly synchronized duet between the user's singing voice and the target singer's converted singing voice.

02

Converted singing voices to a target speaker using that speaker's embedding.

03

Demonstrated controllable and interpretable voice synthesis.

Abstract

We propose a singing decomposition system that encodes time-aligned linguistic content, pitch, and source speaker identity via Assem-VC. With decomposed speaker-independent information and the target speaker's embedding, we could synthesize the singing voice of the target speaker. In conclusion, we made a perfectly synced duet with the user's singing voice and the target singer's converted singing voice.
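
The decomposition described in the abstract can be pictured with a minimal sketch. This is not the Assem-VC code or its API: the `DecomposedVoice` container, the `convert` function, and the toy values below are all hypothetical names and data chosen for illustration. The sketch only shows the idea that conversion swaps the speaker embedding while leaving the frame-aligned content and pitch untouched, which is why a converted voice would line up frame for frame with the source.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class DecomposedVoice:
    """Hypothetical container for the three factors named in the abstract."""
    content: Tuple[str, ...]     # one linguistic unit per frame (assumed representation)
    pitch_hz: Tuple[float, ...]  # one pitch value per frame, 0.0 meaning unvoiced
    speaker: Tuple[float, ...]   # fixed-size speaker embedding

    def __post_init__(self) -> None:
        # Content and pitch are time-aligned, so they must cover the same frames.
        if len(self.content) != len(self.pitch_hz):
            raise ValueError("content and pitch must have one entry per frame")


def convert(source: DecomposedVoice, target_speaker: Sequence[float]) -> DecomposedVoice:
    """Swap only the speaker factor; timing, content, and pitch are kept as-is."""
    return replace(source, speaker=tuple(target_speaker))


# Toy example: four frames of a sung syllable (values are made up).
user = DecomposedVoice(
    content=("l", "l", "a", "a"),
    pitch_hz=(220.0, 220.0, 233.1, 0.0),
    speaker=(0.1, 0.2),
)
target = convert(user, [0.9, 0.8])
```

Because `convert` never changes the number of frames, a waveform synthesized from `target` would have the same timeline as one synthesized from `user`, under the assumption that the synthesizer maps each frame to a fixed number of samples.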

Code & Models

Repositories

mindslab-ai/assem-vc
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing