Controllable and Interpretable Singing Voice Decomposition via Assem-VC
Kang-wook Kim, Junhyeok Lee

TL;DR
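This paper introduces a singing voice decomposition system built on Assem-VC. It separates a singing voice into linguistic content, pitch, and speaker identity, then resynthesizes it with a target speaker's embedding, so the converted voice stays synchronized with the user's original for a duet.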
Contribution
The paper presents a singing voice decomposition method that encodes time-aligned linguistic content, pitch, and source speaker identity, so that the speaker-independent components can be recombined with a target speaker's embedding for voice conversion.
Findings
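Produced a duet in which the user's singing voice and the target singer's converted voice are perfectly synchronized.
Converted singing voices by combining the decomposed speaker-independent information with a target speaker's embedding.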
Demonstrated controllable and interpretable voice synthesis.
Abstract
We propose a singing decomposition system that encodes time-aligned linguistic content, pitch, and source speaker identity via Assem-VC. With decomposed speaker-independent information and the target speaker's embedding, we could synthesize the singing voice of the target speaker. In conclusion, we made a perfectly synced duet with the user's singing voice and the target singer's converted singing voice.
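The decomposition described in the abstract can be pictured as a simple data flow. The sketch below is only an illustration under stated assumptions, not the Assem-VC implementation: the `DecomposedVoice` container, the `swap_speaker` and `shares_timing` helpers, and all the toy values are hypothetical, and no neural encoder or decoder is involved. It shows one plausible reading of why the converted voice can stay in sync with the original: the frame-level content and pitch are reused unchanged, and only the speaker embedding is replaced.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class DecomposedVoice:
    """Hypothetical container for the three factors named in the abstract."""
    linguistic: Tuple[str, ...]  # one time-aligned content label per frame
    pitch_hz: Tuple[float, ...]  # one pitch value per frame
    speaker: Tuple[float, ...]   # utterance-level speaker embedding


def swap_speaker(source: DecomposedVoice,
                 target_embedding: Sequence[float]) -> DecomposedVoice:
    """Keep frame-level content and pitch; replace only the speaker embedding."""
    return replace(source, speaker=tuple(target_embedding))


def shares_timing(a: DecomposedVoice, b: DecomposedVoice) -> bool:
    """True when both voices carry identical frame-level content and pitch."""
    return a.linguistic == b.linguistic and a.pitch_hz == b.pitch_hz


# Toy example: four frames of a user's singing, with made-up values.
user = DecomposedVoice(
    linguistic=("l", "l", "a", "a"),
    pitch_hz=(220.0, 220.0, 246.9, 246.9),
    speaker=(0.1, 0.9),
)

# Assumed target-singer embedding; in a real system this would come
# from a speaker encoder or a learned embedding table.
converted = swap_speaker(user, [0.8, 0.2])
```

In this toy setting, `converted` differs from `user` only in its `speaker` field, so a decoder conditioned on both inputs would render the same words and melody at the same frame positions in a different voice.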
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
