Robust Singing Voice Transcription Serves Synthesis

Ruiqi Li; Yu Zhang; Yongqi Wang; Zhiqing Hong; Rongjie Huang; Zhou Zhao

arXiv:2405.09940 · eess.AS · June 4, 2024

TL;DR

This paper introduces ROSVOT, a robust note-level singing voice transcription model that improves accuracy and robustness for practical singing dataset annotation, enhancing Singing Voice Synthesis applications.

Contribution

The paper presents ROSVOT, the first robust Automatic Singing Voice Transcription (AST) model tailored for Singing Voice Synthesis (SVS). It combines a multi-scale framework with an attention-based pitch decoder to enable reliable transcription in real-world conditions.

Findings

01 Achieves state-of-the-art transcription accuracy with either clean or noisy inputs

02 SVS models trained on enlarged datasets outperform their baselines

03 Demonstrates practical applicability with real-world annotations

Abstract

Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also established a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that ROSVOT achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged,…
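To make "converts singing recordings into note sequences" concrete, here is a minimal sketch of the final grouping step only: turning frame-level pitch values and note-boundary flags into (onset, offset, pitch) notes. It is not ROSVOT's architecture. The function names, the per-frame input format, and the median-pitch rule are all assumptions made for illustration; the paper's model predicts boundaries and pitches with learned components.

```python
import math
from statistics import median


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def frames_to_notes(f0_hz, boundaries, hop_s):
    """Group frames into notes (hypothetical helper, not from the paper).

    f0_hz:      per-frame F0 in Hz, with 0 marking unvoiced frames
    boundaries: per-frame flags, True where a new segment starts
    hop_s:      frame hop in seconds
    Returns a list of (onset_s, offset_s, midi_pitch) tuples.
    """
    if len(f0_hz) != len(boundaries):
        raise ValueError("f0_hz and boundaries must have the same length")
    n = len(f0_hz)
    # Frame 0 always opens a segment; every flagged frame opens another.
    starts = [i for i in range(n) if i == 0 or boundaries[i]]
    notes = []
    for k, start in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else n
        voiced = [hz_to_midi(f) for f in f0_hz[start:end] if f > 0]
        if not voiced:
            continue  # fully unvoiced segment: treat as a rest
        # Assumption: one pitch per note, taken as the rounded median of voiced frames.
        notes.append((start * hop_s, end * hop_s, round(median(voiced))))
    return notes


# Toy input: silence, three frames near A4, two frames near B4, silence.
f0 = [0, 0, 440, 440, 440, 494, 494, 0]
flags = [False, False, True, False, False, True, False, True]
notes = frames_to_notes(f0, flags, hop_s=0.5)
```

In this sketch the boundary flags play the role of frame-level segmentation and the median plays the role of the pitch decision, so noisy F0 or misplaced boundaries translate directly into wrong notes.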

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing