Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models

Xiangming Gu; Wei Zeng; Jianan Zhang; Longshen Ou; Ye Wang

arXiv:2304.12082 · cs.SD · April 25, 2023 · 1 citation

TL;DR

This paper introduces a multimodal, self-supervised learning approach for singing voice transcription that leverages audio and video data, significantly improving noise robustness and reducing the need for annotated data.

Contribution

It presents a new multimodal singing dataset, N20EMv2, and adapts self-supervised learning models from the speech domain to singing voice transcription, enhancing performance and generalization.

Findings

01 The audio-only system outperforms state-of-the-art methods.

02 The video-only system achieves about 80% accuracy in note detection.

03 The audio-visual system significantly improves noise robustness.

Abstract

Singing voice transcription converts recorded singing audio to musical notation. Sound contamination (such as accompaniment) and lack of annotated data make singing voice transcription an extremely difficult task. We take two approaches to tackle the above challenges: 1) introducing multimodal learning for singing voice transcription together with a new multimodal singing dataset, N20EMv2, enhancing noise robustness by utilizing video information (lip movements to predict the onset/offset of notes), and 2) adapting self-supervised learning models from the speech domain to the singing voice transcription task, significantly reducing annotated data requirements while preserving pretrained features. We build a self-supervised learning based audio-only singing voice transcription system, which not only outperforms current state-of-the-art technologies as a strong baseline, but also…
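The abstract names accompaniment as a typical source of sound contamination but, as excerpted here, does not say how noisy test conditions are built. One common way to probe noise robustness is to mix a clean vocal with an interfering signal at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that general technique only; the function names `mix_at_snr` and `measured_snr_db` are hypothetical and are not taken from the paper or its repository.

```python
import math


def mix_at_snr(voice, noise, snr_db):
    """Return voice + scaled noise so that the mixture has the requested SNR.

    Assumption: `voice` and `noise` are equal-length sequences of float samples.
    This is a generic illustration, not the paper's evaluation protocol.
    """
    if len(voice) != len(noise):
        raise ValueError("voice and noise must have the same length")
    voice_power = sum(v * v for v in voice) / len(voice)
    noise_power = sum(n * n for n in noise) / len(noise)
    if voice_power == 0 or noise_power == 0:
        raise ValueError("both signals must have non-zero power")
    # Target noise power is voice_power / 10^(snr_db / 10); solve for the gain.
    gain = math.sqrt(voice_power / (noise_power * 10 ** (snr_db / 10)))
    return [v + gain * n for v, n in zip(voice, noise)]


def measured_snr_db(voice, mixture):
    """Measure the SNR of `mixture`, treating (mixture - voice) as the noise."""
    residual = [m - v for v, m in zip(voice, mixture)]
    voice_power = sum(v * v for v in voice) / len(voice)
    residual_power = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(voice_power / residual_power)


if __name__ == "__main__":
    # Toy signals: a 220 Hz "voice" and a 330 Hz "accompaniment" at 16 kHz.
    sr = 16000
    voice = [math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
    noise = [0.3 * math.sin(2 * math.pi * 330 * t / sr) for t in range(sr)]
    for target in (10.0, 0.0, -5.0):
        mixture = mix_at_snr(voice, noise, target)
        print(target, round(measured_snr_db(voice, mixture), 3))
```

Sweeping the target SNR from high to low and re-running a transcription system on each mixture gives a simple robustness curve, which is the kind of comparison where an audio-visual system would be expected to degrade more slowly than an audio-only one.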

Code & Models

Repositories

guxm2021/svt_speechbrain
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis