Transcribing Lyrics From Commercial Song Audio: The First Step Towards Singing Content Processing

Che-Ping Tsai; Yi-Lin Tuan; Lin-shan Lee

arXiv:1804.05306 · cs.SD · April 17, 2018

TL;DR

This paper explores transcribing lyrics from commercial song audio. It proposes initial methods and reports a word error rate (WER) of 73.90% against a baseline of 96.21%, a first step towards singing content processing.

Contribution

The paper introduces an initial approach to lyrics transcription from song audio, using TDNN-LSTM models with data augmentation (3-fold speed perturbation) to improve recognition accuracy.

Findings

01 WER reduced from 96.21% to 73.90% with the proposed methods

02 Data augmentation with speed perturbation improves recognition

03 Singing content presents unique challenges compared to speech recognition
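The WER figures above can be made concrete with a small sketch. WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` below are illustrative, not taken from the paper, and the paper's own scoring tool may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    print(word_error_rate("let it be", "let it bee"))
    print(relative_reduction(96.21, 73.90))
```

Applied to the reported numbers, the drop from 96.21% to 73.90% is 22.31 points absolute, or roughly 23% relative to the baseline.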

Abstract

Spoken content processing (such as retrieval and browsing) is maturing, but singing content is still almost completely left out. Songs are human voices carrying plenty of semantic information, just as speech does, and may be considered a special type of speech with highly flexible prosody. The various problems in song audio, for example the significantly changing phone durations over highly flexible pitch contours, make the recognition of lyrics from song audio much more difficult. This paper reports an initial attempt towards this goal. We collected music-removed versions of English songs directly from commercial singing content. The best results were obtained by TDNN-LSTM with data augmentation by 3-fold speed perturbation plus some special approaches. The WER achieved (73.90%) was significantly lower than the baseline (96.21%), but still relatively high.
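As a rough illustration of the 3-fold speed perturbation mentioned in the abstract, the sketch below resamples a waveform at several speed factors so that each utterance appears three times in the training set. The factors 0.9, 1.0 and 1.1 are an assumption borrowed from common speech-recognition recipes; the abstract does not state which factors were used. The helper names `speed_perturb` and `three_fold` are hypothetical, and plain linear interpolation stands in for whatever resampler the authors' toolkit applies.

```python
def speed_perturb(samples, factor):
    """Resample so the audio plays `factor` times faster.

    Like tape-speed changes, this alters both duration and pitch.
    Linear interpolation is used here only to keep the sketch dependency-free.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for k in range(n_out):
        pos = min(k * factor, last)  # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def three_fold(samples, factors=(0.9, 1.0, 1.1)):
    """Return one perturbed copy per factor (factor values are assumed)."""
    return {f: speed_perturb(samples, f) for f in factors}


if __name__ == "__main__":
    wave = [float(i) for i in range(100)]
    for f, copy in three_fold(wave).items():
        print(f, len(copy))
```

With a factor of 1.0 the waveform is returned unchanged, while 0.9 and 1.1 lengthen and shorten it respectively, tripling the amount of training audio.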
