The Use of Voice Source Features for Sung Speech Recognition
Gerardo Roa Dabike, Jon Barker

TL;DR
This study investigates whether vocal source features like pitch, shimmer, and jitter can enhance sung speech recognition, finding limited overall improvements but some benefits in phoneme discrimination.
Contribution
It demonstrates the impact of vocal source features on sung speech recognition and highlights differences from spoken speech, providing insights for future acoustic modeling.
Findings
Pitch plus voicing reduces WER in small training sets.
Voice quality features do not significantly improve recognition.
Voicing features aid voiced/unvoiced phoneme discrimination.
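The findings above are reported in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the usual definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are illustrative assumptions, not the scoring setup used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For a four-word reference, one substituted word and one dropped word give two errors, so the WER is 0.5.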
Abstract
In this paper, we ask whether vocal source features (pitch, shimmer, jitter, etc.) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in sung vs. spoken voicing characteristics, including pitch range, syllable duration, vibrato, jitter and shimmer. We then use this analysis to inform speech recognition experiments on the sung speech DSing corpus, using a state-of-the-art acoustic model and augmenting conventional features with various voice source parameters. Experiments are run with three standard (increasingly large) training sets: DSing1 (15.1 hours), DSing3 (44.7 hours) and DSing30 (149.1 hours). Pitch combined with degree of voicing produces a significant decrease in WER from…
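To make the voice quality terms in the abstract concrete, here is a rough sketch of the common "local" definitions of jitter and shimmer: the mean absolute difference between consecutive glottal cycle periods (jitter) or peak amplitudes (shimmer), divided by the mean period or amplitude. The helper name `local_perturbation` and the sample values are hypothetical, and the paper may well extract these features with a different toolkit or variant of the measure.

```python
from typing import Sequence


def local_perturbation(values: Sequence[float]) -> float:
    """Mean absolute cycle-to-cycle difference, relative to the mean value.

    Applied to cycle periods this gives local jitter; applied to
    per-cycle peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Hypothetical measurements for four consecutive glottal cycles.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]  # cycle durations in seconds
peak_amplitudes = [0.80, 0.78, 0.82, 0.80]    # arbitrary linear units

jitter = local_perturbation(periods_s)
shimmer = local_perturbation(peak_amplitudes)
print(f"jitter={jitter:.4f}, shimmer={shimmer:.4f}")
```

A perfectly periodic voice with constant amplitude would score zero on both measures; larger values indicate more cycle-to-cycle irregularity.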
