Singing voice phoneme segmentation by hierarchically inferring syllable and phoneme onset positions
Rong Gong, Xavier Serra

TL;DR
This paper introduces a language-independent, hierarchical approach to singing voice phoneme segmentation that combines a CNN with a duration-informed HMM, and reports better segmentation accuracy than a baseline HSMM forced-alignment method on a new jingju singing dataset.
Contribution
It presents a two-step method that combines CNN-based syllable and phoneme onset detection with duration-informed HMM inference and uses no phoneme class labels.
Findings
Outperforms baseline HSMM forced alignment in segmentation accuracy
Effective in language-independent singing voice phoneme segmentation
Validated on a newly collected jingju singing dataset
Abstract
In this paper, we tackle the singing voice phoneme segmentation problem in the singing training scenario by using language-independent information -- onset and prior coarse duration. We propose a two-step method. In the first step, we jointly calculate the syllable and phoneme onset detection functions (ODFs) using a convolutional neural network (CNN). In the second step, the syllable and phoneme boundaries and labels are inferred hierarchically by using a duration-informed hidden Markov model (HMM). To achieve the inference, we incorporate the a priori duration model as the transition probabilities and the ODFs as the emission probabilities into the HMM. The proposed method is designed in a language-independent way such that no phoneme class labels are used. For the model training and algorithm evaluation, we collect a new jingju (also known as Beijing or Peking opera) solo singing…
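The second step described in the abstract can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the function name `infer_boundaries`, the Gaussian duration model, and the `sigma_ratio` parameter are assumptions made for illustration. The sketch only shows the general idea of scoring boundary positions with a duration prior (playing the role of transition probabilities) and an onset detection function (playing the role of emission probabilities).

```python
import numpy as np


def infer_boundaries(odf, prior_durations, sigma_ratio=0.35, eps=1e-6):
    """Place segment boundaries on an onset detection function (ODF).

    odf: 1-D array with one onset strength per frame, values in [0, 1].
    prior_durations: expected length of each segment, in frames.
    Returns boundary frames [0, b_1, ..., n_frames].
    Assumes len(odf) >= len(prior_durations).
    """
    odf = np.asarray(odf, dtype=float)
    n_frames = len(odf)
    n_seg = len(prior_durations)
    log_odf = np.log(odf + eps)

    # best[k, t]: best log score with the k-th boundary placed at frame t.
    best = np.full((n_seg + 1, n_frames + 1), -np.inf)
    back = np.zeros((n_seg + 1, n_frames + 1), dtype=int)
    best[0, 0] = 0.0  # the first onset is pinned to frame 0

    for k in range(1, n_seg + 1):
        mu = float(prior_durations[k - 1])
        sigma = sigma_ratio * mu  # assumed spread of the duration prior
        # The final boundary is pinned to the end of the excerpt.
        candidates = [n_frames] if k == n_seg else range(k, n_frames)
        for t in candidates:
            starts = np.arange(k - 1, t)
            # Unnormalised Gaussian log-prior on the segment duration.
            trans = -0.5 * ((t - starts - mu) / sigma) ** 2
            total = best[k - 1, starts] + trans
            j = int(np.argmax(total))
            # Internal boundaries are rewarded for landing on ODF peaks.
            emit = log_odf[t] if k < n_seg else 0.0
            best[k, t] = total[j] + emit
            back[k, t] = starts[j]

    # Backtrace from the final boundary to frame 0.
    bounds = [n_frames]
    for k in range(n_seg, 0, -1):
        bounds.append(back[k, bounds[-1]])
    return [int(b) for b in bounds[::-1]]


# Toy ODF: 100 frames with clear peaks at frames 30 and 70.
toy_odf = np.full(100, 0.01)
toy_odf[[30, 70]] = 1.0
print(infer_boundaries(toy_odf, [25, 45, 30]))  # [0, 30, 70, 100]
```

In the toy example the coarse prior durations (25, 45, 30 frames) are slightly off, and the ODF peaks pull the boundaries to frames 30 and 70. A hierarchical variant could call the same function once with the syllable ODF over the whole excerpt and then once per inferred syllable with the phoneme ODF, though how the paper couples the two levels is not specified in this summary.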
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
