Singing voice phoneme segmentation by hierarchically inferring syllable and phoneme onset positions

Rong Gong; Xavier Serra

arXiv:1806.01665·cs.SD·June 6, 2018·1 cites

TL;DR

This paper introduces a language-independent, hierarchical approach to singing voice phoneme segmentation using CNNs and HMMs, and reports that it outperforms an HSMM forced-alignment baseline on a newly collected jingju singing dataset.

Contribution

It presents a novel two-step method combining CNN-based onset detection with duration-informed HMM inference, without relying on phoneme labels.

Findings

01. Outperforms baseline HSMM forced alignment in segmentation accuracy

02. Effective in language-independent singing voice phoneme segmentation

03. Validated on a newly collected jingju singing dataset

Abstract

In this paper, we tackle the singing voice phoneme segmentation problem in the singing training scenario by using language-independent information -- onset and prior coarse duration. We propose a two-step method. In the first step, we jointly calculate the syllable and phoneme onset detection functions (ODFs) using a convolutional neural network (CNN). In the second step, the syllable and phoneme boundaries and labels are inferred hierarchically by using a duration-informed hidden Markov model (HMM). To achieve the inference, we incorporate the a priori duration model as the transition probabilities and the ODFs as the emission probabilities into the HMM. The proposed method is designed in a language-independent way such that no phoneme class labels are used. For the model training and algorithm evaluation, we collect a new jingju (also known as Beijing or Peking opera) solo singing…
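The second step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `infer_onsets`, the Gaussian duration penalty, its width `sigma`, and the choice to fix the first onset at frame 0 are all assumptions made for illustration. The sketch takes a precomputed onset detection function (one value per frame, standing in for the CNN output) and a list of coarse prior durations in frames, then uses dynamic programming to pick the onset frames that best balance ODF evidence (the emission term) against agreement with the prior durations (the transition term).

```python
import math


def infer_onsets(odf, prior_durations, sigma=5.0):
    """Return one onset frame per segment (hypothetical helper).

    odf: onset detection values in (0, 1], one per frame.
    prior_durations: expected length of each segment, in frames.
    sigma: width of the assumed Gaussian duration penalty, in frames.
    """
    n_frames = len(odf)
    n_seg = len(prior_durations)
    eps = 1e-9
    neg_inf = float("-inf")

    # score[k][t]: best log-score with the k-th onset placed at frame t.
    score = [[neg_inf] * n_frames for _ in range(n_seg)]
    back = [[-1] * n_frames for _ in range(n_seg)]
    score[0][0] = 0.0  # assumption: the first segment starts at frame 0

    for k in range(1, n_seg):
        for t in range(k, n_frames):
            emit = math.log(odf[t] + eps)  # ODF acts as emission evidence
            for s in range(k - 1, t):
                if score[k - 1][s] == neg_inf:
                    continue
                # Duration prior acts as the transition log-score.
                dev = (t - s) - prior_durations[k - 1]
                cand = score[k - 1][s] - 0.5 * (dev / sigma) ** 2 + emit
                if cand > score[k][t]:
                    score[k][t] = cand
                    back[k][t] = s

    # The last segment runs to the end of the signal; score its duration too.
    best_t, best = -1, neg_inf
    for t in range(n_frames):
        if score[n_seg - 1][t] == neg_inf:
            continue
        dev = (n_frames - t) - prior_durations[-1]
        total = score[n_seg - 1][t] - 0.5 * (dev / sigma) ** 2
        if total > best:
            best, best_t = total, t

    # Backtrack from the last onset to the first.
    onsets = [best_t]
    for k in range(n_seg - 1, 0, -1):
        onsets.append(back[k][onsets[-1]])
    return onsets[::-1]


# Toy ODF: 30 frames with clear peaks at frames 9 and 21.
odf = [0.01] * 30
odf[9] = 0.9
odf[21] = 0.9
onsets = infer_onsets(odf, [10, 10, 10])
```

In this toy case the prior suggests onsets near frames 10 and 20, and the inference moves them onto the nearby ODF peaks. A hierarchical variant in the spirit of the abstract could call the same function once on the syllable ODF and then again inside each inferred syllable on the phoneme ODF; no phoneme class labels are needed in either pass.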

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing