Audio to score matching by combining phonetic and duration information
Rong Gong, Jordi Pons, Xavier Serra

TL;DR
This paper matches jingju a cappella singing phrases to their scores using phonetic and duration information, arguing that melodic contour alone gives ambiguous matches because each jingju mode shares a basic melodic contour.
Contribution
It introduces a matching method for jingju a cappella singing that combines phonetic and duration information. It compares three acoustic models (CNNs, DNNs, and GMMs) and several duration models.
Findings
CNNs outperform DNNs and GMMs on small datasets
HSMM outperforms post-processor duration models
Combining phonetic and duration information improves matching accuracy
Abstract
We approach the singing phrase audio to score matching problem by using phonetic and duration information, with a focus on studying the jingju a cappella singing case. We argue that, due to the existence of a basic melodic contour for each mode in jingju music, only using melodic information (such as pitch contour) will result in an ambiguous matching. This leads us to propose a matching approach based on the use of phonetic and duration information. Phonetic information is extracted with an acoustic model shaped with our data, and duration information is considered with the Hidden Markov Model (HMM) variants we investigate. We build a model for each lyric path in our scores and we achieve the matching by ranking the posterior probabilities of the decoded most likely state sequences. Three acoustic models are investigated: (i) convolutional neural networks (CNNs), (ii) deep neural…
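The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain left-to-right HMM with a fixed self-transition probability (so durations are only implicitly geometric, unlike the HSMM variants the paper studies), and it ranks candidates by Viterbi log score rather than by a normalized posterior. The function names, the `stay_logp` parameter, the phone labels, and the toy frame probabilities are all hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi_left_to_right(frame_logprobs, phone_seq, stay_logp=math.log(0.5)):
    """Log score of the best alignment of the frames to one candidate phone sequence.

    frame_logprobs: one dict per audio frame, mapping phone label -> log-likelihood,
                    standing in for the output of an acoustic model (assumption).
    phone_seq:      phone labels of one candidate lyric line, in order.
    The path must start in the first phone and end in the last one.
    """
    move_logp = math.log(1.0 - math.exp(stay_logp))
    n = len(phone_seq)
    prev = [NEG_INF] * n
    prev[0] = frame_logprobs[0][phone_seq[0]]
    for obs in frame_logprobs[1:]:
        cur = [NEG_INF] * n
        for j in range(n):
            stay = prev[j] + stay_logp
            move = prev[j - 1] + move_logp if j > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                cur[j] = best + obs[phone_seq[j]]
        prev = cur
    return prev[-1]


def rank_candidates(frame_logprobs, candidates):
    """Score every candidate lyric line and sort from best to worst."""
    scored = [(name, viterbi_left_to_right(frame_logprobs, seq))
              for name, seq in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def frame(p_a, p_i, p_n):
    # Toy per-frame phone likelihoods (made-up numbers for illustration).
    return {"a": math.log(p_a), "i": math.log(p_i), "n": math.log(p_n)}


frames = [frame(0.8, 0.1, 0.1), frame(0.7, 0.2, 0.1),
          frame(0.1, 0.8, 0.1), frame(0.1, 0.1, 0.8)]
candidates = {
    "phrase_1": ["a", "i", "n"],
    "phrase_2": ["n", "a", "i"],
    "phrase_3": ["i", "n"],
}
ranking = rank_candidates(frames, candidates)
best_name, best_score = ranking[0]
```

In this toy setup the audio frames follow the phone order a, a, i, n, so `phrase_1` ranks first. A duration-aware variant would replace the fixed `stay_logp` with an explicit per-phone duration distribution, which is the role the HSMM plays in the paper.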
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
