Toward Expressive Singing Voice Correction: On Perceptual Validity of Evaluation Metrics for Vocal Melody Extraction
Yin-Jyun Luo, Yuen-Jen Lin, Li Su

TL;DR
This paper develops an expressive singing voice correction system that improves pitch and rhythm accuracy, and critically evaluates the perceptual validity of existing melody extraction metrics through subjective and objective studies.
Contribution
The paper extends a previous SVC framework by integrating singing voice separation (SVS) and vocal melody extraction, and assesses the perceptual validity of standard melody extraction metrics.
Findings
High scores on pitch accuracy metrics do not necessarily correspond to high perceptual quality.
The proposed system effectively corrects pitch and rhythm errors.
Standard melody extraction metrics may lack perceptual validity.
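To make the concern about metric validity concrete, here is a minimal sketch of raw pitch accuracy as it is commonly defined for melody extraction: the fraction of reference-voiced frames whose estimate lies within a fixed tolerance of the reference. The 50-cent tolerance, the convention that 0 Hz marks an unvoiced frame, and the function names are assumptions for illustration, not details taken from the paper.

```python
import math


def cents_between(f_est, f_ref):
    """Absolute pitch distance in cents between two frequencies in Hz."""
    return abs(1200.0 * math.log2(f_est / f_ref))


def raw_pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of reference-voiced frames estimated within the tolerance.

    Assumption: a frequency of 0 marks an unvoiced frame.
    """
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    correct = sum(
        1 for r, e in voiced
        if e > 0 and cents_between(e, r) <= tolerance_cents
    )
    return correct / len(voiced)


# A contour that is consistently about 40 cents flat still scores perfectly,
# because every frame falls inside the tolerance window.
reference = [220.0, 220.0, 246.94, 246.94]
flat_estimate = [f * 2 ** (-40 / 1200) for f in reference]
score = raw_pitch_accuracy(reference, flat_estimate)
```

Under these assumptions the metric cannot distinguish an exact estimate from one that is uniformly 40 cents off, which is one way a high score can coexist with an audible difference.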
Abstract
Singing voice correction (SVC) is an appealing application for amateur singers. Commercial products automate SVC by snapping pitch contours to equal-tempered scales, which could lead to deadpan modifications. Together with the neglect of rhythmic errors, extensive manual corrections are still necessary. In this paper, we present a streamlined system to automate expressive SVC for both pitch and rhythmic errors. Particularly, we extend a previous work by integrating advanced techniques for singing voice separation (SVS) and vocal melody extraction. SVC is achieved by temporally aligning the source-target pair, followed by replacing pitch and rhythm of the source with those of the target. We evaluate the framework by a comparative study for melody extraction which involves both subjective and objective evaluations, whereby we investigate perceptual validity of the standard metrics through…
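The align-then-replace step described in the abstract can be sketched as follows. The excerpt does not say which alignment algorithm the authors use, so plain dynamic time warping (DTW) over pitch contours is an assumption here, as are the MIDI-note pitch representation and the names `dtw_path` and `correction_plan`. Separation, melody extraction, and resynthesis are outside this sketch.

```python
def dtw_path(source, target):
    """Minimum-cost alignment path as (source_idx, target_idx) pairs."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(source[i - 1] - target[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def correction_plan(source_midi, target_midi):
    """For each target frame: (aligned source frame, pitch shift in semitones).

    Following the target's frame order imposes the target's rhythm; the
    shift moves the source pitch onto the target pitch.
    """
    plan = {}
    for i, j in dtw_path(source_midi, target_midi):
        # Keep the first source frame aligned to each target frame.
        plan.setdefault(j, (i, target_midi[j] - source_midi[i]))
    return [plan[j] for j in range(len(target_midi))]


# Hypothetical contours in MIDI note numbers: the source lingers on the
# first note and sings the second note 0.4 semitones sharp.
source = [60.0, 60.3, 62.4]
target = [60.0, 62.0]
plan = correction_plan(source, target)
```

Because the target contour is copied rather than quantized to a scale, expressive detail in the target (vibrato, glides) would carry over, which is the contrast the abstract draws with snapping to equal-tempered pitches.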
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
