Human Voice Pitch Estimation: A Convolutional Network with Auto-Labeled and Synthetic Data
Jeremy Cochoy

TL;DR
This paper introduces a convolutional neural network for human voice pitch estimation that leverages auto-labeled and synthetic data, achieving robust performance across diverse audio datasets in music and voice applications.
Contribution
It presents a novel CNN architecture trained on combined synthetic and auto-labeled data for improved pitch extraction from human singing voices.
Findings
Effective across synthetic and real-world datasets
Outperforms traditional pitch estimation methods
Robust in diverse singing and speech scenarios
Abstract
In the domain of music and sound processing, pitch extraction plays a pivotal role. Our research presents a specialized convolutional neural network designed for pitch extraction, particularly from the human singing voice in a cappella performances. Notably, our approach combines synthetic data with auto-labeled a cappella sung audio, creating a robust training environment. Evaluation across datasets comprising synthetic sounds, opera recordings, and time-stretched vowels demonstrates the network's efficacy. This work paves the way for enhanced pitch extraction in both music and voice settings.
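To make the role of synthetic data concrete: a synthetic sound has a pitch that is known by construction, so it needs no manual labeling. The sketch below is illustrative only and is not the paper's generation procedure, which this summary does not describe. The function names (`synth_harmonic_tone`, `hz_to_midi`), the 1/k harmonic roll-off, the sample rate, and the number of harmonics are all assumptions chosen for the example.

```python
import math

def synth_harmonic_tone(f0_hz, duration_s=0.1, sample_rate=16000, n_harmonics=5):
    """Return (samples, f0_label) for a harmonic tone whose pitch is known by construction.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        # Sum the harmonics k * f0 with a 1/k amplitude roll-off (an assumed timbre).
        value = sum(
            math.sin(2 * math.pi * k * f0_hz * t) / k
            for k in range(1, n_harmonics + 1)
        )
        samples.append(value)
    # Normalize to a peak amplitude of 1 so examples share a common scale.
    peak = max(abs(s) for s in samples) or 1.0
    samples = [s / peak for s in samples]
    return samples, f0_hz

def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number, with A4 = 440 Hz = 69."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)

# One labeled training example: the waveform and its ground-truth pitch.
samples, label = synth_harmonic_tone(220.0)
```

A training set built this way pairs each waveform with an exact pitch label, which is what lets synthetic data complement auto-labeled recordings whose labels come from an estimator and may contain errors.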
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
