Cross-domain Neural Pitch and Periodicity Estimation
Max Morrison, Caedon Hsieh, Nathan Pruyne, Bryan Pardo

TL;DR
This paper introduces techniques that make neural pitch and periodicity estimators fast, accurate, and able to handle speech and music with a single model, and it releases open-source tools that implement them.
Contribution
The paper presents methods for improving neural pitch estimators, including entropy-based periodicity extraction and cross-domain training, and reports state-of-the-art performance on both speech and music.
Findings
Achieves real-time speed on CPU and GPU
Outperforms existing pitch estimation methods
Provides open-source toolkit for the community
Abstract
Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of widely used neural pitch and periodicity estimators to achieve state-of-the-art performance on both speech and music. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle both speech and music data (i.e., cross-domain estimation) without performance degradation. Our estimator implementations run 11.2x faster than real-time on an Intel i9-9820X…
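The abstract names an entropy-based way to derive periodicity and voiced-unvoiced decisions from a pitch estimator's output distribution, but this listing does not give the formula. The sketch below shows one plausible reading, assuming periodicity is one minus the entropy of the per-frame distribution over pitch bins, normalized by the maximum possible entropy. The function names, the 0.5 threshold, and the use of raw logits as input are all illustrative assumptions, not the authors' implementation.

```python
import math


def entropy_periodicity(logits):
    """Map one frame of pitch-bin logits to a score in [0, 1].

    Assumed definition (not taken from the paper text shown here):
    periodicity = 1 - H(p) / log(number of bins).
    """
    if len(logits) < 2:
        raise ValueError("need at least two pitch bins")
    # Softmax, subtracting the max logit for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Shannon entropy in nats; bins with zero probability contribute nothing
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # A uniform distribution has entropy log(N), which maps to periodicity 0
    return 1.0 - entropy / math.log(len(probs))


def is_voiced(logits, threshold=0.5):
    """Hypothetical voiced-unvoiced decision: threshold the periodicity."""
    return entropy_periodicity(logits) > threshold


# A flat distribution (no clear pitch) versus a sharply peaked one
flat_frame = [0.0] * 8
peaked_frame = [20.0] + [0.0] * 7
print(entropy_periodicity(flat_frame), is_voiced(flat_frame))
print(entropy_periodicity(peaked_frame), is_voiced(peaked_frame))
```

Under this reading, a frame whose probability mass concentrates on one pitch bin scores near 1, while a frame with mass spread evenly across bins scores near 0. In practice the threshold would be tuned on held-out data rather than fixed at 0.5.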
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
