Nebula: F0 Estimation and Voicing Detection by Modeling the Statistical Properties of Feature Extractors
Kanru Hua

TL;DR
This paper introduces an F0 estimation and voicing detection method that models how feature extractors behave under noise, using Gaussian mixture models trained on synthetic data, and reports lower gross error rates than existing methods.
Contribution
The method models the behavior of feature extractors with GMMs trained on artificial data, rather than modeling speech signals directly, with the aim of improving robustness to noise.
Findings
Lower gross error rates than state-of-the-art methods on the CSTR and CMU Arctic speech databases.
Training on fully synthetic data is sufficient to achieve these results.
Improved F0 estimation accuracy.
Abstract
An F0 and voicing status estimation algorithm for high-quality speech analysis/synthesis is proposed. The problem is approached from a different perspective: modeling the behavior of feature extractors under noise instead of directly modeling speech signals. Under time-frequency locality assumptions, the joint distribution of extracted features and target F0 can be characterized by training a bank of Gaussian mixture models (GMMs) on artificial data generated from Monte-Carlo simulations. The trained GMMs are then used to generate a set of conditional distributions on the predicted F0, which are combined and post-processed by the Viterbi algorithm to give a final F0 trajectory. Evaluation on the CSTR and CMU Arctic speech databases shows that the proposed method, trained on fully synthetic data, achieves lower gross error rates than state-of-the-art methods.
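The final step in the abstract, smoothing per-frame F0 distributions with the Viterbi algorithm, can be illustrated with a minimal sketch. This is not the paper's implementation: the four-bin F0 grid, the distance-based transition model (`make_log_transition`, `jump_penalty`), and the toy posteriors are all assumptions chosen for illustration. It shows how a spurious peak in a single frame can be overridden by continuity constraints.

```python
import math

def make_log_transition(n_states, jump_penalty=1.0):
    """Hypothetical transition model: log-probability decays linearly
    with the distance between F0 bins, then each row is normalized."""
    rows = []
    for i in range(n_states):
        scores = [-jump_penalty * abs(i - j) for j in range(n_states)]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        rows.append([s - log_norm for s in scores])
    return rows

def viterbi(log_obs, log_trans):
    """Return the most likely state sequence for per-frame log scores.
    A uniform prior over the initial state is assumed."""
    n_states = len(log_trans)
    score = list(log_obs[0])
    backptr = []
    for frame in log_obs[1:]:
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor for state j in this frame.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + frame[j])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy per-frame distributions over 4 hypothetical F0 bins.
# The middle frame has a spurious peak at bin 3.
posteriors = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.40, 0.05, 0.45],
    [0.10, 0.70, 0.10, 0.10],
]
log_obs = [[math.log(p) for p in frame] for frame in posteriors]

framewise = [max(range(4), key=lambda k: frame[k]) for frame in posteriors]
path = viterbi(log_obs, make_log_transition(4, jump_penalty=1.0))
print(framewise)  # [1, 3, 1]
print(path)       # [1, 1, 1]
```

Picking the highest-scoring bin in each frame independently follows the spurious peak, while the Viterbi path stays on bin 1 because the assumed transition model penalizes the jump to bin 3 and back.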
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
