Nebula: F0 Estimation and Voicing Detection by Modeling the Statistical Properties of Feature Extractors

Kanru Hua

arXiv:1710.11317 · eess.AS · June 7, 2018 · 1 cites

TL;DR

This paper introduces an F0 estimation and voicing detection method that models how feature extractors behave under noise, using Gaussian mixture models trained on synthetic data. On the CSTR and CMU Arctic speech databases it achieves lower gross error rates than state-of-the-art methods.

Contribution

It models the behavior of feature extractors under noise with a bank of GMMs trained on artificial data from Monte-Carlo simulations, rather than modeling speech signals directly, with the aim of improving robustness.
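
As a rough illustration of how a GMM over features and F0 can yield a distribution on F0, the sketch below conditions a joint two-dimensional Gaussian mixture on an observed feature value. It is a minimal sketch under stated assumptions, not the paper's implementation: the single scalar feature, the component tuple layout, and the names `gauss_pdf`, `condition_gmm`, and `conditional_mean` are all hypothetical.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def condition_gmm(components, x):
    """Condition a joint 2-D GMM over (feature, target) on feature == x.

    components: list of (weight, mu_x, mu_y, s_xx, s_xy, s_yy) tuples,
                an assumed layout chosen for this illustration.
    Returns a list of (weight, mean, var) describing the mixture p(y | x).
    """
    out = []
    for w, mu_x, mu_y, s_xx, s_xy, s_yy in components:
        # Unnormalized responsibility: prior weight times marginal likelihood of x.
        resp = w * gauss_pdf(x, mu_x, s_xx)
        # Standard conditional mean and variance of a bivariate Gaussian.
        mean = mu_y + s_xy / s_xx * (x - mu_x)
        var = s_yy - s_xy ** 2 / s_xx
        out.append((resp, mean, var))
    total = sum(r for r, _, _ in out)
    return [(r / total, m, v) for r, m, v in out]


def conditional_mean(cond):
    """Mean of the conditional mixture returned by condition_gmm."""
    return sum(w * m for w, m, _ in cond)
```

Here the conditional mixture weights are the prior weights rescaled by how well each component explains the observed feature, so components far from the observation contribute almost nothing to the predicted target.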

Findings

01

Lower gross error rates than state-of-the-art methods.

02

Effective training on synthetic data.

03

Improved F0 estimation accuracy.

Abstract

An F0 and voicing status estimation algorithm for high quality speech analysis/synthesis is proposed. This problem is approached from a different perspective that models the behavior of feature extractors under noise, instead of directly modeling speech signals. Under time-frequency locality assumptions, the joint distribution of extracted features and target F0 can be characterized by training a bank of Gaussian mixture models (GMMs) on artificial data generated from Monte-Carlo simulations. The trained GMMs can then be used to generate a set of conditional distributions on the predicted F0, which are then combined and post-processed by the Viterbi algorithm to give a final F0 trajectory. Evaluation on the CSTR and CMU Arctic speech databases shows that the proposed method, trained on fully synthetic data, achieves lower gross error rates than state-of-the-art methods.
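
The Viterbi post-processing step can be sketched as a dynamic program over a grid of F0 candidates. This is only an illustrative sketch, assuming per-frame log-probabilities over a fixed candidate grid and a transition cost proportional to the jump in octaves. The function name `viterbi_f0`, the `jump_penalty` parameter, and the transition model are hypothetical and not taken from the paper, and voicing decisions are left out.

```python
import math


def viterbi_f0(log_post, f0_grid, jump_penalty=2.0):
    """Choose one F0 candidate per frame with Viterbi decoding.

    log_post: list of frames, each a list of log-probabilities over f0_grid.
    f0_grid: candidate F0 values in Hz.
    jump_penalty: assumed cost per octave of frame-to-frame jump
                  (a hypothetical parameter for this sketch).
    Returns the F0 value selected for each frame.
    """
    n_cand = len(f0_grid)
    log_f0 = [math.log2(f) for f in f0_grid]
    # trans[i][j]: log-domain score for moving from candidate i to candidate j.
    trans = [[-jump_penalty * abs(log_f0[i] - log_f0[j]) for j in range(n_cand)]
             for i in range(n_cand)]

    score = list(log_post[0])  # best path score ending at each candidate
    back = []                  # back[t-1][j]: best predecessor of j at frame t
    for frame in log_post[1:]:
        new_score, ptr = [], []
        for j in range(n_cand):
            best_i = max(range(n_cand), key=lambda i: score[i] + trans[i][j])
            new_score.append(score[best_i] + trans[best_i][j] + frame[j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)

    # Backtrack from the best final candidate to recover the full path.
    j = max(range(n_cand), key=lambda k: score[k])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [f0_grid[k] for k in path]
```

With a positive `jump_penalty`, an isolated frame that slightly favors a candidate one octave away is pulled back to the trajectory of its neighbors, whereas with `jump_penalty=0.0` the decoder reduces to a per-frame argmax.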

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing