Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks
Akihiro Kato, Tomi Kinnunen

TL;DR
This paper introduces a waveform-to-sinusoid regression method using recurrent neural networks for robust F0 estimation in noisy speech, outperforming existing approaches especially at low SNRs.
Contribution
It proposes a novel waveform-to-sinusoid regression approach with RNNs for noise-robust F0 estimation, achieving higher accuracy than classification-based DNN methods.
Findings
Improves gross pitch error (GPE) rate and fine pitch error (FPE) by more than 35% at SNRs from -10 dB to +10 dB.
Outperforms state-of-the-art DNN-based F0 trackers by more than 15%.
Demonstrates robustness across various noise conditions.
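The two metrics quoted above can be sketched as follows. This is a minimal illustration that assumes the conventional definitions, which the summary itself does not spell out: GPE rate is the percentage of voiced frames whose estimate deviates from the reference by more than 20%, and FPE is the standard deviation of the relative error (in percent) over the remaining voiced frames. The function name `gpe_fpe` and the 20% default are assumptions made for this example, not taken from the paper.

```python
import numpy as np

def gpe_fpe(f0_ref, f0_est, threshold=0.2):
    """Return (GPE rate in %, FPE in %) over frames voiced in the reference.

    Assumed convention: a frame is voiced when its reference F0 is > 0,
    and an error is "gross" when the relative deviation exceeds `threshold`.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)

    voiced = f0_ref > 0
    rel_err = (f0_est[voiced] - f0_ref[voiced]) / f0_ref[voiced]

    # Gross errors: relative deviation larger than the threshold.
    gross = np.abs(rel_err) > threshold
    gpe = 100.0 * gross.mean() if rel_err.size else float("nan")

    # Fine errors: spread of the relative error on the non-gross frames.
    fine = rel_err[~gross]
    fpe = 100.0 * fine.std() if fine.size else float("nan")
    return gpe, fpe

# Hypothetical contours in Hz; 0 marks an unvoiced reference frame.
reference = [100.0, 100.0, 100.0, 100.0, 0.0]
estimate = [100.0, 110.0, 90.0, 150.0, 80.0]
gpe, fpe = gpe_fpe(reference, estimate)
```

Only frames voiced in the reference are scored here, so voicing decision errors are not counted; an evaluation that also scores voicing would need a separate measure.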
Abstract
The fundamental frequency (F0) represents the pitch of speech, determines its prosodic characteristics, and is needed in various speech analysis and synthesis tasks. Despite decades of research on this topic, F0 estimation at low signal-to-noise ratios (SNRs) in unexpected noise conditions remains difficult. This work proposes a new approach to noise-robust F0 estimation using a recurrent neural network (RNN) trained in a supervised manner. Recent studies employ deep neural networks (DNNs) for F0 tracking as a frame-by-frame classification task into quantised frequency states, but we propose waveform-to-sinusoid regression instead to achieve both noise robustness and accurate estimation with increased frequency resolution. Experimental results with the PTDB-TUG corpus contaminated by additive noise (NOISEX-92) demonstrate that the proposed method improves gross pitch error (GPE)…
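To make the regression target concrete, the sketch below builds a single sinusoid at a given F0 for one frame and then reads a continuous frequency back from it. It does not include the RNN, and the read-out step (a zero-padded FFT peak) is a stand-in chosen for this example; the summary does not say how the paper converts the output sinusoid into an F0 value. The function names, the 16 kHz sample rate, and the 1024-sample frame are all assumptions.

```python
import numpy as np

def sinusoid_target(f0_hz, n_samples, sample_rate=16000, phase=0.0):
    """Hypothetical regression target: one sinusoid at the frame's F0."""
    t = np.arange(n_samples) / sample_rate
    return np.sin(2.0 * np.pi * f0_hz * t + phase)

def f0_from_sinusoid(frame, sample_rate=16000, n_fft=1 << 16):
    """Estimate the frequency of a (near-)sinusoidal frame.

    Stand-in read-out: Hann-window the frame, zero-pad the FFT for a
    fine frequency grid, and take the strongest non-DC bin.
    """
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n=n_fft))
    peak_bin = int(np.argmax(spectrum[1:])) + 1  # skip the DC bin
    return peak_bin * sample_rate / n_fft

# A 64 ms frame at 16 kHz with an arbitrary, non-quantised F0.
frame = sinusoid_target(123.4, n_samples=1024)
estimate_hz = f0_from_sinusoid(frame)
```

Because the target is a waveform rather than a class label, the recoverable frequency is not tied to a fixed set of quantised states, which is the resolution argument the abstract makes against classification-based trackers.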
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
