Formant Estimation and Tracking using Probabilistic Heat-Maps
Yosi Shrem, Felix Kreuk, Joseph Keshet

TL;DR
This paper introduces a novel neural network architecture that uses probabilistic heatmaps for more accurate and domain-invariant formant estimation and tracking across diverse speech datasets.
Contribution
A new multi-decoder neural network with shared encoder and heatmap outputs that improves formant estimation across different speaker and speech domains.
Findings
Enhanced formant tracking accuracy across multiple domains
Better domain generalization compared to existing methods
Heatmap-based probability distributions improve robustness
Abstract
Formants are the spectral maxima that result from acoustic resonances of the human vocal tract, and their accurate estimation is among the most fundamental speech processing problems. Recent work has shown that these frequencies can be estimated accurately using deep learning techniques. However, when presented with speech from a domain different from the one on which they were trained, these methods exhibit a decline in performance, limiting their usage as generic tools. The contribution of this paper is a new network architecture that performs well on a variety of speaker and speech domains. The proposed model is composed of a shared encoder that takes a spectrogram as input and outputs a domain-invariant representation. Multiple decoders then further process this representation, each responsible for predicting a different formant while considering…
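The abstract excerpt is cut off before it explains how a heatmap is turned into a frequency estimate, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a decoder emits one raw score per frequency bin for each frame, that a softmax turns those scores into a probability distribution, and that the formant is read off either as the most probable bin or as the probability-weighted mean. The names `decode_formant`, `bin_hz`, and `frame_scores`, along with the bin width and score values, are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-bin scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_formant(scores, bin_hz):
    """Return (argmax_hz, expected_hz) for one frame of heatmap scores.

    Assumption: bin i covers [i * bin_hz, (i + 1) * bin_hz) and is
    represented by its center frequency.
    """
    probs = softmax(scores)
    centers = [(i + 0.5) * bin_hz for i in range(len(probs))]
    best = max(range(len(probs)), key=probs.__getitem__)
    expected = sum(p * c for p, c in zip(probs, centers))
    return centers[best], expected


# Hypothetical decoder output for one frame: 8 bins of 500 Hz each.
frame_scores = [0.1, 0.3, 4.0, 1.0, 0.2, 0.0, 0.0, 0.0]
argmax_hz, expected_hz = decode_formant(frame_scores, bin_hz=500.0)
print(argmax_hz, round(expected_hz, 1))
```

Under these assumptions, the argmax readout snaps to a bin center, while the expected-value readout can fall between bins and shifts when probability mass is spread over neighboring frequencies. In a multi-decoder setup like the one the abstract describes, each decoder would presumably produce its own row of scores for its own formant.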
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
