DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding
Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li

TL;DR
DeepA is a neural vocoder built on a VAE architecture that extracts interpretable F0 and timbre features from speech. It is more accurate than traditional vocoders and generalizes from speech to singing.
Contribution
This is the first neural framework designed to extract vocoder-like parameters that are both interpretable and more accurate for signal analysis and manipulation.
Findings
DeepA outperforms WORLD in F0 estimation.
DeepA generalizes from speech to singing.
First neural vocoder-like parameter extractor.
Abstract
Conventional vocoders are commonly used as analysis tools to provide interpretable features for downstream tasks such as speech synthesis and voice conversion. Because they are built on signal-processing principles that make specific assumptions about the signal, they do not generalize easily to different kinds of audio, for example from speech to singing. In this paper, we propose a deep neural analyzer, denoted DeepA: a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech that emulate those defined in conventional vocoders. The resulting parameters are therefore more interpretable than other latent neural representations. At the same time, because the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and to generalize from speech to singing. The proposed neural analyzer is built based on a…
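To make concrete what an F0 feature is, and what kind of signal assumption a conventional analyzer relies on, here is a minimal sketch of a textbook autocorrelation pitch estimator. It is neither DeepA's method nor WORLD's algorithm. The function name, the 60-500 Hz search range, and the synthetic test tone are assumptions chosen only for illustration. The estimator assumes the frame is roughly periodic, which is the sort of built-in assumption the abstract says limits generalization.

```python
import math

def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=500.0):
    """Hypothetical helper: estimate F0 (Hz) of one frame from its autocorrelation peak.

    Returns None for a silent frame or one too short for the search range.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]          # remove DC offset
    if sum(s * s for s in x) == 0.0:
        return None                        # silent frame: no pitch
    lag_min = int(sample_rate / f0_max)    # shortest period considered
    lag_max = min(int(sample_rate / f0_min), n - 1)
    if lag_max < lag_min:
        return None
    best_lag, best_score = None, -float("inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag
        score = sum(x[i] * x[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Illustrative input: a pure 220 Hz tone sampled at 16 kHz (an assumption, not data from the paper)
sample_rate = 16000
frame = [math.sin(2 * math.pi * 220.0 * i / sample_rate) for i in range(1024)]
f0 = estimate_f0_autocorr(frame, sample_rate)
print(f0)  # close to 220 Hz, limited by integer-lag resolution
```

On a clean tone the estimate lands near the true pitch. On real singing, with vibrato, wide pitch range, and strong harmonics, a fixed rule like this tends to make octave and voicing errors, which is the motivation for a learnable analyzer.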
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
