Mathematical Vocoder Algorithm: Modified Spectral Inversion for Efficient Neural Speech Synthesis
Hyun Gon Ryu, Jeong-Hoon Kim, Simon See

TL;DR
This paper introduces a novel mathematical vocoder algorithm using modified spectral inversion that synthesizes high-fidelity speech efficiently without neural network training, enabling fast, language-agnostic speech synthesis.
Contribution
It presents a non-data-driven vocoder method that bypasses neural training, allowing rapid, high-quality speech synthesis applicable across languages and voices.
Findings
Synthesizes speech at approximately 20 MHz on CPU and 59.6 MHz on GPU (that is, millions of audio samples per second)
Runs 909x (CPU) and 2,702x (GPU) faster than real time
Applicable to unseen voices and multiple languages without retraining
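The link between the synthesis rates and the real-time speedups above can be checked with a line of arithmetic: the speedup is the number of samples generated per second divided by the audio sample rate. The sketch below assumes a 22,050 Hz output sample rate, which the source does not state; the function and constant names are illustrative only.

```python
# Assumption: 22,050 Hz output sample rate (not stated in the source).
ASSUMED_SAMPLE_RATE_HZ = 22_050


def real_time_factor(samples_per_second: float,
                     sample_rate_hz: float = ASSUMED_SAMPLE_RATE_HZ) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return samples_per_second / sample_rate_hz


# Reported synthesis rates: about 20 MHz on CPU, 59.6 MHz on GPU.
cpu_rtf = real_time_factor(20e6)
gpu_rtf = real_time_factor(59.6e6)

print(f"CPU: about {cpu_rtf:.0f}x real time")
print(f"GPU: about {gpu_rtf:.0f}x real time")
```

Under that assumed sample rate, both values land within a fraction of a percent of the reported 909x and 2,702x; the small gap on the CPU side is consistent with the "approximately 20 MHz" rounding.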
Abstract
In this work, we propose a new mathematical vocoder algorithm (modified spectral inversion) that generates a waveform from acoustic features without phase estimation. The main benefit of the proposed method is that it removes the neural vocoder training stage from the end-to-end speech synthesis model. Our implementation can synthesize high-fidelity speech at approximately 20 MHz on CPU and 59.6 MHz on GPU, which is 909 and 2,702 times faster than real time, respectively. Since the proposed method is not data-driven, it is applicable to unseen voices and multiple languages without any additional work. We expect the proposed method to be useful for research on neural network models capable of synthesizing speech at studio-recording quality.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
