TL;DR
MFCCGAN introduces an adversarial learning-based speech synthesizer that converts MFCC features into high-quality raw speech waveforms, outperforming traditional rule-based methods in intelligibility and naturalness.
Contribution
The paper presents MFCCGAN, a novel GAN-based model that synthesizes raw speech waveforms directly from MFCCs and improves on rule-based baselines (Librosa MFCC inversion and the WORLD vocoder) in quality and intelligibility.
Findings
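Outperforms Librosa MFCC inversion by about 26% to 53% in STOI and 16% to 78% in NISQA score
Achieves about 10% higher intelligibility and about 4% higher naturalness than the WORLD vocoder, which additionally requires data such as F0
A STOI-based perceptual loss in the discriminators could further improve speech quality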
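The percentage figures above can be read as relative gains over a baseline score. The sketch below shows that arithmetic, assuming the paper reports relative (not absolute) increases, which the abstract does not state explicitly. The function name `relative_gain` and both example scores are hypothetical and are not taken from the paper.

```python
def relative_gain(baseline: float, proposed: float) -> float:
    """Return the relative increase of `proposed` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (proposed - baseline) / baseline


# Hypothetical STOI scores, chosen only to illustrate the formula.
baseline_stoi = 0.60   # e.g. a rule-based MFCC inversion (made-up value)
proposed_stoi = 0.78   # e.g. a learned synthesizer (made-up value)

gain = relative_gain(baseline_stoi, proposed_stoi)
print(f"relative STOI gain: {gain:.1f}%")
```

Under this reading, a baseline STOI of 0.60 and a proposed STOI of 0.78 correspond to a 30% relative gain, even though the absolute difference is only 0.18.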
Abstract
In this paper, we introduce MFCCGAN, a novel speech synthesizer based on adversarial learning that takes MFCCs as input and generates raw speech waveforms. By leveraging the capabilities of GAN models, it produces speech with higher intelligibility than WORLD, a rule-based MFCC-based speech synthesizer. We evaluated the model using a popular intrusive objective speech intelligibility measure (STOI) and a quality measure (NISQA score). Experimental results show that our proposed system outperforms Librosa MFCC inversion (by an increase of about 26% up to 53% in STOI and 16% up to 78% in NISQA score) and achieves a rise of about 10% in intelligibility and about 4% in naturalness compared with the conventional rule-based vocoder WORLD, which is used in the CycleGAN-VC family. Note, however, that WORLD needs additional data such as F0. Finally, using perceptual loss in discriminators based on STOI could improve the quality…
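To make concrete why turning MFCCs back into speech is hard for rule-based methods, the sketch below illustrates one lossy step in a typical MFCC pipeline: the truncated DCT over log-mel bands. It is a generic NumPy illustration, not the paper's model and not Librosa's implementation. The helper names, the 40-band and 13-coefficient sizes, and the random stand-in for a log-mel spectrogram are all assumptions made for the example.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]          # output (cepstral) index
    m = np.arange(n)[None, :]          # input (mel band) index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    mat[0, :] /= np.sqrt(2.0)          # scale first row for orthonormality
    return mat


def log_mel_to_mfcc(log_mel: np.ndarray, n_mfcc: int) -> np.ndarray:
    """Keep the first n_mfcc DCT coefficients of each frame."""
    return dct_matrix(log_mel.shape[0])[:n_mfcc] @ log_mel


def mfcc_to_log_mel(mfcc: np.ndarray, n_mels: int) -> np.ndarray:
    """Invert the truncated DCT; discarded coefficients are treated as zero."""
    return dct_matrix(n_mels)[: mfcc.shape[0]].T @ mfcc


# Random stand-in for a (n_mels, n_frames) log-mel spectrogram.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(40, 5))

# Keeping all 40 coefficients is invertible; keeping 13 is not.
full = mfcc_to_log_mel(log_mel_to_mfcc(log_mel, 40), 40)
trunc = mfcc_to_log_mel(log_mel_to_mfcc(log_mel, 13), 40)

err_full = np.abs(full - log_mel).max()
err_trunc = np.abs(trunc - log_mel).max()
print(f"max error, 40 coefficients: {err_full:.2e}")
print(f"max error, 13 coefficients: {err_trunc:.2e}")
```

With every coefficient kept, the reconstruction matches the input up to floating-point error. With only 13 coefficients, the fine spectral detail is lost, and a real pipeline would still have to recover phase and pitch on top of that. This is the kind of missing information a learned generator has to fill in.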
