RawNet: Fast End-to-End Neural Vocoder
Yunchao He, Yujun Wang

TL;DR
RawNet is an end-to-end neural vocoder that automatically learns feature extraction and speech synthesis directly from raw audio, achieving high quality and faster inference without relying on handcrafted spectral features.
Contribution
It introduces RawNet, a fully end-to-end neural vocoder that jointly trains a coder and an autoregressive voder on raw waveforms, eliminating the need for manual feature extraction.
Findings
Achieves better speech quality with a simplified model architecture.
Generates speech faster at inference time.
Works for both speaker-dependent and speaker-independent synthesis.
Abstract
Neural network-based vocoders have recently demonstrated a powerful ability to synthesize high-quality speech. These models usually generate samples by conditioning on spectral features, such as the Mel-spectrogram and fundamental frequency, which are crucial to speech synthesis. However, the feature extraction process tends to depend heavily on human knowledge, resulting in a less expressive description of the original audio. In this work, we propose RawNet, a complete end-to-end neural vocoder following the auto-encoder structure for speaker-dependent and speaker-independent speech synthesis. It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher-level representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner. The coder and voder are jointly trained directly…
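The coder/voder split described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the layer shapes, the frame size, the context length, and the names `make_weights`, `coder`, and `voder` are all assumptions made for illustration, and the random weights stand in for parameters that the paper learns by joint training. The sketch only shows the data flow: raw samples go in, the coder emits one feature vector per frame, and the voder regenerates audio one sample at a time from its own previous outputs plus the current frame's features.

```python
import math
import random


def make_weights(rows, cols, rng):
    """Small random weight matrix; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def coder(audio, frame_size, w_enc):
    """Map raw samples to one feature vector per frame (hypothetical layer)."""
    n_frames = len(audio) // frame_size
    features = []
    for f in range(n_frames):
        frame = audio[f * frame_size:(f + 1) * frame_size]
        features.append(
            [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in w_enc]
        )
    return features


def voder(features, frame_size, w_hist, w_cond):
    """Generate audio sample by sample, conditioned on the frame features."""
    context = len(w_hist)
    out = [0.0] * context  # zero padding serves as the initial history
    for feat in features:
        # Conditioning term is fixed for every sample within one frame.
        cond = sum(w * h for w, h in zip(w_cond, feat))
        for _ in range(frame_size):
            history = out[-context:]  # previously generated samples
            hist = sum(w * s for w, s in zip(w_hist, history))
            out.append(math.tanh(hist + cond))
    return out[context:]  # drop the padding


# Assumed sizes: 80-sample frames, 8 features per frame, 16 samples of context.
rng = random.Random(0)
frame_size, feat_dim, context = 80, 8, 16
audio = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(800)]

w_enc = make_weights(feat_dim, frame_size, rng)
w_hist = make_weights(1, context, rng)[0]
w_cond = make_weights(1, feat_dim, rng)[0]

features = coder(audio, frame_size, w_enc)
restored = voder(features, frame_size, w_hist, w_cond)
```

With untrained weights the output is not a meaningful reconstruction. In a trained system, both sets of weights would be optimized together so that `restored` matches `audio`, which is what lets the learned features replace handcrafted spectral ones.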
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
