RawNet: Fast End-to-End Neural Vocoder

Yunchao He; Yujun Wang

arXiv:1904.05351 · eess.AS · March 13, 2023 · 1 cites

TL;DR

RawNet is an end-to-end neural vocoder that automatically learns feature extraction and speech synthesis directly from raw audio, achieving high quality and faster inference without relying on handcrafted spectral features.

Contribution

It introduces RawNet, a fully end-to-end neural vocoder that jointly trains a coder network and an autoregressive voder network on raw waveforms, eliminating the need for manual feature extraction.

Findings

01 Achieves better speech quality with a simplified model architecture.

02 Provides faster speech generation at the inference stage.

03 Operates effectively for both speaker-dependent and speaker-independent synthesis.

Abstract

Neural network-based vocoders have recently demonstrated a powerful ability to synthesize high-quality speech. These models usually generate samples by conditioning on spectral features, such as the Mel-spectrogram and fundamental frequency, which are crucial to speech synthesis. However, the feature extraction process tends to depend heavily on human knowledge, resulting in a less expressive description of the original audio. In this work, we propose RawNet, a complete end-to-end neural vocoder following the auto-encoder structure for speaker-dependent and speaker-independent speech synthesis. It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher-level representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner. The coder and voder are jointly trained directly…
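
The coder/voder split described in the abstract can be illustrated with a toy sketch. This is not the paper's model: the excerpt gives no layer types, sizes, or output distribution, so the frame length, feature size, context length, single-layer tanh networks, and every function name below are assumptions chosen only to show the data flow (raw audio in, one learned feature vector per frame, audio regenerated one sample at a time). The weights are random and untrained; joint training of both parts against a reconstruction objective is not shown.

```python
import numpy as np

FRAME = 160     # samples per frame (assumed, e.g. 10 ms at 16 kHz)
FEAT_DIM = 8    # size of the learned per-frame feature (assumed)
CONTEXT = 4     # number of past samples the voder sees (assumed)


def init_params(seed=0):
    """Random, untrained weights for the hypothetical coder and voder."""
    rng = np.random.default_rng(seed)
    return {
        "coder_w": rng.normal(0.0, 0.1, (FRAME, FEAT_DIM)),
        "voder_feat_w": rng.normal(0.0, 0.1, FEAT_DIM),
        "voder_ctx_w": rng.normal(0.0, 0.1, CONTEXT),
    }


def coder(audio, params):
    """Map raw audio to one feature vector per frame (no handcrafted features)."""
    n_frames = len(audio) // FRAME
    frames = audio[: n_frames * FRAME].reshape(n_frames, FRAME)
    return np.tanh(frames @ params["coder_w"])  # shape: (n_frames, FEAT_DIM)


def voder(features, params):
    """Regenerate audio sample by sample, conditioned on the frame features."""
    out = np.zeros(len(features) * FRAME)
    ctx = np.zeros(CONTEXT)  # most recent generated samples, newest first
    for t in range(len(out)):
        cond = features[t // FRAME] @ params["voder_feat_w"]
        sample = np.tanh(cond + ctx @ params["voder_ctx_w"])
        out[t] = sample
        # Autoregressive step: feed the new sample back as context.
        ctx = np.roll(ctx, 1)
        ctx[0] = sample
    return out


params = init_params(seed=0)
audio = np.sin(2 * np.pi * 220 * np.arange(1600) / 16000)  # 0.1 s test tone
features = coder(audio, params)   # (10, 8): ten frames of learned features
recon = voder(features, params)   # (1600,): one output per input sample
```

The per-sample loop in `voder` is what makes the generation autoregressive: each output depends on the samples generated before it, which is also why inference speed is a concern for vocoders of this kind.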

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing