BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation

Hui-Peng Du; Ye-Xin Lu; Yang Ai; Zhen-Hua Ling

arXiv:2406.02162·eess.AS·June 5, 2024

Open Access

TL;DR

This paper introduces BiVocoder, a bidirectional neural vocoder that performs both feature extraction and waveform generation in the short-time Fourier transform (STFT) domain, targeting a balance between synthesized speech quality and inference speed.

Contribution

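BiVocoder unifies feature extraction and waveform generation in one bidirectional network: convolutional neural networks map STFT amplitude and phase spectra to long-frame-shift, low-dimensional features, and a symmetric network restores the spectra for inverse STFT. The extracted features can be predicted directly by acoustic models, which supports use in TTS.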

Findings

01

Outperforms baseline vocoders in speech quality

02

Supports efficient analysis-synthesis and TTS tasks

03

Balances speech quality with inference speed

Abstract

This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable of both feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, BiVocoder takes the amplitude and phase spectra derived from the STFT as inputs and transforms them into long-frame-shift, low-dimensional features through convolutional neural networks. The extracted features are shown to be suitable for direct prediction by acoustic models, supporting their application in text-to-speech (TTS) tasks. For waveform generation, BiVocoder restores the amplitude and phase spectra from the features with a symmetric network, followed by an inverse STFT to reconstruct the speech waveform. Experimental results show that the proposed BiVocoder achieves better performance than some baseline vocoders, by comprehensively considering both synthesized…
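
To make the STFT-domain framing in the abstract concrete, the following is a minimal NumPy sketch of the fixed signal-processing steps that surround the learned networks: a waveform is split into amplitude and phase spectra by an STFT, and the waveform is rebuilt from those two spectra by an inverse STFT. It is not the authors' implementation, and the learned convolutional encoder and symmetric decoder are deliberately left out. The function names `stft` and `istft`, the Hann window, the FFT size of 512, the hop of 128, and the 16 kHz test tone are all assumptions chosen for illustration; the abstract does not state the paper's settings.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Framed, Hann-windowed FFT. Returns complex array (frames, n_fft//2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]        # periodic Hann window (assumed)
    x = np.pad(x, (n_fft, n_fft))              # pad so every sample is fully overlapped
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=512, hop=128):
    """Weighted overlap-add inverse of `stft`, trimmed to `length` samples."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)              # undo the squared-window weighting
    return out[n_fft : n_fft + length]         # drop the padding added in stft

# Hypothetical one-second test signal at 16 kHz: a tone plus a little noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 220.0 * t) + 0.01 * rng.standard_normal(t.size)

# Analysis: the two spectra that a vocoder of this kind would take as input.
spec = stft(x)
amplitude, phase = np.abs(spec), np.angle(spec)

# Synthesis: recombine the spectra and invert. A learned decoder would
# supply predicted amplitude and phase here instead of the exact ones.
y = istft(amplitude * np.exp(1j * phase), length=x.size)
max_err = np.max(np.abs(x - y))
```

With exact amplitude and phase spectra the round trip reproduces the input up to floating-point error, so in a system of this kind any loss would come from the learned feature bottleneck rather than from the transform itself.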

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Neural Networks and Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings