Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder

Cristina Gârbacea; Aäron van den Oord; Yazhe Li; Felicia S C Lim; Alejandro Luebs; Oriol Vinyals; Thomas C Walters

arXiv:1910.06464 · cs.LG · October 16, 2019

TL;DR

This paper presents a neural speech codec built from a VQ-VAE with a WaveNet decoder. Coding speech at 1.6 kbps, it reaches perceptual quality roughly halfway between MELP at 2.4 kbps and AMR-WB at 23.05 kbps.

Contribution

The work introduces a neural network architecture for very low bit-rate speech coding, a VQ-VAE with a WaveNet decoder, that is prosody-transparent and speaker-independent while keeping high reconstruction quality.

Findings

01

A speaker-independent model trained on LibriSpeech and coding at 1.6 kbps achieves perceptual quality roughly halfway between MELP at 2.4 kbps and AMR-WB at 23.05 kbps.

02

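When trained on high-quality recorded speech with the test speaker included in the training set, a model coding at 1.6 kbps produces output of similar perceptual quality to AMR-WB at 23.05 kbps.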

03

Demonstrates that a VQ-VAE with a WaveNet decoder can perform very low bit-rate speech coding with high reconstruction quality.

Abstract

In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.
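To make the bit-rate figures in the abstract concrete, here is a minimal sketch of how a vector-quantized bottleneck turns into a bit-rate, together with a toy nearest-neighbour quantizer of the kind a VQ-VAE uses. The frame rate, number of codes per frame, and codebook size below are hypothetical values picked only so that the arithmetic lands on 1.6 kbps; they are not taken from the paper. The function names `bitrate_bps` and `quantize` are likewise illustrative.

```python
import math


def bitrate_bps(frames_per_second, codes_per_frame, codebook_size):
    """Bits per second needed to transmit discrete codebook indices."""
    bits_per_code = math.log2(codebook_size)
    return frames_per_second * codes_per_frame * bits_per_code


def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Hypothetical bottleneck settings (assumptions, not the paper's configuration):
# 25 latent frames per second, 8 codes per frame, 256-entry codebooks (8 bits each).
rate = bitrate_bps(frames_per_second=25, codes_per_frame=8, codebook_size=256)

# Ratio between the AMR-WB rate quoted in the abstract and the sketch's rate.
amr_wb_ratio = 23050 / rate

# Toy 2-D codebook: the encoder output is replaced by its nearest entry's index,
# and only that index needs to be transmitted.
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
index = quantize((0.9, 0.8), toy_codebook)

print(rate, amr_wb_ratio, index)
```

Under these assumed settings the sketch yields 1600 bits per second, about 14 times lower than the 23.05 kbps AMR-WB rate quoted above. In the codec described here, the receiver would pass the transmitted indices to a WaveNet decoder to resynthesize the waveform; that step is not modelled in this sketch.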

Taxonomy

Methods: Test · Mixture of Logistic Distributions · VQ-VAE · Dilated Causal Convolution · WaveNet