Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, and Jan Skoglund

arXiv:2207.02262 · cs.SD · July 7, 2022

TL;DR

This paper introduces a neural speech codec that pairs a pretrained Transformer with a convolutional encoder. At an ultra-low bitrate of 600 bps, it synthesizes higher-quality speech than the original neural codec, and listeners rate it comparable to or better than conventional codecs operating at higher bitrates.
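
To make the 600 bps figure concrete, the following arithmetic sketch shows how many bits are available per frame at that bitrate. The frame rates are hypothetical values chosen for illustration, not settings taken from the paper, and the helper names are invented here.

```python
def bits_per_frame(bitrate_bps: float, frames_per_second: float) -> float:
    """Average number of bits available to encode one frame."""
    return bitrate_bps / frames_per_second


def codebook_entries(bits: int) -> int:
    """Number of distinct codes addressable with a whole number of bits."""
    return 2 ** bits


BITRATE_BPS = 600  # bitrate reported in the summary above

# Hypothetical frame rates (assumption, for illustration only).
for fps in (25, 50, 100):
    bits = bits_per_frame(BITRATE_BPS, fps)
    print(f"{fps} frames/s -> {bits:g} bits per frame, "
          f"up to {codebook_entries(int(bits))} distinct codes")
```

The trade-off this exposes is that a lower frame rate leaves more bits per frame, but each frame must then summarize a longer stretch of speech.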

Contribution

It presents a novel neural speech codec architecture that leverages pretrained Transformers for improved long-range dependency modeling at very low bitrates.
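
As a rough illustration of why long-range modeling matters, the sketch below computes the receptive field of a stack of 1-D convolutions using the standard recurrence. The layer configuration is hypothetical and is not the encoder described in the paper; self-attention, by contrast, can attend to every frame in its input window.

```python
from typing import Iterable, Tuple

# Each layer is (kernel_size, stride, dilation).
Layer = Tuple[int, int, int]


def receptive_field(layers: Iterable[Layer]) -> int:
    """Number of input samples that influence one output of the stack."""
    rf = 1    # a single output initially "sees" one sample
    jump = 1  # spacing, in input samples, between adjacent outputs so far
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical encoder: four layers, kernel 3, stride 2, no dilation.
toy_encoder = [(3, 2, 1)] * 4
print(receptive_field(toy_encoder))
```

Adding layers, strides, or dilation widens the receptive field, but it remains fixed by the architecture rather than adapting to the input.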

Findings

01. Achieves 600 bps speech coding with better quality than the original neural codec.

02. Subjective evaluations show quality comparable or superior to that of conventional codecs operating at higher bitrates.

03. The Transformer-enhanced codec outperforms traditional methods in low-bitrate speech synthesis.

Abstract

Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the…
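
The encoder side of the pipeline described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions: fusing the two feature streams by concatenation and quantizing with a nearest-neighbour codebook are stand-ins chosen here, since the abstract does not specify either detail. All names and values are hypothetical, and the generative adversarial decoder is omitted.

```python
from typing import List, Sequence


def fuse(conv_feat: Sequence[float], tfm_embed: Sequence[float]) -> List[float]:
    """Combine per-frame features by concatenation (assumed fusion)."""
    return list(conv_feat) + list(tfm_embed)


def quantize(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry nearest to the frame."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def encode(conv_feats, tfm_embeds, codebook) -> List[int]:
    """Map each fused frame to a codebook index; only indices are sent."""
    return [quantize(fuse(c, t), codebook)
            for c, t in zip(conv_feats, tfm_embeds)]


# Toy data: 3 frames, 1-D convolutional features, 2-D Transformer embeddings.
codebook = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0],
            [1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
conv_feats = [[0.9], [0.1], [1.2]]
tfm_embeds = [[1.1, 0.8], [-0.1, 0.2], [0.1, -0.9]]

indices = encode(conv_feats, tfm_embeds, codebook)
print(indices)
```

In a trained system the features and codebook would be learned end-to-end, and a decoder would reconstruct the waveform from the transmitted indices.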

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Geophysical Methods and Applications

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Dropout · Label Smoothing