Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, and Jan Skoglund

TL;DR
This paper introduces a neural speech codec that combines a pretrained Transformer with a convolutional encoder. It achieves high-quality speech synthesis at an ultra-low bitrate of 600 bps, with quality comparable to or better than that of conventional codecs operating at higher bitrates.
Contribution
It presents a novel neural speech codec architecture that leverages pretrained Transformers for improved long-range dependency modeling at very low bitrates.
Findings
Achieves speech coding at 600 bps with better synthesized speech quality than the original neural codec without the pretrained Transformer.
Subjective evaluations suggest quality comparable to or better than that of conventional codecs operating at higher bitrates.
Transformer-enhanced codec outperforms traditional methods in low-bitrate speech synthesis.
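To make the 600 bps figure concrete, the following minimal sketch shows how the bitrate of a frame-based codec with a discrete quantizer is usually computed: frames per second times bits per frame. The function name `bitrate_bps`, the frame rate, and the codebook sizes are illustrative assumptions, not values taken from the paper.

```python
import math


def bitrate_bps(frames_per_second, codebook_sizes):
    """Bits per second when each frame transmits one index per codebook."""
    # An index into a codebook with N entries costs log2(N) bits.
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frames_per_second * bits_per_frame


# Hypothetical configuration, chosen only so the arithmetic lands on 600 bps:
# 50 frames/s * (6 + 6) bits per frame.
print(bitrate_bps(50, [64, 64]))
```

The same budget can be reached in different ways, for example with fewer frames per second and larger codebooks, which is where a longer effective receptive field could plausibly help.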
Abstract
Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the…
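The encoder side described in the abstract can be pictured with a small sketch: per-frame features from the convolutional encoder and from the pretrained Transformer are combined, and a quantizer maps the result to a codebook index that is transmitted. Everything here is an assumption made for illustration: the abstract only says the two are used "in tandem", so fusing by concatenation, the nearest-neighbour quantizer, the toy codebook, and the names `encode_frame` and `nearest_code` are hypothetical and not the authors' implementation.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def encode_frame(conv_features, transformer_embedding, codebook):
    """Fuse both feature streams and quantize them to a single index."""
    # Assumption: fusion by simple concatenation (not confirmed by the source).
    fused = list(conv_features) + list(transformer_embedding)
    return nearest_code(fused, codebook)


# Toy codebook with 4 entries, so each frame costs log2(4) = 2 bits.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]
index = encode_frame([0.9, 0.1], [0.2, 0.8], codebook)
print(index)
```

Only the index is sent over the channel; a decoder (a generative adversarial network in the paper) would synthesize the waveform from the looked-up codebook entries.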
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Geophysical Methods and Applications
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Dropout · Label Smoothing
